Tuesday, October 8, 2013

What is Hadoop?

Need: As we generate more and more data every day, we need tools that can deal with this huge scale. These tools can be programming languages, software, or infrastructure, and the infrastructure is the foundation for everything else.

Short history: Google, one of the internet giants, faced large-scale computation problems pretty early. Google researchers like Sanjay Ghemawat and Jeff Dean worked on this issue and published two research papers, on the Google File System (2003) and MapReduce (2004), which described how to do large-scale computation in a distributed manner. In 2005, Doug Cutting (who later joined Yahoo) and Mike Cafarella built something based on GFS and MapReduce called Hadoop.



Hadoop basics: The Hadoop file system is called HDFS, the Hadoop Distributed File System. It runs on a group of machines called a cluster, which together form a distributed computing environment. Each individual machine is referred to as a node. HDFS stores data in fixed-size chunks called blocks; the default block size is 64 MB. So if we have 640 MB of data, it will be divided into 10 blocks, which are spread across the data nodes of the cluster. There are two types of nodes: the name node and the data nodes. The name node decides how the data is split and where it is stored, and it also manages the metadata and the directory tree associated with it. The data nodes are where the data chunks are physically stored. In a single cluster we have only one name node and multiple data nodes.
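To make this concrete, here is a minimal sketch using the HDFS Java API. The name node address (hdfs://namenode-host:8020) and the file path (/demo/sample.txt) are placeholders for illustration, not values from any real cluster. The program writes a small file and then asks the name node for the file's length and its block size.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder name node address; point this at your own cluster.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/sample.txt");

        // Write a small file; HDFS splits larger files into blocks automatically.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hadoop");
        }

        // Ask the name node about the file's layout: its length and block size.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("File length (bytes): " + status.getLen());
        System.out.println("Block size (bytes):  " + status.getBlockSize());
        fs.close();
    }
}

On a default 2013-era cluster the reported block size would be 64 MB (67108864 bytes); our tiny file still occupies a single block.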

Pitfalls: Based on the discussion above, the name node acts as the authority or gateway for accessing the data stored on the data nodes. If the name node goes down, we cannot access the data on its data nodes, which is why the name node is called a single point of failure. This can be mitigated by running a standby name node that takes over if the primary goes down (note that the classic "secondary name node" only checkpoints metadata and is not a true backup). Hadoop was also built for batch processing rather than real-time processing. Though a few distributions offer real-time capabilities, it is not available in Apache Hadoop as of now.

How it is used: Now that we have discussed the high-level concepts of the infrastructure, let's see how it is used. Companies dealing with huge amounts of data configure Hadoop clusters and distribute their data across them. For example, the retail giant Walmart runs a 250-node Hadoop cluster. That covers the data storage and distribution part. The next step is to run jobs (programs) on top of it that do the actual computation; MapReduce or similar techniques are used for this, as in the word-count example below.
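Here is a sketch of the classic word-count MapReduce job in Java, close to the standard Hadoop tutorial example: the map phase runs on the data nodes near the blocks and emits a (word, 1) pair for every token, and the reduce phase sums the counts for each word. The input and output paths are passed as command-line arguments and are assumed to be HDFS directories.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the data nodes, close to where the blocks live.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    // Reduce phase: sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The key idea is that the job is shipped to the data instead of the data being shipped to the job: each mapper processes the blocks stored on its own node, and only the much smaller intermediate (word, count) pairs move across the network.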



