1. Hadoop Development History
1.1 Background of Hadoop
Hadoop originated in Nutch, an open-source web search engine started in 2002. Nutch's design goal was to build a large-scale, whole-web search engine covering web crawling, indexing, and querying, but as the volume of data grew it ran into scalability problems. In 2003 Google published its paper on the Google File System (GFS), which describes the storage architecture for the web page data behind Google's search engine; this addressed exactly the problems Nutch had encountered, and the Nutch developers went on to implement their own distributed file system, NDFS (Google had only published the ideas, not the code). In 2004 another Google paper, "MapReduce: Simplified Data Processing on Large Clusters," shocked the world; it described a framework for distributed computing, and again only the ideas were published, so the Nutch developers implemented it themselves. Because of the success of NDFS and MapReduce, in 2006 the Nutch developers moved them out of Nutch into a separate Lucene subproject named Hadoop (supposedly after the toy elephant of Doug Cutting's son). As Hadoop developed, it became a top-level project of the Apache Foundation in 2008, and the other projects of the Hadoop family grew up around it.
1.2 The Source of Hadoop
The ideas behind Hadoop come mainly from Google: its two landmark papers, GFS and MapReduce, played a decisive role, and Google's low-cost approach (no supercomputers, no dedicated storage appliances, just large numbers of commodity PC servers running as redundant clusters, and so on) is also at the root of its success. Google's success also owed a great deal to the PageRank algorithm.
More details about PageRank can be found at: http://blog.csdn.net/v_july_v/article/details/6142146
Hadoop is now developing rapidly and has become the de facto open-source standard for implementing cloud computing. It can already run on clusters of thousands of nodes and handles large volumes of data at high speed, and the projects in the Hadoop family have grown along with it.
2. Hadoop Architecture
2.1 The Two Pillars of Hadoop: HDFS and MapReduce
HDFS provides distributed storage for large-scale data, while MapReduce is built on top of HDFS and performs distributed computation over that data.
2.2 HDFS Architecture
HDFS is a highly fault-tolerant distributed file system designed to be deployed on low-cost machines. It provides high-throughput data access and is well suited to applications with large data sets. Its main components are the Client, the NameNode, the SecondaryNameNode, and the DataNodes (a minimal client-side example follows the list below).
Client: accesses HDFS files interactively; it obtains metadata from the NameNode and then reads or writes the actual data on the DataNodes.
NameNode: the HDFS master daemon, a single "master" node. It records how files are split into data blocks and which nodes store each block, and it centrally manages memory and I/O for the file system metadata.
SecondaryNameNode: an auxiliary daemon that monitors the state of HDFS. There is one per cluster; it communicates with the NameNode and periodically saves snapshots of the HDFS metadata, which can serve as a backup if the NameNode fails.
DataNode: responsible for reading and writing HDFS data blocks on the local file system and for periodically reporting its block information to the NameNode. DataNodes manage data in fixed-size blocks (64 MB by default) as the basic unit.
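To make the interaction between the Client, the NameNode, and the DataNodes more concrete, here is a minimal sketch of reading a file through the HDFS Java API. The NameNode address hdfs://namenode-host:9000 and the file path /user/demo/input.txt are assumptions for illustration only, and the exact property name (fs.defaultFS versus the older fs.default.name) varies slightly between Hadoop versions.

// Minimal HDFS read sketch; host, port, and path are illustrative assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in a real cluster this comes from core-site.xml.
        // Older Hadoop versions use the property name "fs.default.name" instead.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // The client asks the NameNode where the file's blocks live,
        // then streams the block data directly from the DataNodes.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/input.txt");

        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}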
2.3 MapReduce Architecture
Like HDFS, MapReduce also uses a master/slave model and consists of the following components: Client, JobTracker, TaskTracker, and Task.
JobTracker: responsible for resource monitoring and job scheduling. It runs on the master node (together with the HDFS master), decides which files take part in processing, splits the job into tasks and assigns them to nodes, restarts failed tasks, and so on.
TaskTracker: located on the slave nodes, working alongside the DataNode. It manages the tasks on its own node and interacts with the JobTracker, periodically reporting resource usage and task progress via heartbeat messages. A TaskTracker divides the resources on its node into equal-sized "slots" (the number of slots is decided by the user), split into map slots and reduce slots, which are used by map tasks and reduce tasks respectively.
Task: a map task or a reduce task, started by the TaskTracker.
The map task process: the map task first parses its corresponding split (the basic MapReduce processing unit) into key/value pairs and iterates over them, calling the map function on each pair, then saves the temporary results to the local disk. This temporary data is divided into several partitions, and each partition is handled by one reduce task.
The reduce task process: the reduce task reads its partitions from the map outputs, sorts the intermediate results, reads them sequentially while invoking the user's own reduce function, and saves the final results to HDFS.
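To show where a map task and a reduce task fit in practice, here is the classic WordCount example written against the Hadoop MapReduce Java API. This is only a sketch: the class names follow the usual Hadoop tutorial style, the input and output paths come from the command line, and small details such as Job.getInstance versus new Job(conf) differ between Hadoop versions.

// WordCount sketch: map tasks emit (word, 1) pairs, reduce tasks sum them per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: each split is parsed into (offset, line) pairs; the map function
    // emits (word, 1) pairs, which are partitioned by key for the reduce tasks.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: reads its partition of the sorted map output, sums the counts
    // for each word, and writes the result back to HDFS.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job is typically packaged into a jar and submitted with a command like hadoop jar wordcount.jar WordCount <input> <output> (paths here are placeholders); the JobTracker then splits the input, schedules map tasks on the TaskTrackers, and the reduce tasks write the final counts back to HDFS.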
Reference books: Hadoop in Action; Hadoop Technology Insider: In-Depth Analysis of MapReduce Architecture Design and Implementation Principles.
Dataguru: http://www.dataguru.cn/forum.php