What is Hadoop? Hadoop is a software platform for developing and running applications that process large-scale data. It is an Apache open-source software framework, implemented in Java, that realizes distributed computing over massive data sets on large clusters of computers.
The two most central designs in the Hadoop framework are HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over that data.
The data-processing flow in Hadoop can be understood simply from the following figure: the data is processed by the Hadoop cluster and the results are obtained.
HDFS: the Hadoop Distributed File System.
A large file is divided into data blocks (64 MB by default) that are distributed across the machines of the cluster.
The file data1 in the following figure is divided into 3 blocks, and the 3 blocks are stored redundantly, as mirrored copies, on different machines; a client-side sketch follows below.
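As a minimal sketch (the NameNode address and file paths are hypothetical, and the legacy fs.default.name key is assumed for a Hadoop 1.x-style cluster), the following Java client copies a large local file into HDFS; the splitting into blocks and the replication across DataNodes are handled by HDFS itself, not by the client.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; adjust to your cluster.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; HDFS splits it into blocks
        // (64 MB by default) and replicates them across DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/data1.log"),
                             new Path("/user/hadoop/data1.log"));
        fs.close();
    }
}
```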
MapReduce: Hadoop creates one map task for each input split; the map function is called on each record of that split in sequence, and the map outputs its results as key-value pairs. Hadoop then groups the map output by key and passes it to reduce as input; the output of the reduce tasks is the output of the entire job and is stored on HDFS.
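To make the map and reduce steps concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API (the input and output paths are supplied as hypothetical command-line arguments): the map emits a (word, 1) pair for every word in its split, and the reduce receives all counts for the same word and sums them.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: called once per record in the split; emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receives all values grouped by key; emits (word, total count).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths for the job input and output.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The final output of the reduce tasks is written back to the output directory on HDFS, as described above.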
A Hadoop cluster is mainly composed of the NameNode, DataNodes, the Secondary NameNode, the JobTracker, and TaskTrackers.
As shown in the following illustration:
The NameNode records how files are split into blocks and which DataNodes those blocks are stored on.
The NameNode also holds the running state information of the file system.
The DataNodes store the actual blocks into which the files were split.
The Secondary NameNode helps the NameNode collect state information about the running file system.
The JobTracker is responsible for job execution when a job is submitted to the Hadoop cluster, and schedules work across multiple TaskTrackers.
Each TaskTracker is responsible for running an individual map or reduce task; a small configuration sketch follows below.
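As a minimal sketch of how a client ties these roles together in a Hadoop 1.x-style cluster (the host names and ports are hypothetical; fs.default.name and mapred.job.tracker are the legacy configuration keys assumed here), the NameNode address tells the client where HDFS metadata lives, and the JobTracker address tells it where to submit jobs for scheduling onto TaskTrackers.

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterConfigSketch {
    // Build a client-side Configuration pointing at the cluster daemons.
    public static Configuration clusterConf() {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address: where file/block metadata is managed.
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        // Hypothetical JobTracker address: where MapReduce jobs are submitted.
        conf.set("mapred.job.tracker", "jobtracker-host:9001");
        return conf;
    }
}
```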