What is Hadoop: Hadoop is a software platform for developing and running large-scale data processing applications. It is an open-source framework from Apache, implemented in Java, that enables distributed computing over massive data on a cluster composed of a large number of computers.
The two core designs in the Hadoop framework are HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides the computation over that data.
The way Hadoop processes data can be understood simply as follows: the input data is distributed across the Hadoop cluster, processed in parallel, and the result is collected.
HDFS: the Hadoop Distributed File System, Hadoop's distributed storage layer.
Large files are divided into data blocks of 64 MB by default and stored across the machines in the cluster.
For example, the file data1 is divided into three blocks, which are distributed across different machines as redundant replicas.
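The following is a minimal sketch of writing a file to HDFS through the Java FileSystem API, passing an explicit replication factor and block size that mirror the defaults described above. The class name and the path /user/demo/data1 are illustrative assumptions, not part of the original article.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; 64 MB block size and 3 replicas match the
        // defaults mentioned in the text.
        Path file = new Path("/user/demo/data1");
        long blockSize = 64L * 1024 * 1024;
        short replication = 3;

        // The file is split into blocks of the given size and each block is
        // replicated across different DataNodes.
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}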
MapReduce: Hadoop creates a task for each input split and invokes Map computation on it; within that task, records are processed in sequence. The map emits its results as key-value pairs. Hadoop sorts the map output by key and passes it to Reduce as input; the output of the reduce tasks is the output of the entire job and is saved on HDFS.
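As a concrete illustration of this flow, below is a hedged word-count sketch using the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) key-value pairs, Hadoop sorts them by key, and the reducer sums the counts for each word. The class names WordCountMapper and WordCountReducer are assumptions for this example, not something prescribed by the article.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per record of the input split; emits (word, 1) pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit key-value pair
        }
    }
}

// Hadoop sorts the map output by key; the reducer receives each word with
// all of its counts and writes the summed total, which ends up on HDFS.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}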
A Hadoop cluster is mainly composed of a NameNode, DataNodes, a Secondary NameNode, a JobTracker, and TaskTrackers.
The roles of these components are as follows:
The NameNode records how files are split into blocks and which DataNode nodes store those blocks.
The NameNode also keeps the running state of the file system.
DataNode stores the split blocks.
The Secondary NameNode helps the NameNode by periodically checkpointing the file system's status information.
When a job is submitted to a Hadoop cluster, the JobTracker is responsible for running the job and scheduling tasks onto multiple TaskTrackers.
Each TaskTracker is responsible for executing individual map or reduce tasks.
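To tie the components together, here is a hedged sketch of a job driver: the client submits the job to the cluster, the JobTracker schedules the map and reduce tasks onto TaskTrackers, and the final output is written to a directory on HDFS. It reuses the WordCountMapper and WordCountReducer classes sketched earlier; the input and output paths are illustrative assumptions, and the Job.getInstance idiom assumes a reasonably recent Hadoop release.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper/Reducer classes from the sketch above (hypothetical names).
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths for the job's input and output.
        FileInputFormat.addInputPath(job, new Path("/user/demo/data1"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submits the job to the cluster and blocks until it finishes;
        // the result is saved under the output directory on HDFS.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}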