Hadoop is an Apache project. It consists of HDFS, MapReduce, HBase, Hive, ZooKeeper, and other members; HDFS and MapReduce are the two most fundamental and important ones.
HDFS is an open-source implementation of Google's GFS. It is a highly fault-tolerant distributed file system that provides high-throughput data access and is suited to storing massive (PB-scale) data in large files (usually larger than 64 MB). It works as follows:
HDFS uses a Master/Slave structure. The NameNode maintains the cluster's metadata and provides operations for creating, opening, deleting, and renaming files and directories. DataNodes store the actual data and handle read/write requests. Each DataNode periodically reports a heartbeat to the NameNode, and the NameNode controls the DataNodes through its replies to those heartbeats.
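The heartbeat mechanism above can be sketched in a few lines. This is a hypothetical, single-process illustration of how a NameNode might track DataNode liveness; the class, method names, and the 30-second timeout are assumptions for illustration, not Hadoop's actual implementation or defaults.

```python
import time

# Illustrative timeout: a DataNode whose last heartbeat is older than this
# is considered dead (Hadoop's real interval/timeout values differ).
HEARTBEAT_TIMEOUT = 30.0  # seconds

class NameNode:
    """Hypothetical sketch of NameNode-side heartbeat bookkeeping."""

    def __init__(self):
        self.last_heartbeat = {}  # datanode id -> timestamp of last heartbeat

    def receive_heartbeat(self, datanode_id, now=None):
        """Record a heartbeat; in HDFS the reply carries commands
        (e.g. replicate or delete blocks), which is how the NameNode
        controls DataNodes."""
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def live_datanodes(self, now=None):
        """DataNodes whose last heartbeat arrived within the timeout window."""
        now = now if now is not None else time.time()
        return [dn for dn, ts in self.last_heartbeat.items()
                if now - ts <= HEARTBEAT_TIMEOUT]

nn = NameNode()
nn.receive_heartbeat("dn1", now=100.0)
nn.receive_heartbeat("dn2", now=60.0)
print(nn.live_datanodes(now=105.0))  # → ['dn1']  (dn2 timed out)
```

A real NameNode also rebuilds its block map from DataNode block reports, but the liveness logic follows this same pattern.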
InfoWorld named MapReduce one of the top 10 emerging technologies of 2009. MapReduce is a powerful tool for large-scale (TB-scale) data computation. Its main ideas, map and reduce, are borrowed from functional programming languages, as shown in the following figure:
Map is used to scatter data, and reduce is used to aggregate it. You only need to implement the map and reduce interfaces to compute over TB-scale data. Common applications include log analysis, data mining, and other data-analysis workloads. It can also be used for scientific computation, such as estimating the value of pi.
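The map/reduce division of labor is easiest to see in the classic word-count example. Below is a minimal single-process sketch of the programming model in Python; the function names are illustrative (a real Hadoop job implements Mapper and Reducer classes in Java and runs across a cluster).

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Map: scatter — emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: aggregate — sum all counts collected for one word."""
    return (word, sum(counts))

def run_job(lines):
    # Map phase: apply map_fn to every input record.
    pairs = [kv for line in lines for kv in map_fn(line)]
    # Shuffle/sort phase: group intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(key, (v for _, v in group))
                for key, group in groupby(pairs, key=itemgetter(0)))

print(run_job(["the quick fox", "the lazy dog"]))
# → {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The framework supplies everything except `map_fn` and `reduce_fn` (input splitting, shuffling, fault tolerance), which is why implementing just those two interfaces is enough.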
Hadoop's MapReduce implementation also adopts the Master/Slave structure: the master is called the JobTracker, and the slaves are called TaskTrackers.
A computation submitted by a user is called a job, and each job is divided into several tasks. The JobTracker is responsible for scheduling jobs and tasks, while the TaskTrackers are responsible for executing the tasks.
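The job-to-task split and scheduling described above can be sketched as follows. This is a hypothetical illustration: the names and the round-robin policy are assumptions for clarity, not Hadoop's actual scheduler (which considers data locality, slots, and failures).

```python
def split_into_tasks(job_name, num_splits):
    """A job is divided into several tasks, typically one per input split."""
    return [f"{job_name}-task{i}" for i in range(num_splits)]

def schedule(tasks, tasktrackers):
    """JobTracker-style assignment of tasks to TaskTrackers, round-robin.
    Returns a mapping of tasktracker -> list of tasks to execute."""
    assignment = {tt: [] for tt in tasktrackers}
    for i, task in enumerate(tasks):
        assignment[tasktrackers[i % len(tasktrackers)]].append(task)
    return assignment

tasks = split_into_tasks("wordcount", 4)
print(schedule(tasks, ["tt1", "tt2"]))
# → {'tt1': ['wordcount-task0', 'wordcount-task2'],
#    'tt2': ['wordcount-task1', 'wordcount-task3']}
```

In real Hadoop the TaskTrackers pull work via their heartbeat replies rather than being pushed to, mirroring the NameNode/DataNode pattern in HDFS.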