1. MapReduce definition
MapReduce in Hadoop is a software framework for writing applications that run on large clusters of thousands of commodity machines and process terabytes of data in parallel in a reliable, fault-tolerant way.
2. MapReduce Features
Why is MapReduce so popular? Especially in the "Internet+" era, many Internet companies use MapReduce. The reason is that it has several notable features.
- MapReduce is easy to program. By implementing just a few interfaces you get a complete distributed program that can run on a large number of inexpensive PC machines. Writing a distributed MapReduce program feels much the same as writing a simple serial program, which is a large part of why MapReduce programming became so popular (a minimal sketch follows this list).
- Good scalability. When your computing resources are no longer sufficient, you can expand computing power simply by adding machines.
- High fault tolerance. MapReduce was designed so that programs can be deployed on cheap PC machines, which requires a high level of fault tolerance. For example, if one machine goes down, its computing tasks can be moved to another node so that the job does not fail; this requires no manual intervention and is handled entirely by Hadoop internally.
- Suited to offline processing of massive data sets at the petabyte scale and above. The emphasis is on offline: MapReduce is suitable for offline (batch) processing and not for online processing. For example, returning a result within milliseconds is something MapReduce struggles to do.
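To give a concrete flavor of the "implement a couple of interfaces and you are done" point, here is a minimal word-count Mapper sketched with the `org.apache.hadoop.mapreduce` API. It is an illustration, not code from the article; the class name is made up, and only the map logic is user code, while distribution, scheduling, and fault tolerance are handled by the framework.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A minimal sketch: the user only writes the map logic, the framework
// takes care of distributing the work across the cluster.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```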
MapReduce has many strengths, but there are also things it is not good at. "Not good at" does not mean it cannot do them, but that in some scenarios it performs poorly and is not the right tool. The main cases are the following.
- Real-time computation. MapReduce cannot return results within milliseconds or a few seconds the way MySQL can.
- Streaming computation. In streaming computation the input data arrives continuously, whereas the input data set of a MapReduce job is static and cannot change while the job runs; the design of MapReduce requires the data source to be static.
- DAG (directed acyclic graph) computation. Multiple applications depend on one another, with the output of one job serving as the input of the next. MapReduce can express this, but every MapReduce job writes its output to disk, which causes a great deal of disk I/O and therefore very low performance (see the sketch below).
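To make the disk I/O point concrete, here is a minimal sketch (not from the original article) of chaining two MapReduce jobs with the `org.apache.hadoop.mapreduce` API. The mapper/reducer classes are omitted and the intermediate path name is hypothetical; the point is only that the first job's output must be materialized on HDFS before the second job can read it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // written to HDFS between jobs
        Path output = new Path(args[2]);

        // Job 1: its output must be fully written to HDFS ...
        Job first = Job.getInstance(conf, "stage-1");
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1);

        // Job 2: ... before it can be read back as the input of the next stage.
        Job second = Job.getInstance(conf, "stage-2");
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```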
3. The architecture of MapReduce
Like HDFS, MapReduce also uses a master/slave architecture.
MapReduce consists of four components: Client, JobTracker, TaskTracker, and Task. Each component is described in detail below.
1) Client
Each job uses the Client class to package the application and its configuration parameters into a JAR file, store it in HDFS, and submit the JAR path to the JobTracker master service; the JobTracker then creates the individual tasks (i.e., MapTasks and ReduceTasks) and distributes them to the TaskTracker services for execution.
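As an illustration only (not part of the original article), the following sketch shows client-side submission with the classic MRv1 API (`JobConf` plus `JobClient.runJob`), which matches the JobTracker/TaskTracker architecture described here. The mapper and reducer classes are omitted, so Hadoop's identity defaults would apply; input and output paths are taken from the command line.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        // JobConf carries the configuration parameters that the client packages
        // together with the application JAR before submitting to the JobTracker.
        JobConf conf = new JobConf(SubmitJob.class);
        conf.setJobName("example-submission");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // runJob() submits the job to the JobTracker and polls its progress
        // until the map and reduce tasks have completed on the TaskTrackers.
        JobClient.runJob(conf);
    }
}
```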
2) JobTracker
JobTracker is responsible for resource monitoring and job scheduling. It monitors the health of all TaskTrackers and jobs; when a failure is detected, it moves the affected tasks to other nodes. It also tracks task progress and resource usage and reports this information to the Task Scheduler, which assigns idle resources to appropriate tasks. In Hadoop, the Task Scheduler is a pluggable module, so users can design a scheduler that suits their own needs.
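As a hedged illustration of the pluggable scheduler: in Hadoop 1.x the scheduler class is selected on the JobTracker, normally in mapred-site.xml rather than in code. The property name and the FairScheduler class below are assumptions for an MRv1 cluster and may differ across versions; the snippet only demonstrates the idea in Java form.

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerConfigExample {
    public static void main(String[] args) {
        // Assumed MRv1 property for choosing the pluggable Task Scheduler;
        // in practice this is set in mapred-site.xml on the JobTracker node.
        Configuration conf = new Configuration();
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");
        System.out.println("scheduler = " + conf.get("mapred.jobtracker.taskScheduler"));
    }
}
```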
3) TaskTracker
TaskTracker periodically reports the resource usage on its node and the progress of its tasks to the JobTracker via heartbeats, and it receives commands from the JobTracker and performs the corresponding actions (such as starting new tasks or killing tasks). TaskTracker divides the resources on its node (CPU, memory, and so on) into equal-sized units called "slots". A task can run only after it has been assigned a slot, and the Hadoop scheduler assigns the free slots on each TaskTracker to tasks. Slots come in two kinds, map slots and reduce slots, used by map tasks and reduce tasks respectively. TaskTracker limits the concurrency of tasks through the number of slots, which is a configurable parameter.
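To illustrate how slot counts are configured, the sketch below sets the two MRv1 properties that cap the number of map and reduce slots per TaskTracker. In practice these live in mapred-site.xml on each worker node rather than in job code; the property names are what I believe Hadoop 1.x uses and the values are arbitrary examples, so treat both as assumptions.

```java
import org.apache.hadoop.conf.Configuration;

public class SlotConfigExample {
    public static void main(String[] args) {
        // Assumed MRv1 property names capping map/reduce slots per TaskTracker;
        // normally configured in mapred-site.xml on each worker node.
        Configuration conf = new Configuration();
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.printf("map slots = %s, reduce slots = %s%n",
                conf.get("mapred.tasktracker.map.tasks.maximum"),
                conf.get("mapred.tasktracker.reduce.tasks.maximum"));
    }
}
```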
4) Task
Tasks are divided into map tasks and reduce tasks, both of which are started by the TaskTracker. HDFS stores data in fixed-size blocks as its basic unit, whereas the processing unit of MapReduce is the split.
A map task executes as follows: it first parses its split, iterating over it to produce key/value pairs, calls the user-defined map() function on each pair, and then stores the intermediate results on the local disk. The intermediate data is divided into several partitions, and each partition is processed by one reduce task (a partitioner sketch follows).
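The partitioning step can be illustrated with a custom Partitioner (not from the article; the class name is made up). It mimics the behavior of Hadoop's default hash partitioning: each intermediate (key, value) pair is assigned to exactly one of the reduce tasks.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each intermediate (key, value) pair,
// i.e. the "partition" step of the map task's local output described above.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same idea as Hadoop's default hash partitioning: hash the key and
        // map it onto one of the numReduceTasks partitions.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```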
A reduce task executes in three stages:
① it reads the intermediate map-task results from remote nodes (the "shuffle" phase);
② it sorts the key/value pairs by key (the "sort" phase);
③ it reads each <key, value list> group, calls the user-defined reduce() function to process it, and saves the final result to HDFS (the "reduce" phase).
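A minimal word-count Reducer, sketched here for illustration (class name assumed, not from the article), shows what the reduce phase receives after the shuffle and sort phases: one key together with the list of values collected for it.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// After shuffle and sort, the framework hands the reducer one key together
// with all of its values; the reduce phase below sums the counts, and the
// OutputFormat writes the result to HDFS.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);   // e.g. (a, 2) in the wordcount example
    }
}
```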
4. MapReduce Internal Logic
Below we walk through MapReduce's data processing flow by looking at its internal logic, using wordcount as an example.
The internal logic of MapReduce roughly proceeds through the following steps.
1. First, the data in HDFS is split and used as the input of MapReduce. As mentioned earlier, data in HDFS is stored in blocks, so how do blocks become splits? "Block" is an HDFS term, while "split" is a MapReduce term. By default one split corresponds to one block, but a split can also span multiple blocks; the correspondence between them is determined by the InputFormat. The default is TextInputFormat, in which case one split corresponds to one block. Suppose there are 4 blocks, giving 4 splits: Split0, Split1, Split2, and Split3. The InputFormat then reads the data in each split, parses it into (key, value) pairs, and hands them to the user-written Mapper function for processing.
2. Each Mapper parses its input (key, value) records into words and word counts, emitting pairs such as (a,1), (b,1), and (c,1).
3. The output of each Mapper is partitioned by key, with each partition destined for one reduce task, as described above for the map task's local output.
4. In the reduce phase, each reducer shuffles (reads) the partitions that belong to it. Once all the data has been read, it is sorted and then handed to the Reducer for aggregation. For example, the first reducer reads two (a,1) key-value pairs and produces the result (a,2).
5. The Reducer's results are written to file paths on HDFS in the format specified by the OutputFormat. The default OutputFormat is TextOutputFormat: the key is the word, the value is the word count, and key and value are separated by a tab character ("\t"). In the example, (a, 2) is written to part-0, (b, 3) to part-1, and (c, 3) to part-2. A driver sketch tying these steps together follows this list.
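For completeness, here is a hedged sketch of a wordcount driver using the `org.apache.hadoop.mapreduce` API, wiring together the InputFormat, the Mapper and Reducer from the sketches above, the number of reducers, and the OutputFormat. The class names (WordCountJob, WordCountMapper, WordCountReducer) are illustrative, not from the article, and TextInputFormat/TextOutputFormat are set explicitly only to make the flow visible, since they are already the defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountJob.class);

        // InputFormat decides how blocks become splits and records;
        // TextInputFormat yields (byte offset, line) pairs, one split per block by default.
        job.setInputFormatClass(TextInputFormat.class);
        // TextOutputFormat writes "key <tab> value" lines, one part file per reducer.
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(WordCountMapper.class);     // from the mapper sketch above
        job.setReducerClass(WordCountReducer.class);   // from the reducer sketch above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(3);                      // three reducers -> three output partitions

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With three reducers, the job produces three part files on HDFS, matching the three output partitions in the wordcount example above.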