Hadoop: A Detailed Explanation of the Working Mechanism of MapReduce

In recent Hadoop releases (version 2.7 and above), the jobtracker and tasktracker modules no longer appear in the console. This does not mean they have disappeared; their functions have been absorbed into the YARN framework, where they are integrated and optimized. Even so, understanding the principles and characteristics of MapReduce as it runs on these components helps in understanding the latest Hadoop and this efficient distributed parallel framework.

The storage and processing of big data are, like a person's right hand, especially important. Hadoop is well suited to solving big data problems, and it relies heavily on its big data storage system, HDFS, and its big data processing system, MapReduce. About the latter, there are a few questions worth answering.

What is MapReduce?
Hadoop MapReduce is a software framework that makes it easy to write applications that run on large clusters of thousands of commodity machines and process massive, terabyte-scale data sets in parallel in a reliable, fault-tolerant manner. This definition contains several keywords: first, a software framework; second, parallel processing; third, reliable and fault-tolerant; fourth, large-scale clusters; fifth, massive data sets. So MapReduce can be succinctly described as a software framework whose "dish" is massive data, and which "cooks" that dish in parallel, reliably and fault-tolerantly, on a large cluster. Writing this, the author sincerely marvels at the greatness of the idea, the magic of decomposition, and the ingenuity of merging.

For example, when the final exams are over and the papers have been graded by different teachers, and you then want to know the highest score in the whole year group, you can do this:

1) Each teacher sorts the scores of all the papers they have graded (map): => (course, [score1, score2, ...])
2) Each teacher reports the highest score for each course to the department head (shuffle)
3) The department head tallies the results (reduce): => (course, highest_score)

Of course, if several courses are mixed together, the department head's workload becomes too heavy, so the deputy head steps in as well (equivalent to two reduces). When the teachers report their highest scores, results for the same course must be reported to the same person (the same key goes to the same reduce): for example, Math and English go to the head, while Politics goes to the deputy head.
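Translated into code, the analogy looks roughly like the minimal Java sketch below, using the org.apache.hadoop.mapreduce API. It is only an illustration under assumptions not stated in the article: each input line is assumed to look like "course,score", the map function emits (course, score) like a teacher sorting papers, and the reduce function keeps the maximum like the department head.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only: input lines are assumed to look like "math,87".
public class MaxScore {

    public static class ScoreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            if (parts.length == 2) {
                // Emit (course, score) -- one record per graded paper.
                context.write(new Text(parts[0].trim()),
                        new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }
    }

    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // The shuffle has already grouped all scores of the same course here.
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max)); // (course, highest_score)
        }
    }
}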


What can MapReduce do?
MapReduce is good at handling big data. Why does it have this capability? The answer lies in its design idea: divide and conquer. The Mapper is responsible for "dividing", that is, breaking a complex task down into several "simple tasks" to process. "Simple task" has three implications: first, the scale of the data or computation is much smaller than in the original task; second, computation happens near the data, that is, each task is assigned to the node that stores the data it needs; third, these small tasks can be computed in parallel with almost no dependencies on each other. The Reducer is responsible for summarizing the results of the map phase. As for how many Reducers are needed, the user can set the value of mapred.reduce.tasks in the mapred-site.xml configuration file according to the specific problem; the default value is 1.
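For reference, the corresponding entry in mapred-site.xml might look like the snippet below. This is an assumed example raising the default from 1 to 4; mapred.reduce.tasks is the classic pre-YARN property name used in this article (the YARN-era equivalent is mapreduce.job.reduces).

<!-- mapred-site.xml: default number of reduce tasks per job (assumed example value) -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>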

The entire working process of MapReduce involves the following four independent entities.
Entity 1: The client, which submits MapReduce jobs.
Entity 2: The jobtracker, which coordinates the running of the job.
Entity 3: The tasktracker, which runs the tasks the job has been split into.
Entity 4: HDFS, which is used to share job files between the other entities.
Looking at the MapReduce workflow, the whole process moves in order through the following steps.
Step 1: Job submission
Step 2: Job initialization
Step 3: Task assignment
Step 4: Task execution
Step 5: Progress and status updates
Step 6: Job completion
For details of what happens at each step, see the discussion of the MapReduce working mechanism in Chapter 6 of Hadoop: The Definitive Guide. For users, processing big data with MapReduce means writing MapReduce applications for their own needs, so how to develop programs with the MapReduce framework is something that requires deep thought and continuous practice.
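As a starting point for such an application, a client-side driver that submits a job (step 1 above) might look like the following sketch. The mapper and reducer classes refer to the illustrative MaxScore example earlier, and the input/output paths are placeholders, not anything prescribed by the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxScoreDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max score"); // the client builds the job description
        job.setJarByClass(MaxScoreDriver.class);
        job.setMapperClass(MaxScore.ScoreMapper.class);
        job.setReducerClass(MaxScore.MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        // Submit to the cluster, poll progress, and exit when the job finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}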

What are the features of MapReduce?
• MapReduce abstracts the complex parallel computation that runs on a large cluster into two functions: Map and Reduce.
• It is easy to program: without mastering the details of distributed parallel programming, you can run your own program on a distributed system and compute over massive data.
• MapReduce uses a "divide and conquer" strategy: a large data set stored in the distributed file system is split into many independent splits that can be processed in parallel by multiple Map tasks.
• One of its design ideas is to "move computation close to the data" rather than "move data close to the computation", because moving data incurs heavy network transmission overhead.
• The MapReduce framework uses a Master/Slave architecture with one Master and several Slaves: the JobTracker runs on the Master, and a TaskTracker runs on each Slave.
• The Hadoop framework is implemented in Java, but MapReduce applications do not have to be written in Java.

How MapReduce works
How does MapReduce handle big data? Users carry out big data computations by writing MapReduce applications. Since MapReduce is meant to process big data, how does a MapReduce program actually run? This is the third question: the working mechanism of MapReduce.

MapReduce mainly consists of the following four parts:

1) Client
• The user-written MapReduce program is submitted to the JobTracker via the Client.
• Users can check the running status of the job through interfaces provided by the Client (see the sketch after this list).
2) JobTracker
• The JobTracker is responsible for resource monitoring and job scheduling.
• The JobTracker monitors the health of all TaskTrackers and jobs; if one fails, it moves the corresponding tasks to other nodes.
• The JobTracker tracks task progress, resource usage and other information and passes it to the task scheduler (TaskScheduler), which picks suitable tasks to use resources as they become idle.
3) TaskTracker
• The TaskTracker periodically reports its node's resource usage and task progress to the JobTracker via a "heartbeat", receives commands from the JobTracker, and carries them out (starting new tasks, killing tasks, and so on).
• The TaskTracker divides its node's resources (CPU, memory, etc.) into "slots". A task gets a chance to run only after it obtains a slot, and the Hadoop scheduler assigns the idle slots on each TaskTracker to tasks. Slots are divided into Map slots and Reduce slots, used by Map Tasks and Reduce Tasks respectively.
4) Task
Tasks are divided into Map Tasks and Reduce Tasks, both started by the TaskTracker.
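As a hedged illustration of those client-side status interfaces, the sketch below polls a Job that was built as in the driver example earlier but submitted asynchronously with job.submit() instead of the blocking waitForCompletion.

import org.apache.hadoop.mapreduce.Job;

public class JobProgressWatcher {
    // Polls a submitted Job and prints map/reduce progress until it finishes.
    public static void watch(Job job) throws Exception {
        while (!job.isComplete()) {
            System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println("Job " + (job.isSuccessful() ? "succeeded" : "failed"));
    }
}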

At the core of all this are the mapper and reducer: the classes that implement the job's specific map and reduce functions, as shown in the following table.

Function | Input | Output
Map | <k1, v1>, e.g. <line number, "a b c"> | List(<k2, v2>), e.g. <"a", 1>, <"b", 1>, <"c", 1>
Reduce | <k2, List(v2)>, e.g. <"a", <1, 1, 1>> | <k3, v3>, e.g. <"a", 3>

MapTask parallelism decision mechanism
The degree of parallelism of the map tasks determines the concurrency of processing in the map phase, which in turn affects the processing speed of the whole job. The parallelism of a job's map phase is decided by the client when the job is submitted, and the client's basic logic for planning it is: logically slice the data to be processed (that is, divide it into splits according to a particular split size), and then assign one parallel mapTask instance to each split.
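From the client side, the planned split size (and therefore the number of map tasks) can be influenced with the standard FileInputFormat knobs. The sketch below is only illustrative; the 128 MB figure is an assumed example value, not a recommendation from the article.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
    // Caps each logical split at roughly 128 MB, so a 1 GB input would be planned
    // as about 8 splits, i.e. about 8 parallel map task instances.
    public static void configureSplits(Job job) {
        long targetSplitBytes = 128L * 1024 * 1024;
        FileInputFormat.setMaxInputSplitSize(job, targetSplitBytes);
        FileInputFormat.setMinInputSplitSize(job, 1L);
    }
}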

ReduceTask parallelism decision mechanism
The degree of parallelism of the reduce tasks also affects the concurrency and efficiency of the whole job, but unlike the number of map tasks, which is determined by the number of splits, the number of reducetasks can be set manually:
// The default value is 1; here it is manually set to 4
job.setNumReduceTasks(4);
If the data is unevenly distributed, data skew can appear in the reduce phase. Note that in some cases, such as computing a single global summary, there can be only one reducetask. Try not to run too many reduce tasks; for most jobs, a good number of reduce tasks is about the same as the number of reduce slots in the cluster.

MapReduce shuffle mechanism
In MapReduce, how the data produced by the map phase is passed to the reduce phase is the most critical process in the framework; this process is called the shuffle.
Concretely, the output of each maptask is distributed to the reducetasks, and during this distribution the data is partitioned and sorted by key.
  • Overall, the map output goes through three operations:
  • Partition: partitioning the output by key
  • Sort: sorting by key
  • Combiner: merging values locally (see the sketch after this list)
  • The Reduce task asks the JobTracker via RPC whether the Map tasks have finished; once they have, it fetches their output.
  • Data pulled by Reduce from the different Map machines is first put into a cache, merged there, merged again, and written to disk.
  • Multiple spill files are merged into one or more large files, and the key-value pairs in those files are sorted.
  • When the amount of data is small, it does not need to be spilled to disk; it is merged directly in the cache and then handed to Reduce.
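Two of these shuffle hooks are set directly on the Job: a combiner for local merging on the map side, and a partitioner that decides which reduce task each key goes to. The sketch below is illustrative only; it reuses the word-count classes from the earlier sketch and routes keys by first letter, much like sending different courses to the head and deputy head in the exam analogy.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class ShuffleHooks {

    // Illustrative partitioner: keys starting with a-m go to reducer 0, the rest to reducer 1.
    public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (numPartitions < 2) {
                return 0;
            }
            String s = key.toString();
            char c = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
            return (c <= 'm') ? 0 : 1;
        }
    }

    public static void configureShuffle(Job job) {
        job.setCombinerClass(WordCount.IntSumReducer.class); // local merge on the map side
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setNumReduceTasks(2); // one partition per reduce task
    }
}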