The MapReduce of Hadoop

Source: Internet
Author: User
Tags: hadoop mapreduce

Abstract: MapReduce is another core module of Hadoop. This article introduces MapReduce from three angles: what MapReduce is, what MapReduce can do, and how MapReduce works.

Keywords: Hadoop MapReduce distributed processing

In the face of big data, storage and processing are like a person's left and right hands: both are indispensable. Hadoop is well suited to solving big data problems, and it relies heavily on its big data storage system, HDFS, and its big data processing system, MapReduce. For HDFS, you can refer to the author's earlier article, "The HDFS of Hadoop". For MapReduce, we approach it through the following three questions.

Question one: What is MapReduce?

Question two: What can MapReduce do?

Question three: How does MapReduce work?

For the first question, we cite the Apache Foundation's introduction to MapReduce: "Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner." In other words, Hadoop MapReduce is a software framework that makes it easy to write applications that run in parallel on large clusters of thousands of commodity machines, processing multi-terabyte data sets in a reliable, fault-tolerant way. This definition contains five keywords: first, software framework; second, parallel processing; third, reliable and fault-tolerant; fourth, large-scale cluster; fifth, massive data sets. So, put simply, MapReduce is a software framework; massive data is its "dish", and it "cooks this dish" on a large cluster, in parallel, in a reliable and fault-tolerant way. Writing this, the author cannot help marveling at the greatness of the idea: the magic of decomposition and the ingenuity of merging.

Knowing what MapReduce is, the second question becomes clear: what can MapReduce do? In a nutshell, it can do big data processing. By "big data processing" we mean value-oriented processing, mining, and optimization of massive data sets.

MapReduce is good at dealing with big data. Why does it have this ability? The answer can be found in its design idea: "divide and conquer". The Mapper is responsible for "dividing", that is, breaking a complex task down into a number of "simple tasks" to process. "Simple task" carries three meanings: first, the scale of the data or computation is much smaller than that of the original task; second, following the principle of moving computation close to the data, each task is assigned to a node that holds the data it needs; third, these small tasks can be computed in parallel, with almost no dependencies among them. The Reducer is responsible for merging the results of the map phase. As for how many Reducers are required, the user can set the mapred.reduce.tasks parameter in the mapred-site.xml configuration file according to the specific problem; the default value is 1. The sketch below illustrates the two roles.
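To make the dividing and merging concrete, here is a minimal sketch of the canonical word-count example. The class names and the whitespace tokenization are our own illustrative choices, and the code assumes the newer org.apache.hadoop.mapreduce Java API:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: the "dividing" half. Each map() call sees one line of input,
// independent of all others, so map tasks can run in parallel on the
// nodes that hold the data blocks.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}

// Reducer: the "merging" half. The framework groups all values for a key
// together, so each reduce() call sums the counts for a single word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total); // emit (word, total)
    }
}
```

Each map task works on one split of the input independently, and the framework sorts and groups the (word, 1) pairs by key before handing them to the reduce tasks; this shuffle is what lets the "simple tasks" stay independent of one another.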

How does MapReduce actually deal with big data? Users write MapReduce applications to operate on it. Since it is the MapReduce program that does the processing, how does such a program run? This is the third question: the working mechanism of MapReduce.

The entire working process of MapReduce involves the following four independent entities.

Entity one: the client, which submits the MapReduce job (see the driver sketch after this list).

Entity two: the JobTracker, which coordinates the running of the job.

Entity three: the TaskTrackers, which run the tasks that the job has been divided into.

Entity four: HDFS, which is used to share job files among the other entities.
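To make entity one concrete, here is a minimal client-side driver sketch for the word-count job above. It assumes the org.apache.hadoop.mapreduce API of Hadoop 2.x and later (on the 1.x releases covered by the tutorials cited below, job setup differs slightly), and the class name WordCountDriver is an illustrative choice:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The client-side driver: configures the job and submits it to the
// cluster, which is exactly the role of "entity one" above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The Mapper and Reducer from the earlier sketch.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Programmatic equivalent of the mapred.reduce.tasks setting
        // mentioned earlier (default 1).
        job.setNumReduceTasks(1);

        // Input and output live on HDFS ("entity four"), which is also
        // where the framework shares job files among the entities.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job and then polls it until it
        // finishes, printing progress -- the submission, initialization,
        // assignment, execution, status-update and completion steps
        // listed below.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical invocation would be along the lines of hadoop jar wordcount.jar WordCountDriver <input> <output>, with both paths on HDFS.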

Reviewing the workflow of MapReduce, we can see that the whole working process contains the following steps, in order.

Step one: job submission

Step two: job initialization

Step three: task assignment

Step four: task execution

Step five: progress and status updates

Step six: job completion

For what is done in each step, you can read Chapter 6 of Hadoop: The Definitive Guide, which covers the working mechanism of MapReduce.

For users, processing big data with MapReduce means writing a MapReduce application for the problem at hand. How to develop programs with the MapReduce framework is therefore a matter that deserves deep thought and practice.

Sources:

1. http://www.wangluqing.com/2014/02/hadoop-mapreduce/

2. http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

3. http://hadoop.apache.org/docs/r1.0.4/cn/mapred_tutorial.html

4. "Refining into Gold" Hadoop Data Analytics Platform course

5. Hadoop: The Definitive Guide (2nd Edition), Chapter 6, "How MapReduce Works"
