Understanding MapReduce

Source: Internet
Author: User
Tags: shuffle

1. What issues should you consider when designing a parallel computing framework?

The first question is: does parallel computing necessarily involve more than one computer, and if so, how are tasks divided among the machines?

There must be a module that distributes the tasks. It acts as the "boss" (the master), and it maintains the tasks and the resources.

In Hadoop 1.x this master is the JobTracker; in Hadoop 2.x this role is taken over by YARN, where the ResourceManager manages the other nodes and decides how tasks are distributed.

The "younger brothers" (the workers) are the TaskTrackers in Hadoop 1.x, and the NodeManagers in Hadoop 2.x; a NodeManager starts a YarnChild process to run the actual computation over the data.

The second question is: where does the data needed for the parallel computation come from?

A job can be very large; if everything were kept on the master's side, wouldn't the pressure on it be enormous? So the data is stored separately, in HDFS. The client asks the ResourceManager for permission to run the job, then places the required jar packages and dependencies on the HDFS nodes so that each worker can fetch what its task needs; the master only has to tell the workers where to look.

The third question is: how are the results of the parallel computation collected?

The results computed in parallel are ultimately written to HDFS. They cannot be written to the master, which must stay free to accept new jobs from other clients; nor can they simply be left on each node, where the data would be too scattered. So HDFS was chosen as the place to keep the final output.

The fourth question is: what happens when some tasks in this process fail, and how is the lost work compensated for?

The workers use RPC communication (the so-called heartbeat mechanism) to report back to the master periodically. When a worker fails, the master asks another NodeManager to redo its tasks, compensating for the lost computation.
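The heartbeat idea can be sketched in plain Java. This is a minimal illustration, not Hadoop's actual implementation: the class and method names (`HeartbeatMonitor`, `beat`, `isDead`) are invented for this example. The master records the last report time per worker and treats any worker that misses the timeout as dead, so its tasks can be reassigned.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of heartbeat-based failure detection; the class and
// method names are invented for this example, not Hadoop's real API.
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastBeat = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // A worker reports in over RPC; record the time of the beat.
    public void beat(String workerId, long nowMillis) {
        lastBeat.put(workerId, nowMillis);
    }

    // A worker is considered dead if it has never reported or has not
    // reported within the timeout; the master would then reschedule its
    // tasks on another worker node.
    public boolean isDead(String workerId, long nowMillis) {
        Long last = lastBeat.get(workerId);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```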

2. What is the execution flow of MapReduce?

Client

JobTracker

InputSplit -> mapper() \
                        map output --- shuffle --- reducer() ---> output
InputSplit -> mapper() /

InputSplit: one InputSplit corresponds to one map task, and each line of the split is handled by one call to the mapper function.

Mapper output: [hello, 1] [zhang, 1] [san, 1]

Shuffle: groups the mapper output by key. For example, all the hello pairs are grouped into [hello, (1, 1, 1)]

Reducer: sums each group and outputs, e.g.

hello 3

zhangsan 1
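The flow above can be simulated in memory with plain Java, without any Hadoop dependency. A real job would use Hadoop's Mapper and Reducer classes; this sketch only mirrors the data flow: map emits (word, 1) pairs, shuffle groups them by key into e.g. [hello, (1, 1, 1)], and reduce sums each group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory simulation of the map -> shuffle -> reduce flow for word count.
// Real Hadoop jobs use Mapper/Reducer classes; this only mirrors the data flow.
public class WordCountFlow {

    // Map: emit a (word, 1) pair for every word in a line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle: group the emitted pairs by key, e.g. [hello, (1, 1, 1)].
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce: sum each group's values, e.g. [hello, (1, 1, 1)] -> (hello, 3).
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) {
                sum += v;
            }
            result.put(g.getKey(), sum);
        }
        return result;
    }
}
```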


Data transferred across the network in Hadoop must be serialized (the data is first written to disk as a file, then sent over the network from disk).

Hadoop uses its own efficient serialization mechanism instead of Java's built-in one (where types such as String and Long implement Serializable).

Types used in Hadoop's serialization mechanism must implement the Writable interface.
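The Writable contract consists of two methods, write(DataOutput) and readFields(DataInput), which write and read the fields in a fixed order with no class metadata; that compactness is why it is more efficient than Java's Serializable. The sketch below mimics that contract using only java.io, without depending on the Hadoop jars; `WordCountRecord` and `roundTrip` are made-up names for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Mimics Hadoop's Writable contract (write / readFields) using only java.io.
// WordCountRecord is a made-up type for illustration, not a Hadoop class.
public class WordCountRecord {
    String word;
    int count;

    public WordCountRecord() {}

    public WordCountRecord(String word, int count) {
        this.word = word;
        this.count = count;
    }

    // Serialize the fields in a fixed, compact order; unlike
    // java.io.Serializable, no class metadata is written.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    // Deserialize the fields in exactly the order they were written.
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    // Round-trip helper: serialize to bytes, then read back into a new record.
    public static WordCountRecord roundTrip(WordCountRecord r) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            r.write(new DataOutputStream(buf));
            WordCountRecord copy = new WordCountRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```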





This article is from the "Jane Answers Life" blog; please contact the author before reproducing it.

