1. When designing a parallel computing framework, what issues should you consider?
The first question: does parallel computing require more than one machine, and if so, how are tasks divided among the machines?
The framework needs a module that distributes tasks: a master that maintains the tasks and the resources. In Hadoop 1.x this master is MapReduce's JobTracker; in Hadoop 2.x it is YARN's ResourceManager, which manages the other nodes and decides how tasks are distributed.
The workers are the TaskTrackers in Hadoop 1.x and the NodeManagers in Hadoop 2.x; each NodeManager starts a YarnChild process to run the actual computation on the data.
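To make the master/worker split concrete, here is a minimal sketch (assuming the Hadoop 2.x client libraries are on the classpath; the class name is illustrative) of the configuration switch that routes job submission to YARN, i.e. to the ResourceManager described above:

import org.apache.hadoop.conf.Configuration;

public class YarnFrameworkSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "yarn" routes job submission to the ResourceManager (the master);
        // NodeManagers (the workers) then launch YarnChild processes for the tasks.
        conf.set("mapreduce.framework.name", "yarn");
        System.out.println("framework = " + conf.get("mapreduce.framework.name"));
    }
}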
The second question: where does the data needed for the parallel computation come from?
If a job is very large and everything is funneled through the master, the pressure on the master becomes enormous. So the data is stored externally, in HDFS: the client asks the ResourceManager for permission to run the job, then puts the required jar and its dependencies on HDFS so that every node can fetch what it needs; the master only has to tell the nodes where to look.
The third question: how are the results of the parallel computation collected?
The computed results are ultimately written back to HDFS. They cannot be written to the master, which must stay free to accept new jobs from other clients, and they cannot be left on each worker node, where the data would be too scattered; so HDFS is again the natural place to keep them.
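The second and third questions can be tied together in a minimal job driver sketch (the /input and /output paths and the class names are placeholders, not from the original article): input is read from HDFS, the jar is shipped to the nodes, and the final results are written back to HDFS rather than to the master:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);   // the jar is shipped via HDFS to the nodes
        job.setMapperClass(WordCountMapper.class);  // see the mapper/reducer sketch further below
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // data read from HDFS
        FileOutputFormat.setOutputPath(job, new Path("/output")); // results written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}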
The fourth question: what happens when some tasks in this process fail, and how is that compensated for?
The nodes use RPC communication (the so-called heartbeat mechanism) to report back to the master periodically; when a task fails, the master asks another NodeManager to redo the work, compensating for the lost computation.
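The compensation works because a failed task attempt is simply rescheduled on another node until a retry limit is reached. A minimal sketch of those limits (the keys are standard Hadoop 2.x properties; 4 is the usual default, assumed here):

import org.apache.hadoop.conf.Configuration;

public class RetryConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // A map or reduce task attempt that fails is rescheduled on another
        // NodeManager until this many attempts are used up; then the job fails.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
    }
}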
2. What is the execution flow of MapReduce?
Client
JobTracker
InputSplit -> mapper()  \
                          map output --- shuffle --- reducer() ---> output
InputSplit -> mapper()  /
InputSplit: one InputSplit corresponds to one map task, and each line of the split is passed to one call of the map() function.
Mapper output: [hello, 1] [zhangsan, 1]
Shuffle: groups the intermediate results by key, e.g. all the hello pairs become [hello, (1, 1, 1)]
Reducer output: hello 3
                zhangsan 1
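The flow above is exactly word count. Here is a minimal mapper/reducer sketch matching it (the class names match the hypothetical driver sketch earlier):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One map() call per line of the InputSplit: emits [word, 1] for each word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // mapper output: [hello, 1] ...
            }
        }
    }
}

// After the shuffle groups by key ([hello, (1, 1, 1)]), reduce() sums the ones.
class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> ones, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable one : ones) sum += one.get();
        context.write(key, new LongWritable(sum)); // e.g. hello 3
    }
}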
Data transferred over the network in Hadoop must be serialized (map output is first written to local disk, then sent over the network from disk).
Hadoop replaces Java's built-in serialization (Serializable types such as String and Long) with its own, more efficient mechanism:
any type used in Hadoop serialization must implement the Writable interface (e.g. Text instead of String, LongWritable instead of Long).
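A minimal sketch of a custom type implementing Writable (the type and its fields are illustrative): write() serializes the fields in a fixed order, and readFields() must read them back in exactly the same order. Keys additionally need WritableComparable so the shuffle can sort them.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom value type that Hadoop can serialize across the network.
public class WordStat implements Writable {
    private long count;
    private long length;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(count);   // serialize the fields in a fixed order...
        out.writeLong(length);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        count = in.readLong();  // ...and read them back in the same order
        length = in.readLong();
    }
}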