From: http://caibinbupt.iteye.com/blog/336467
Everyone is familiar with file systems, so before analyzing HDFS we did not spend much time introducing its background; file systems are well understood and well documented. Before analyzing Hadoop MapReduce, however, we should first understand how the system works as a whole, and only then move into the analysis sections. The figure below illustrates this with an example.
Take the WordCount program shipped with Hadoop as an example (the command used to start it is shown below):
hadoop jar hadoop-0.19.0-examples.jar wordcount /usr/input /usr/output
After the user submits the job, it is coordinated by the JobTracker. The map stage (M1, M2, and M3 in the figure) is executed first, followed by the reduce stage (R1 and R2 in the figure). Both map and reduce tasks are monitored by a TaskTracker and run in Java virtual machines that are separate from the TaskTracker process itself.
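To make the submission step concrete, here is a minimal driver sketch, assuming the old org.apache.hadoop.mapred API that shipped with Hadoop 0.19. WordCountDriver, WordCountMapper, and WordCountReducer are hypothetical names used only for illustration, not the actual classes inside hadoop-0.19.0-examples.jar.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);          // key type of the final <K, V> output
        conf.setOutputValueClass(IntWritable.class); // value type of the final <K, V> output

        conf.setMapperClass(WordCountMapper.class);    // map stage (M1..M3 in the figure)
        conf.setReducerClass(WordCountReducer.class);  // reduce stage (R1, R2 in the figure)

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory on HDFS

        JobClient.runJob(conf);  // submit the job to the JobTracker and wait for completion
    }
}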
Both the input and the output are directories on HDFS (as shown in the figure). The input is described by an implementation of the InputFormat interface; implementations exist for, among others, plain text files and JDBC databases, each handling its own kind of data source and exposing some of its characteristics. From the InputFormat implementation you can obtain implementations of the InputSplit interface, which divide the input data (split1 through split5 in the figure are the result of this division). You can also obtain a RecordReader implementation from the InputFormat, which turns the input into <K, V> pairs. Once the <K, V> pairs are available, the map operation can start.
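As a sketch of the map side, the hypothetical WordCountMapper below consumes the <K, V> pairs produced by the RecordReader of the default text InputFormat (key: byte offset of the line, value: the line itself) and emits a <word, 1> pair for each word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Receives <byte offset, line of text> pairs from the RecordReader
// and emits a <word, 1> pair for every word in the line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);  // hand the <word, 1> pair to the framework
        }
    }
}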
The map operation writes its results through OutputCollector.collect, that is, it collects them into the output context. Once the mapper's outputs have been collected, the Partitioner class determines, in a configurable way, which output partition each record is written to. We can also provide a Combiner for the Mapper: when the mapper emits its <K, V> pairs, they are not written to the output immediately; instead they are buffered in lists (one key with a list of values). When a certain number of key-value pairs have accumulated, that part of the buffer is merged by the Combiner and then passed on to the Partitioner (the yellow parts of M1 in the figure correspond to the Combiner and the Partitioner).
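In the driver sketch above, the Combiner and Partitioner could be hooked up as follows (still a sketch under the same assumptions; reusing the reducer as the Combiner works for word counting because summing counts is associative, and HashPartitioner is the framework's default key-hash partitioner):

import org.apache.hadoop.mapred.lib.HashPartitioner;

// Inside WordCountDriver, after the mapper and reducer are set:
conf.setCombinerClass(WordCountReducer.class);   // locally merge <word, 1> pairs on the map side
conf.setPartitionerClass(HashPartitioner.class); // route each key to a reducer by key hash (the default)
conf.setNumReduceTasks(2);                       // two reducers, R1 and R2 in the figure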
After the map stage is completed, execution enters the reduce stage, which involves three steps: shuffle, sort, and reduce.
In the shuffle step, the Hadoop MapReduce framework routes the map results according to their keys, transferring all intermediate results for a given key to a single reducer (intermediate results with the same key produced by different mappers may be scattered across different machines; after this step they have all been moved to the machine running the reducer responsible for that key). The file transfers in this step use the HTTP protocol.
Sort is performed together with shuffle. In this phase, the <key, value> pairs with the same key coming from different mappers are merged together.
In the reduce step, the <key, (list of values)> pairs obtained after shuffle and sort are passed to the Reducer.reduce method for processing. The results are then written out to HDFS through the OutputFormat.
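Finally, a sketch of the hypothetical WordCountReducer assumed by the driver above: for each key it receives the list of values gathered by shuffle and sort (possibly pre-merged by the Combiner), sums them, and emits the total, which the configured OutputFormat then writes to HDFS.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Receives <word, (list of counts)> after shuffle and sort, sums the counts,
// and emits <word, total>; the OutputFormat writes the result to HDFS.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}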