The execution process of a Mapper task:
- In the first stage, the input files are divided into input splits (InputSplit) according to a certain rule, using a fixed split size. By default, the size of an input split is the same as the size of a data block (block). Suppose the block size is the default value of 64 MB and there are two input files, one of 32 MB and one of 72 MB. The small file forms one input split, and the large file is divided into two data blocks and therefore two input splits (64 MB and 8 MB), so three input splits are produced in total. Each input split is processed by one Mapper process, so here the three input splits are processed by three Mapper processes.
- In the second stage, the records in each input split are parsed into key-value pairs according to certain rules. The default rule parses each line of text into one key-value pair: the key is the starting position of the line (as a byte offset), and the value is the text content of that line.
- In the third stage, the map method of the Mapper class is called. The map method is called once for each key-value pair parsed in the second stage, so 1000 key-value pairs result in 1000 calls. Each call to the map method outputs zero or more key-value pairs.
- In the fourth stage, the key-value pairs output by the third stage are partitioned according to certain rules. Partitioning is based on the key. For example, if the key is a province (such as Beijing, Shanghai, or Shandong), the pairs can be partitioned by province, so that key-value pairs for the same province go into the same partition. By default there is only one partition. The number of partitions equals the number of Reducer tasks that run, and there is only one Reducer task by default.
- In the fifth stage, the key-value pairs in each partition are sorted: first by key, and then, for pairs with the same key, by value. For example, given the three pairs <2,2>, <1,3>, and <2,1>, where both keys and values are integers, the sorted result is <1,3>, <2,1>, <2,2>. If a sixth stage is present, processing continues with it; otherwise, the output is written directly to a local Linux file.
- In the sixth stage, the data undergoes reduce processing. The reduce method is called once for each group of key-value pairs that share the same key. After this stage, the amount of data is smaller. The result is written to a local Linux file. This stage does not run by default; the user must add the code for it themselves.
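
The first five stages can be sketched as a small simulation. This is only a simplified model written in plain Python, not Hadoop code: every function name here (`input_splits`, `parse_records`, `map_fn`, `partition`, `run_map_task`) is hypothetical, the split rule ignores the finer details of how Hadoop actually computes splits, the hash-based partition rule is an assumed choice, and the "province amount" line format is an assumed input used only for illustration.

```python
import zlib

MB = 1024 * 1024

def input_splits(file_sizes, block_size=64 * MB):
    """Stage 1 (simplified): cut each file into splits of at most block_size bytes."""
    splits = []
    for name, size in file_sizes.items():
        offset = 0
        while offset < size:
            length = min(block_size, size - offset)
            splits.append((name, offset, length))
            offset += length
    return splits

def parse_records(data):
    """Stage 2: yield (byte offset of the line start, line text) for each line."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode("utf-8")
        offset += len(line)

def map_fn(offset, line):
    """Stage 3 (hypothetical map logic): emit (province, amount) for 'Beijing 3'."""
    fields = line.split()
    if len(fields) == 2:  # malformed lines emit zero pairs
        yield fields[0], int(fields[1])

def partition(key, num_partitions):
    """Stage 4 (assumed rule): choose a partition from a stable hash of the key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def run_map_task(data, num_partitions=1):
    """Run stages 2 to 5 on the bytes of one input split."""
    partitions = [[] for _ in range(num_partitions)]
    for offset, line in parse_records(data):
        for key, value in map_fn(offset, line):
            partitions[partition(key, num_partitions)].append((key, value))
    # Stage 5: tuples compare by key first, then by value.
    return [sorted(p) for p in partitions]

if __name__ == "__main__":
    # The 32 MB and 72 MB files from the text produce three splits.
    for split in input_splits({"small": 32 * MB, "large": 72 * MB}):
        print(split)
    print(run_map_task(b"Beijing 3\nShanghai 1\nBeijing 2\n"))
```

With the default of one partition, every pair lands in partition 0, which mirrors the single Reducer task described above. Passing a larger `num_partitions` spreads the keys across partitions while keeping all pairs with the same key together.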
The execution process of a Reducer task
MapReduce Execution Process