1. Data flow in MapReduce
(1) The simplest process: map-reduce
(2) The process with a custom partitioner that sends each map result to a specific reducer: map - partition - reduce
(3) The process with a local reduce (optimization) step added on the map side: map - combine (local reduce) - partition - reduce
2. The concept and use of partition in MapReduce
(1) Principle and function of partition
After the map phase, which reducer does each record go to? Hadoop's default is to distribute records based on the hash of the key, but in practice this is not always efficient, nor does it always do what we want. For example, after partitioning, one reducer might be assigned 20 records while another gets 100,000; it is easy to imagine how inefficient that is. Or we may want the output files to follow a certain pattern: with two reducers, we might want part-00000 to store all records beginning with "H" and part-00001 to store the rest. The default partitioner cannot do either of these things, so we need a custom partitioner that routes records to reducers according to our own requirements. Writing one is simple: define a class that extends the Partitioner class and overrides its getPartition method, then register it by calling the job's setPartitionerClass.
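As a minimal sketch of the "H" example above (the class name and routing rule are illustrative, not part of the original text):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys beginning with "H" to reducer 0 and everything else to reducer 1.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions > 1 && key.toString().startsWith("H")) {
            return 0; // these records land in part-00000
        }
        return 1 % numPartitions; // everything else; still valid with one reducer
    }
}

In the driver, this would be registered with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).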
The results of the map are distributed to the reducers via the partitioner. The mapper's results may first be sent to a combiner for merging; the combiner has no base class of its own in the system but uses Reducer as its base class. Their external function is the same, but the location and context in which they are used differ. The <key, value> pairs finally produced by the mapper are sent to the reducers to be merged, and during this merging, pairs with the same key are sent to the same reducer. Which key is allocated to which reducer is decided by the partitioner, which has only one method:
getPartition(Text key, Text value, int numPartitions)
Its input is one <key, value> pair from the map output plus the number of reducers, and its output is the index (an integer) of the assigned reducer; that is, it specifies which reducer each key output by the mapper goes to. The system default partitioner is HashPartitioner, which takes the hash of the key modulo the number of reducers to obtain the corresponding reducer. This guarantees that identical keys are always assigned to the same reducer. If there are n reducers, they are numbered 0, 1, 2, ... (n-1).
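HashPartitioner's logic amounts to the following sketch (modulo the exact generic signature in your Hadoop version):

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result of the modulo is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}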
(2) Use of partition
Why is a partitioner needed to produce a globally ordered file with Hadoop? The simplest approach is to use a single partition, but that method is extremely inefficient for large files, because one machine must process all of the output, which throws away the benefits of the parallel architecture MapReduce provides. Instead, we can first create a series of sorted files, then concatenate them (similar to a merge sort), and finally obtain a globally ordered file. The main idea is to use a partitioner that describes the global sort order of the output. Say we have 1,000 values in the range 1-10000 and run 10 reduce tasks: with a suitable partitioner we can allocate the data in 1-1000 to the first reducer, the data in 1001-2000 to the second, and so on. That is, the data allocated to the nth reducer is all greater than the data in the (n-1)th. In this way, each reducer's output is sorted, and we just need to cat all the output files together into one large file, which is then globally ordered.
The basic idea is as above, but one question remains: how should the data intervals be divided when the amount of data is large and we do not know its distribution? A relatively simple method is sampling. If there are 100 million records, we can sample the data, say take 10,000 samples, and divide the intervals based on the sample. In Hadoop, we can replace the default partitioner with TotalOrderPartitioner and pass it the sampling result to implement the partitioning we want. For the sampling itself, we can use Hadoop's sampling tools in InputSampler, such as RandomSampler and IntervalSampler.
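A hedged sketch of that wiring (the partition-file path and sampler parameters are illustrative assumptions, not prescribed values):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Inside the job driver, after input/output formats and key types are set:
job.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("/tmp/partitions.lst"));   // hypothetical path
// Sample keys with probability 0.1, up to 10,000 samples from at most 100
// splits, and write the partition boundaries for TotalOrderPartitioner to read.
InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 100);
InputSampler.writePartitionFile(job, sampler);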
This lets us sort large amounts of data on a distributed file system. We can also supply our own comparison rules for the keys, so that strings or other non-numeric types can be sorted, or even sorted on two or more criteria.
3. The concept and use of grouping in MapReduce
The purpose of partitioning is to determine, based on the key, which reducer processes each mapper output record. Grouping is easier to understand: in the author's view, grouping is tied to the record's key, and within the same partition, records with the same key belong to the same group.
4. Use of the combiner in MapReduce
Many MapReduce programs are limited by the bandwidth available on the cluster, so it pays to minimize the intermediate data that must be transferred between the map and reduce tasks. Hadoop allows the user to declare a combiner function that processes the output of the map; its result then becomes the input to the reduce function. Because the combiner function is only an optimization, Hadoop does not guarantee how many times it will be called for a given map output. In other words, the reduce output must be the same regardless of how many times the combiner function is called.
The following example is taken from Hadoop: The Definitive Guide. Suppose the weather readings for the year 1950 were processed by two maps, with the output of the first map as follows:
(1950, 0)
(1950, 20)
(1950, 10)
The output from the second map is:
(1950, 25)
(1950, 15)
The input for reduce is (1950, [0, 20, 10, 25, 15]) and the output is (1950, 25).
Since 25 is the maximum value in the collection, we can use a combiner function, similar to the reduce function, to find the maximum value in each map's output, so that the input to reduce becomes:
(1950, [20, 25])
The processing of the temperature values can then be expressed as follows: max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25.
Note: not all functions have this property (functions that do, we call commutative and associative). For example, if we want to calculate the average temperature, we cannot use the combiner function this way, because mean(0, 20, 10, 25, 15) = 14, while mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15.
The combiner function cannot replace the reduce function (the reduce function is still required to process records with the same key coming from different maps), but it can help cut down the data that needs to be transferred between map and reduce, and for that reason the combiner function is always worth considering.
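As a minimal sketch of how this looks in code (the class name is illustrative; max is safe as a combiner precisely because it is commutative and associative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Finds the maximum temperature per year; because max is commutative and
// associative, the same class can serve as both combiner and reducer.
public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(year, new IntWritable(max));
    }
}

// In the driver:
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);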
5. The sorting process in the shuffle stage in detail
Let's first look at the overall flow of sorting in MapReduce.
The MapReduce framework ensures that every reducer's input is sorted by key. In general, the process of sorting the map output and transferring it to the reducers is called the shuffle. Each map task has a circular memory buffer, 100 MB by default, and the map first writes its output into this buffer. When the buffered content reaches a threshold (by default 80% of the buffer), a background thread begins writing the contents to disk, a process known as a "spill". While the spill is in progress, the map can still write results into the buffer; if the buffer fills up, the map blocks and waits.
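These knobs can be set in the job configuration; a brief sketch, assuming Hadoop 2.x property names (the values shown are the defaults just described):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 100);             // size of the in-memory sort buffer, in MB
conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // buffer fill ratio that triggers a spill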
The specific spill process is as follows. First, the background thread divides the output into groups according to the number of reducers, one group per reducer. Second, within each group, the background thread sorts the output by key. During sorting, if a combiner function is defined, it is called on the sorted results. Each spill produces one spill file on disk, so a map task may produce multiple spill files. When the map has written its last output, it merges and sorts all the spill files into a single final output file; the combiner function may be called again during this merge. Looking at the whole process, the number of times the combiner function is called is indeterminate. Below we focus on the sorting within the shuffle phase:
The ordering in the shuffle phase can be understood in two parts. The first is sorting within the spilled partitions: because a partition contains more than one key, the <key, value> pairs in a partition are sorted by key, i.e. pairs with the same key are stored together, so that each partition is ordered by key.
The second part is not a sort but a merge, and the merge happens twice. The first time is on the map side, where multiple spill files are merged by partition, and by key within each partition, to form one large file. The second merge is on the reduce side, where the outputs of the multiple maps feeding the same reducer are merged together. This merge is somewhat harder to understand: it does not ultimately form one large file, and the data may live both in memory and on disk. So the merge in the shuffle phase is not a sort in the strict sense; it merges multiple individually ordered files into a larger whole. Because different task executions can produce map outputs in different orders, the merged result is not always identical, but it strictly respects the partitioning, and within each partition the <key, value> pairs with the same key are adjacent.
Summary of shuffle sorting: if only the map function is defined and no reduce function is supplied, the shuffle alone sorts the data, so outputs with the same key end up together and smaller keys come first. The result is ordered in a macro sense, not necessarily end to end, because with the default HashPartitioner, keys whose hashes are congruent modulo the number of reducers land in the same partition. If the key is an IntWritable, the keys within each partition are sorted, but the values for each key are not ordered.
6. The principle and implementation of secondary sort in MapReduce
(1) The task
We need to process the following sample.txt file into the target output shown below:
Source file: sample.txt
BBB 654
CCC 534
DDD 423
AAA 754
BBB 842
CCC 120
DDD 219
AAA 344
BBB 214
CCC 547
DDD 654
AAA 122
BBB 102
CCC 479
DDD 742
AAA 146
Target: part-r-00000
AAA 122
BBB 102
CCC 120
DDD 219
(2) Working principle
Outline of the process:
1. Define a composite key that contains both the natural key and the value, in this example MyPariWritable.
2. A custom key comparator sorts records by the composite key, i.e. by the natural key and then by the value (e.g. AAA and 122 combined into one key).
3. The partitioner for the composite key (this example uses the default HashPartitioner) and the grouping comparator consider only the natural key when partitioning and grouping. A condensed code sketch of these pieces follows.
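A condensed sketch of points 1-3, assuming the minimum per natural key is wanted (the class names follow the text; the details are illustrative rather than the author's exact code, and each public class would live in its own source file):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Point 1: a composite key holding the natural key and the value.
public class MyPariWritable implements WritableComparable<MyPariWritable> {
    private Text naturalKey = new Text();          // e.g. "AAA"
    private IntWritable value = new IntWritable(); // e.g. 122

    public void set(String k, int v) { naturalKey.set(k); value.set(v); }
    public Text getNaturalKey() { return naturalKey; }
    public int getValue() { return value.get(); }

    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        value.write(out);
    }
    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        value.readFields(in);
    }
    // Point 2: sort by natural key first, then ascending by value, so the
    // first composite key in each group carries that group's minimum.
    public int compareTo(MyPariWritable o) {
        int cmp = naturalKey.compareTo(o.naturalKey);
        return cmp != 0 ? cmp : value.compareTo(o.value);
    }
    // Point 3 (partitioning): hash only the natural key, so the default
    // HashPartitioner sends all records for one natural key to the same reducer.
    @Override
    public int hashCode() { return naturalKey.hashCode(); }
}

// Point 3 (grouping): compare only the natural key, so every composite key
// with the same natural key is handed to a single reduce() call.
public class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() { super(MyPariWritable.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((MyPariWritable) a).getNaturalKey()
                .compareTo(((MyPariWritable) b).getNaturalKey());
    }
}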
Detailed process:
First, in the map phase, the InputFormat defined by job.setInputFormatClass splits the input data set into small chunks (splits), and the InputFormat also provides a RecordReader implementation. In this example TextInputFormat is used, and the RecordReader it provides takes the byte offset of each line of text as the key and the text of the line as the value. That is why the input type of the custom map is <LongWritable, Text>. Then the map method of the custom mapper is called, with each <LongWritable, Text> pair passed to it. Note that the output must conform to the output type declared in the custom mapper, <MyPariWritable, NullWritable>; the end result of the phase is a list of <MyPariWritable, NullWritable> pairs.
At the end of the map phase, the partitioner set by job.setPartitionerClass is called to partition this list, and each partition is mapped to one reducer. Within each partition, the key comparator class set by job.setSortComparatorClass is used to sort the keys. As you can see, this is in itself a secondary sort.
In the reduce phase, each reducer receives all map outputs mapped to it, and it again sorts all of its data pairs using the key comparator class set by job.setSortComparatorClass. It then begins to construct a value iterator for each key. Here we rely on the grouping comparator class set by job.setGroupingComparatorClass: any two keys that this comparator considers equal belong to the same group. In this example we want the minimum value in each partition, so when comparing keys of type MyPariWritable only the natural key is compared. This guarantees that as long as two MyPariWritable keys have the same natural key, they are treated as belonging to the same group at the reduce end. Because the key for a group is taken from the first record in the group, and the data has already been sorted by the custom MyPariWritable comparator, that first key contains exactly the minimum value for its natural key. All of the group's values are placed in one value iterator, and the key of that iterator is the first key among all the keys belonging to the group.
The last step is to enter the reducer's reduce method, whose input is each key together with its value iterator. Also note that the input and output types must be consistent with the declarations in the custom reducer.
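A sketch of the remaining pieces under the same assumptions (MinMapper and MinReducer are illustrative names; getNaturalKey and getValue refer to the composite key sketched earlier, and each public class would live in its own source file):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MinMapper extends Mapper<LongWritable, Text, MyPariWritable, NullWritable> {
    private final MyPariWritable outKey = new MyPariWritable();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\\s+"); // e.g. "BBB 654"
        outKey.set(parts[0], Integer.parseInt(parts[1]));
        context.write(outKey, NullWritable.get());
    }
}

public class MinReducer extends Reducer<MyPariWritable, NullWritable, Text, IntWritable> {
    @Override
    protected void reduce(MyPariWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // reduce() runs once per natural-key group; the group's key is the
        // first (smallest) composite key, i.e. the minimum value for that key.
        context.write(key.getNaturalKey(), new IntWritable(key.getValue()));
    }
}

// Driver wiring (input/output paths omitted):
job.setMapperClass(MinMapper.class);
job.setMapOutputKeyClass(MyPariWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
job.setReducerClass(MinReducer.class);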