MR Parsing
Mapper/Reducer encapsulates the data processing logic of the application.
All data stored in the underlying distributed file system must be interpreted as key/value pairs, which are handed to the map/reduce functions in MR to generate new key/value pairs.
Mapper
1) Initialization
Mapper inherits the JobConfigurable interface. Its configure method allows Mapper to be initialized through the JobConf parameter.
2) Map operations
MapReduce obtains a key/value pair from the InputSplit through the RecordReader in InputFormat and submits it to the map() function for processing:
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
3) Cleanup
Mapper obtains the close method by inheriting Closeable; implement this method to perform Mapper cleanup.
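A minimal sketch tying the three phases together, using the old org.apache.hadoop.mapred API (the class name and the "case.sensitive" parameter are illustrative, not from the original text):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private boolean caseSensitive;

  // 1) Initialization: read settings from the JobConf.
  @Override
  public void configure(JobConf job) {
    caseSensitive = job.getBoolean("case.sensitive", false);
  }

  // 2) Map operation: called once per input key/value pair.
  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
    output.collect(new Text(line), new LongWritable(1));
  }

  // 3) Cleanup: release any resources opened in configure().
  @Override
  public void close() throws IOException {
  }
}

MapReduceBase already provides empty configure and close implementations, so only the methods you actually need have to be overridden.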
Mapper types
ChainMapper: chains multiple Mappers within one job;
IdentityMapper: outputs the input directly without processing;
InverseMapper: swaps the key/value positions;
RegexMapper: splits strings using regular expressions;
TokenCountMapper: splits a string into several tokens and can be used as the Mapper of WordCount;
LongSumReducer: calculates the sum of values of the long type by key.
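For example, a complete word count can be assembled from these library classes alone (a sketch; the job name and the argument-based paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class LibWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LibWordCount.class);
    conf.setJobName("lib-wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(TokenCountMapper.class);  // emits (token, 1) per token
    conf.setReducerClass(LongSumReducer.class);   // sums the counts per token
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}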
In the new API, Mapper is changed from an interface to an abstract class. Instead of inheriting JobConfigurable and Closeable, it adds setup and cleanup methods directly to the class for initialization and cleanup.
Parameters are encapsulated in a Context object, which gives the interface good extensibility.
The MapRunnable interface is removed, and a run method is added to Mapper so that you can customize how the map() function is invoked.
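The default run method looks roughly like this (a simplified sketch of the new-API Mapper skeleton):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

Overriding run is what makes patterns such as multi-threaded mapping possible.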
In the new API, the type the Reducer uses to traverse values changes to Iterable:
public void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
    throws IOException, InterruptedException {
  for (VALUEIN value : values) {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
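For instance, a hypothetical summing Reducer written against this interface:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative example: sums the IntWritable values of each key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}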
Design and Implementation of the Partitioner Interface
Partitioner is used to partition the intermediate results produced by Mapper, so that data in the same group is handed to the same Reducer for processing. It directly affects load balancing in the Reduce stage.
It contains only one method to implement, getPartition, whose three parameters are all passed in by the framework. The first two are the key/value pair, and the third, numPartitions, indicates the number of partitions each Mapper's output is divided into, that is, the number of Reduce tasks.
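As an illustration, a hypothetical old-API Partitioner that routes keys by their first character, so that keys starting with the same letter reach the same Reducer:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, Text> {
  @Override
  public void configure(JobConf job) {
    // no configuration needed
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // Mask off the sign bit so the index is non-negative.
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}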
HashPartitioner and TotalOrderPartitioner are the typical implementations. HashPartitioner is the default:
public int getPartition(K2 key, V2 value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
TotalOrderPartitioner provides a range-based sharding method, which is usually used in full data sorting and merge sorting.
With the default approach, each MapTask performs local sorting in the Map stage, and a single ReduceTask performs global sorting in the Reduce stage. The job can then have only one ReduceTask, which becomes a bottleneck.
TotalOrderPartitioner instead divides the data into several intervals by size and ensures that all data in each interval is greater than all data in the previous interval.
Step 1: sample data.
Obtain the split points between the shards through sampling on the client.
Sample data: b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk
After sorting: abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr
If there are four Reduce tasks, the quartile points of the sample data are abd, bcd, and mnk.
Step 2: Map stage.
Mapper can use IdentityMapper to output the input data directly. TotalOrderPartitioner stores the split points obtained in Step 1 in a trie so that the interval of any record can be located quickly.
Each Map task thus produces R intervals (where R is the number of Reduce tasks), and the intervals are ordered relative to one another.
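Conceptually, locating a record's interval is a search over the sorted split points. A toy sketch (plain Java, not the actual trie implementation):

// splitPoints are sorted ascending, e.g. {"abd", "bcd", "mnk"} from Step 1.
static int findPartition(String key, String[] splitPoints) {
  int lo = 0, hi = splitPoints.length;
  while (lo < hi) {
    int mid = (lo + hi) >>> 1;
    if (key.compareTo(splitPoints[mid]) < 0) {
      hi = mid;        // key falls in an earlier interval
    } else {
      lo = mid + 1;    // key falls in a later interval
    }
  }
  return lo;           // partition index in [0, splitPoints.length]
}

The trie gives the same answer, but in time proportional to the key length rather than log R string comparisons.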
Step 3: Reduce stage.
Each Reducer performs local sorting on the data in its allocated interval, and the result is fully ordered data.
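Putting the three steps together, the job wiring might look like this (an old-API sketch, assuming the input is a SequenceFile of Text keys and values; the driver class, paths, and sampling parameters are placeholders):

// Classes come from org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib.
JobConf conf = new JobConf(TotalSort.class);    // TotalSort: placeholder driver class
conf.setJobName("total-sort");
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setNumReduceTasks(4);                      // R = 4 intervals
conf.setMapperClass(IdentityMapper.class);      // Step 2: pass records through
conf.setReducerClass(IdentityReducer.class);    // Step 3: framework sorts each interval
conf.setPartitionerClass(TotalOrderPartitioner.class);
FileInputFormat.setInputPaths(conf, new Path("/input"));
FileOutputFormat.setOutputPath(conf, new Path("/output"));

// Step 1: sample the input and write the split points to a partition file.
TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/_partitions"));
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
InputSampler.writePartitionFile(conf, sampler);

JobClient.runJob(conf);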
TotalOrderPartitioner has two typical applications: TeraSort and HBase.
Data within each HBase Region is ordered, and the Regions themselves are ordered as well.
Original article: In-Depth Analysis of MapReduce Architecture Design and Implementation Principles, Reading Notes (4): MR and Partitioner. Thanks to the original author for sharing.