MR Parsing
Mapper/Reducer encapsulates the data processing logic of the application.
All data stored in the underlying distributed file system must be interpreted as key/value pairs, which are handed to the map/reduce functions in MR to generate new key/value pairs.
Mapper
1) Initialization
Mapper inherits the JobConfigurable interface. Its configure method allows Mapper to be initialized through the JobConf parameter.
2) Map operations
MapReduce obtains a key/value pair from the InputSplit through the RecordReader in InputFormat and submits it to the map() function for processing:
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
3) Cleanup
Mapper obtains the close method by inheriting Closeable; implement this method to perform Mapper cleanup.
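A minimal sketch tying the three phases together, using the old org.apache.hadoop.mapred API (the class name and the "case.sensitive" parameter are illustrative, not from the original text):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private boolean caseSensitive;

  // 1) Initialization: read settings from the JobConf.
  @Override
  public void configure(JobConf job) {
    caseSensitive = job.getBoolean("case.sensitive", false);
  }

  // 2) Map operation: called once per input key/value pair.
  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
    output.collect(new Text(line), new LongWritable(1));
  }

  // 3) Cleanup: release any resources opened in configure().
  @Override
  public void close() throws IOException {
  }
}

MapReduceBase already provides empty configure and close implementations, so only the methods you actually need have to be overridden.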
Mapper types
ChainMapper: chains multiple Mappers within one job;
IdentityMapper: outputs the input directly without processing;
InverseMapper: swaps the key/value positions;
RegexMapper: splits strings using regular expressions;
TokenCountMapper: splits a string into several tokens and can be used as the Mapper of WordCount;
LongSumReducer: calculates the sum of values of the long type by key.
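For example, a complete word count can be assembled from these library classes alone (a sketch; the job name and the argument-based paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class LibWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LibWordCount.class);
    conf.setJobName("lib-wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(TokenCountMapper.class);  // emits (token, 1) per token
    conf.setReducerClass(LongSumReducer.class);   // sums the counts per token
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}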
In the new API, Mapper is changed from an interface to an abstract class. Instead of inheriting JobConfigurable and Closeable, it adds setup and cleanup methods directly to the class for initialization and cleanup.
Parameters are encapsulated in a Context object, which gives the interface good extensibility.
The MapRunnable interface is removed, and a run method is added to Mapper so that you can customize how the map() function is invoked.
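The default run method looks roughly like this (a simplified sketch of the new-API Mapper skeleton):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}

Overriding run is what makes patterns such as multi-threaded mapping possible.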
In the new API, the type the Reducer uses to traverse values changes to Iterable:
public void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
    throws IOException, InterruptedException {
  for (VALUEIN value : values) {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
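For instance, a hypothetical summing Reducer written against this interface:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative example: sums the IntWritable values of each key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}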
Design and Implementation of the Partitioner Interface
Partitioner is used to partition the intermediate results produced by Mapper, so that data in the same group is handed to the same Reducer for processing. It directly affects load balancing in the Reduce stage.
It contains only one method to implement, getPartition, whose three parameters are all passed in by the framework. The first two are the key/value pair, and the third, numPartitions, indicates the number of partitions each Mapper's output is divided into, that is, the number of Reduce tasks.
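As an illustration, a hypothetical old-API Partitioner that routes keys by their first character, so that keys starting with the same letter reach the same Reducer:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, Text> {
  @Override
  public void configure(JobConf job) {
    // no configuration needed
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // Mask off the sign bit so the index is non-negative.
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}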
HashPartitioner and TotalOrderPartitioner are the typical implementations. HashPartitioner is the default:
public int getPartition(K2 key, V2 value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
TotalOrderPartitioner provides a range-based sharding method, which is usually used in full data sorting and merge sorting.
With the default approach, each MapTask performs local sorting in the Map stage, and a single ReduceTask performs global sorting in the Reduce stage. The job can then have only one ReduceTask, which becomes a bottleneck.
TotalOrderPartitioner instead divides the data into several intervals by size and ensures that all data in each interval is greater than all data in the previous interval.
Step 1: sample data.
Obtain the split points between the shards through sampling on the client.
Sample data: b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk
After sorting: abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr
If there are four Reduce tasks, the quartile points of the sample data are abd, bcd, and mnk.
Step 2: Map stage.
Mapper can use IdentityMapper to output the input data directly. TotalOrderPartitioner stores the split points obtained in Step 1 in a trie so that the interval of any record can be located quickly.
Each Map task thus produces R intervals (where R is the number of Reduce tasks), and the intervals are ordered relative to one another.
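Conceptually, locating a record's interval is a search over the sorted split points. A toy sketch (plain Java, not the actual trie implementation):

// splitPoints are sorted ascending, e.g. {"abd", "bcd", "mnk"} from Step 1.
static int findPartition(String key, String[] splitPoints) {
  int lo = 0, hi = splitPoints.length;
  while (lo < hi) {
    int mid = (lo + hi) >>> 1;
    if (key.compareTo(splitPoints[mid]) < 0) {
      hi = mid;        // key falls in an earlier interval
    } else {
      lo = mid + 1;    // key falls in a later interval
    }
  }
  return lo;           // partition index in [0, splitPoints.length]
}

The trie gives the same answer, but in time proportional to the key length rather than log R string comparisons.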
Step 3: Reduce stage.
Each Reducer performs local sorting on the data in its allocated interval, and the result is fully ordered data.
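Putting the three steps together, the job wiring might look like this (an old-API sketch, assuming the input is a SequenceFile of Text keys and values; the driver class, paths, and sampling parameters are placeholders):

// Classes come from org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib.
JobConf conf = new JobConf(TotalSort.class);    // TotalSort: placeholder driver class
conf.setJobName("total-sort");
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setNumReduceTasks(4);                      // R = 4 intervals
conf.setMapperClass(IdentityMapper.class);      // Step 2: pass records through
conf.setReducerClass(IdentityReducer.class);    // Step 3: framework sorts each interval
conf.setPartitionerClass(TotalOrderPartitioner.class);
FileInputFormat.setInputPaths(conf, new Path("/input"));
FileOutputFormat.setOutputPath(conf, new Path("/output"));

// Step 1: sample the input and write the split points to a partition file.
TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/_partitions"));
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
InputSampler.writePartitionFile(conf, sampler);

JobClient.runJob(conf);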
TotalOrderPartitioner has two typical applications: TeraSort and HBase.
Data within each HBase Region is ordered, and the Regions themselves are ordered as well.
Original article: In-Depth Analysis of MapReduce Architecture Design and Implementation Principles, Reading Notes (4): MR and Partitioner. Thanks to the original author for sharing.