Hadoop Learning Notes - 9. Partitioner and Custom Partitioner


I. A First Look at Partitioner

1.1 Reviewing the five steps of the map phase

In the fourth post, "Getting Started with MapReduce," we learned about the eight steps of MapReduce, five of which belong to the map phase, as shown in the figure:

Step 1.3 is the partitioning operation. From earlier posts we know that the mapper's final output is a set of <key, value> pairs, which are sent to the reducers for merging; during this merge, all pairs with the same key must be sent to the same reducer node. The process of deciding which key goes to which reducer is governed by the Partitioner. In many cluster applications, such as distributed caches, data is spread evenly across nodes by a hash function, and Hadoop is no exception.

1.2 Hadoop built-in Partitioner

A MapReduce user typically specifies the number of reduce tasks, and hence the number of output files (R). A partition function is applied to the intermediate keys to decide which reduce task each record flows into. The default partition function uses hashing (the familiar hash(key) mod R), which tends to produce well-balanced partitions. Accordingly, Hadoop ships with a default partitioner class, HashPartitioner, which extends the Partitioner class and provides a getPartition method, defined as follows:

```java
/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```

Now let's look at what HashPartitioner is doing. Its key code is a single expression: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks

The purpose of this code is to distribute keys evenly across the reduce tasks. For example, if the key is a Text, its hashCode method behaves like String's: it is computed with Horner's rule and yields an int. For long strings the computation can overflow into a negative value, so the code first ANDs the hash with Integer.MAX_VALUE (0x7FFFFFFF in hex, i.e. a zero sign bit followed by 31 ones) to force it non-negative, and then takes the remainder modulo the number of reduce tasks. This spreads the keys evenly over the reducers.
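The effect of the sign-bit mask can be demonstrated with a small standalone sketch (plain Java, no Hadoop dependencies; the class and method names here are my own, not Hadoop's):

```java
public class HashPartitionDemo {

    /** Same formula as HashPartitioner's key line, applied to plain Strings. */
    static int partition(String key, int numReduceTasks) {
        // & Integer.MAX_VALUE (0x7FFFFFFF) clears the sign bit, so the
        // result of % is never negative even when hashCode() is negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        // "polygenelubricants" is a well-known String whose hashCode is
        // Integer.MIN_VALUE, the most extreme negative hash possible
        String tricky = "polygenelubricants";
        System.out.println(tricky.hashCode());                 // negative
        System.out.println(partition(tricky, numReduceTasks)); // still in [0, 2)
    }
}
```

Without the mask, a negative hashCode would make `%` return a negative partition number and the job would fail with an invalid-partition error.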

II. Custom Partitioner

In most cases the default HashPartitioner is all we need, but sometimes a special requirement calls for a custom Partitioner. Continuing the mobile-internet-log example from the fifth post (custom data types), we will partition the log records in a special way:

Looking at the data, not every value in the second column is a mobile phone number (84138413, for example, is not). Our task is to total the traffic per number while writing phone numbers and non-phone numbers to different output files.

2.1 Custom KpiPartitioner
```java
/** Custom Partitioner class */
public static class KpiPartitioner extends Partitioner<Text, KpiWritable> {
    @Override
    public int getPartition(Text key, KpiWritable value, int numPartitions) {
        // Assign keys of different lengths to different reduce tasks
        int numLength = key.toString().length();
        if (numLength == 11) {
            return 0;
        } else {
            return 1;
        }
    }
}
```
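The length rule itself can be sanity-checked without a cluster. Below is a minimal plain-Java sketch of the same logic (the class name and sample values are illustrative, not from the original post):

```java
public class PartitionRuleDemo {

    /** Mirrors KpiPartitioner's rule: 11-digit keys (phone numbers) go to
        partition 0, everything else to partition 1. */
    static int getPartition(String key) {
        return key.length() == 11 ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("13726230503")); // 11 digits -> 0
        System.out.println(getPartition("84138413"));    // 8 digits  -> 1
    }
}
```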

Phone numbers are distinguished from other numbers by field length: an 11-digit value is treated as a phone number. The next step is to modify the code in the run method: set the job to run from a packaged jar, set the Partitioner to KpiPartitioner, and set the number of reduce tasks to 2:

```java
public int run(String[] args) throws Exception {
    // First delete the output directory if it already exists
    FileSystem fs = FileSystem.get(new URI(INPUT_PATH), getConf());
    Path outPath = new Path(OUTPUT_PATH);
    if (fs.exists(outPath)) {
        fs.delete(outPath, true);
    }
    // Define a job
    Job job = new Job(getConf(), "MyKpiJob");
    // Partitioning requires running from a packaged jar
    job.setJarByClass(MyKpiJob.class);
    // Set input directory
    FileInputFormat.setInputPaths(job, new Path(INPUT_PATH));
    // Set the custom mapper class
    job.setMapperClass(MyMapper.class);
    // Specify the <k2, v2> types
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(KpiWritable.class);
    // Set the partitioner and the number of reduce tasks
    job.setPartitionerClass(KpiPartitioner.class);
    job.setNumReduceTasks(2);
    // Set combiner
    job.setCombinerClass(MyReducer.class);
    // Set the custom reducer class
    job.setReducerClass(MyReducer.class);
    // Specify the <k3, v3> types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(KpiWritable.class);
    // Set output directory
    FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
    // Submit job
    System.exit(job.waitForCompletion(true) ? 0 : 1);
    return 0;
}
```

Note: this partitioning example must be packaged as a jar and run from it!

2.2 Package into a jar and run it in Hadoop

(1) Export the jar package via Eclipse.

(2) Upload it to Linux via FTP; any FTP tool will do (I generally use Xftp).

(3) Execute the program in the jar package through the Hadoop shell.
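A typical invocation might look like the following; the jar name, main class, and HDFS paths are assumptions for illustration, not taken from the original post:

```shell
# Submit the packaged job (jar name, class name, and paths are assumed)
hadoop jar MyKpiJob.jar MyKpiJob /testdir/input/HTTP_DATA.dat /testdir/output
```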

  

(4) View the result files:

The first is part-r-00000, which holds the statistics for mobile phone numbers;

then part-r-00001, which holds the statistics for non-mobile numbers.

(5) Verify the partitioner's behavior via the web interface by visiting http://hadoop-master:50030:

① Are there 2 reduce tasks?

As shown, there are indeed 2 reduce tasks;

② Is the reduce output record count consistent?

The mobile-number file has 20 records, consistent!

The non-mobile-number file has only 1 record, also consistent!

Summary: the value of a Partitioner lies mainly in the following two points:

(1) Producing multiple output files according to business needs;

(2) Running multiple reduce tasks concurrently to improve overall job efficiency.

Resources

(1) Chao Wu, "in Layman's Hadoop": http://115.28.208.222/

(2) Wanchunmei, Xie Zhenglan, "Hadoop Application Development Practical Explanation (Revised version)": http://item.jd.com/11508248.html

(3) Suddenly, "Hadoop diary day17-Partition": http://www.cnblogs.com/sunddenly/p/4009568.html

(4) Three Rob, "How to use Partitioner in Hadoop": http://qindongliang.iteye.com/blog/2043136

Zhou Xurong

Source: http://edisonchou.cnblogs.com/

The copyright of this article belongs to the author and cnblogs. Reprinting is welcome, but this notice must be retained and a clearly visible link to the original must be given on the article page.

