One: Background
In the MapReduce model, the reduce function mostly performs aggregation: counting by category, summing, taking maximum or minimum values, and so on. For these operations we can apply a combiner to the map output, which reduces the amount of data sent over the network and lightens the load on the reduce tasks. The combiner runs on each node and only affects that node's local map output; its input is the local map output. In many cases the combiner's logic is the same as the reducer's, so the two can share the same reducer class.
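In the driver code this sharing is literal: the same class is registered for both roles, as the full word-count program in section Three does.

    // The same Reducer implementation is used twice: as the combiner, which pre-sums
    // counts on each map task's local output, and as the final reducer.
    job.setCombinerClass(MyReducer.class);
    job.setReducerClass(MyReducer.class);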
Two: When the combiner runs
(1): If the job has a combiner set and the number of spill files reaches min.num.spills.for.combine (default 3), the combiner is executed before the merge (see the configuration sketch after this list).
(2): If the merge has already started but the number of spill files has not reached that threshold, the combiner may instead run after the merge.
(3): The combiner may also not run at all; the framework takes the cluster's load into account when deciding whether to invoke it.
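A minimal sketch of wiring this up in the driver; min.num.spills.for.combine is the property name read by the classic (Hadoop 1.x) map task, and newer releases expose the same threshold as mapreduce.map.combine.minspills. MyReducer here is the reducer class defined in the program in section Three.

    Configuration conf = new Configuration();
    // Run the combiner during spill/merge once at least this many spill files exist (default 3).
    conf.setInt("min.num.spills.for.combine", 3);

    Job job = new Job(conf, "wordcount-with-combiner");
    job.setCombinerClass(MyReducer.class);   // combiner logic shared with the reducer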
Three: Program code
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountTest {

    // Input path
    private static final String INPUT_PATH = "hdfs://liaozhongmin:9000/hello";
    // Output path
    private static final String OUT_PATH = "hdfs://liaozhongmin:9000/out";

    public static void main(String[] args) {
        try {
            // Create the configuration
            Configuration conf = new Configuration();
            /**********************************************/
            // Compress the map-side output
            // conf.setBoolean("mapred.compress.map.output", true);
            // Set the compression codec used for the map-side output
            // conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
            // Compress the reduce-side output
            // conf.setBoolean("mapred.output.compress", true);
            // Set the compression codec used for the reduce-side output
            // conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
            // Add config files (so the job can be configured in code without changing the cluster by hand)
            /*
             * conf.addResource("classpath://hadoop/core-site.xml");
             * conf.addResource("classpath://hadoop/hdfs-site.xml");
             */

            // Create the file system
            FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
            // If the output directory exists, delete it
            if (fileSystem.exists(new Path(OUT_PATH))) {
                fileSystem.delete(new Path(OUT_PATH), true);
            }

            // Create the job
            Job job = new Job(conf, WordCountTest.class.getName());

            // 1.1 Set the input directory and the input format class
            FileInputFormat.setInputPaths(job, INPUT_PATH);
            job.setInputFormatClass(TextInputFormat.class);

            // 1.2 Set the custom mapper class and the key/value types of the map output
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);

            // 1.3 Set the partitioner and the number of reduce tasks
            //     (one partition here, so one reduce task)
            job.setPartitionerClass(HashPartitioner.class);
            job.setNumReduceTasks(1);

            // 1.4 Sorting and grouping
            // 1.5 Set the combiner (it can share the reducer class)
            job.setCombinerClass(MyReducer.class);

            // 2.1 Shuffle copies the data from the map side to the reduce side
            // 2.2 Set the reducer class and the key/value types of its output
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            // 2.3 Set the output path and the output format class
            FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
            job.setOutputFormatClass(TextOutputFormat.class);

            // Submit the job and exit
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        // LongWritable constant used as the value of the map output
        private final LongWritable oneTime = new LongWritable(1);
        // Text object used as the key of the map output
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Split each line on tab (\t)
            String[] splits = value.toString().split("\t");
            // Iterate over the words and emit each one
            for (String str : splits) {
                // Set the word
                word.set(str);
                // Write the result
                context.write(word, oneTime);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        // LongWritable used as the value of the reduce output
        private final LongWritable result = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            // Iterate over the values, counting the occurrences of each word
            for (LongWritable s : values) {
                sum += s.get();
            }
            // Set the result
            result.set(sum);
            // Write the result
            context.write(key, result);
        }
    }
}
Data used by the program (the contents of the input file hdfs://liaozhongmin:9000/hello; the words in each line are tab-separated):
hello   you
hello   me
you     me      love
Program Run Process:
(1): The map input consists of 3 records: <0, hello you>, <10, hello me>, <19, you me love>
(2): The map output consists of 7 records: <hello,1>, <you,1>, <hello,1>, <me,1>, <you,1>, <me,1>, <love,1>
(3): After sorting and grouping there are 4 groups: <hello,{1,1}>, <love,{1}>, <me,{1,1}>, <you,{1,1}>
(4): 7 records enter the combiner; after the combiner there are 4 records: <hello,2>, <love,1>, <me,2>, <you,2>
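Since there is only one reduce task, the reducer receives those 4 records and simply re-sums each group, so the file written to /out should contain (key and value separated by a tab, TextOutputFormat's default):
hello   2
love    1
me      2
you     2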
Four: Restrictions on the use of the combiner
Not every job can use a combiner. Combiners are suitable for aggregations such as summing and taking maximum or minimum values, but not for computing averages. If a combiner is applied in an averaging program, each map task's output is averaged locally first, and the reduce side then averages those per-map averages; in general that is not the same as the true average of all the values.
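As a hypothetical illustration: suppose one map task emits the values 1, 2, 3 and another emits 4, 5. An averaging combiner would produce 2 and 4.5, and averaging those on the reduce side gives (2 + 4.5) / 2 = 3.25, while the true average is (1 + 2 + 3 + 4 + 5) / 5 = 3. An average can still be computed correctly if the combiner emits partial sums and counts instead of partial averages.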