Big Data Learning, Part 9: Combiner, Partitioner, Shuffle, and MapReduce Sorting and Grouping


1. Combiner

The Combiner is a MapReduce optimization. Each map task can produce a large amount of local output; the Combiner's job is to merge the map-side output first, reducing the volume of data transferred between the map and reduce nodes and thereby improving network I/O performance. A combiner may be set only when the reduce operation is associative, so that partial merging does not change the final result.

The Combiner plays two roles:

(1) The Combiner aggregates the local keys of the map output, iterating over the sorted values of each key. In this form the combine step preserves the map output types:

Map: (K1, V1) → list(K2, V2)
Combine: (K2, list(V2)) → list(K2, V2)
Reduce: (K2, list(V2)) → list(K3, V3)

(2) The Combiner can also act as a local reducer (it is essentially a reduce). For example, in WordCount, or in a program that finds the maximum value, the combiner and the reducer are exactly the same:

Map: (K1, V1) → list(K2, V2)
Combine: (K2, list(V2)) → list(K3, V3)
Reduce: (K3, list(V3)) → list(K4, V4)

With a Combiner, each map's output is first aggregated locally, which increases speed. In the WordCount example that ships with Hadoop, the value is a count to be summed, so the summation can already be done at the end of each map task instead of waiting for all maps to finish before summing everything in reduce.

In a real Hadoop cluster, MapReduce runs across many hosts. If we add the combine step, each host first combines its own local map output before the cluster-wide reduce runs. This saves the reducers considerable time and speeds up the whole MapReduce job. A minimal driver sketch is shown below.
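For illustration, here is a minimal driver sketch that wires a combiner into WordCount. It assumes Hadoop's bundled example classes TokenizerMapper and IntSumReducer (from the hadoop-mapreduce-examples jar) and placeholder HDFS paths; treat it as a sketch under those assumptions, not a drop-in program.

package mapreduce01;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount.IntSumReducer;
import org.apache.hadoop.examples.WordCount.TokenizerMapper;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount-with-combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local reduce on each map host
        job.setReducerClass(IntSumReducer.class);  // final reduce across the cluster
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("hdfs://master:9000/input"));       // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("hdfs://master:9000/output/wc")); // placeholder path
        job.waitForCompletion(true);
    }
}

Because summing counts is associative and commutative, reusing the reducer class as the combiner is safe here; the only driver change is the setCombinerClass() call.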

2. Partitioner

Step 1.3 of the map phase is the partition operation: the process of deciding which key goes to which reducer, as prescribed by the Partitioner.

The framework applies a partition function to the intermediate keys to split the data, and each partition then feeds the subsequent reduce task. The default partition function uses hashing (typically hash(key) mod R, where R is the number of reduce tasks), which tends to produce well-balanced partitions; Hadoop's default implementation is sketched below.
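For reference, this is essentially what Hadoop's default HashPartitioner does; a minimal equivalent looks like this (the bitmask keeps the hash non-negative so the modulo result is a valid partition index):

import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent of Hadoop's default HashPartitioner: hash(key) mod numReduceTasks.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}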

A custom Partitioner example:

package mapreduce01;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FenQu {

    static String INPUT_PATH = "hdfs://master:9000/test";
    static String OUTPUT_PATH = "hdfs://master:9000/output/fenqu";

    static class MyMapper extends Mapper<Object, Object, IntWritable, NullWritable> {
        IntWritable output_key = new IntWritable();
        NullWritable output_value = NullWritable.get();

        protected void map(Object key, Object value, Context context)
                throws IOException, InterruptedException {
            // Each input line is a single non-negative integer.
            int val = Integer.parseUnsignedInt(value.toString().trim());
            output_key.set(val);
            context.write(output_key, output_value);
        }
    }

    static class LiuPartitioner extends Partitioner<IntWritable, NullWritable> {
        @Override
        public int getPartition(IntWritable key, NullWritable value, int numPartitions) {
            int num = key.get();
            // Keys greater than 100 go to reducer 0; all others go to reducer 1.
            if (num > 100) return 0;
            else return 1;
        }
    }

    static class MyReduce extends Reducer<IntWritable, NullWritable, IntWritable, IntWritable> {
        IntWritable output_key = new IntWritable();
        int num = 1;

        protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Emit a running sequence number followed by the key itself.
            output_key.set(num++);
            context.write(output_key, key);
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputPath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        job.setNumReduceTasks(2);
        job.setPartitionerClass(LiuPartitioner.class);

        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        job.waitForCompletion(true);
    }
}

The Partitioner serves two main purposes:

(1) It splits the output into multiple files/partitions according to business needs.
(2) It allows multiple reduce tasks to run concurrently, improving overall job efficiency.

3. Shuffle process

Of the three steps in the reduce phase, step 2.1 is the shuffle operation ("shuffle" as in shuffling cards).

What shuffle is: the output of the many map tasks is copied over the network to the different reduce-task nodes according to its partition (Partition). This process is called shuffle.

On the map side:

1. The map side starts from the InputSplit. Each InputSplit covers data stored on DataNodes and is assigned one Mapper task; as the Mapper runs it produces <K2, V2> outputs, which are first stored in memory. Every map task has a circular memory buffer for its output, 100 MB by default (the io.sort.mb property). Once the buffer reaches the spill threshold of 0.80 (io.sort.spill.percent), a background thread spills (writes) the contents to a new spill file on the local Linux disk, under the directory given by mapred.local.dir. A configuration sketch follows.
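As an illustration, those thresholds can be tuned from the job driver. A minimal fragment, assuming the classic (pre-YARN) property names quoted above; newer Hadoop versions call them mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent:

import org.apache.hadoop.conf.Configuration;

// Driver fragment: tune the map-side ring buffer and spill threshold,
// then pass conf to Job.getInstance(conf) when building the job.
Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 200);                // ring buffer size in MB (default 100)
conf.setFloat("io.sort.spill.percent", 0.80f); // spill when the buffer is 80% full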

2. Before writing to disk, the thread performs partition, sort, and combine: the data is first divided by partition, then sorted within each partition, and, if a combiner is configured, the sorted data is combined. After the last record has been processed, all spill files are merged into a single partitioned, sorted file.

3. Finally, the data on disk is served to the reducers. If a map's output has three partitions, one partition is sent to one reduce task and the remaining two partitions go to other reduce tasks; likewise, the rest of each reduce task's input comes from the map outputs on other nodes.

On the reduce side:

1. Copy phase: the reducer fetches the map output partitions over HTTP.
A reduce task may fetch data from n map tasks, which finish at different times. When one map finishes, its TaskTracker learns of it and reports the completion to the JobTracker; the reducer periodically polls the JobTracker for this information and then pulls that map's output. By default the reduce side runs 5 data-copier threads that fetch map output in parallel (see the fragment below).
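The number of copier threads is likewise configurable; a fragment in the same style as the sketch above, assuming the classic property name (newer Hadoop versions call it mapreduce.reduce.shuffle.parallelcopies):

// Driver fragment: raise the reduce-side parallel copier thread count (default 5).
conf.setInt("mapred.reduce.parallel.copies", 10);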

2. Merge phase: if multiple disk files have formed, they are merged.
Data copied from the map side is first written into a memory buffer on the reduce side; once the buffer passes a threshold, the data is spilled to disk, again going through partition, combine, and sort steps along the way. If multiple disk files have formed they are merged, and the result of the final merge is fed directly into reduce rather than being written back to disk.

3. Reduce phase: the merged result is passed to the reduce function as its input.

4. Sorting

Step 4.1, the fourth step, sorts and groups the data within each partition; by default the data is sorted and grouped by key.

The custom type MyGroupTest implements the WritableComparable interface, which declares a compareTo() method. This method is called whenever keys are compared, so by overriding it with our own comparison rule we get exactly the sort order we want.

Custom sort:

GroupSort.java

package mapreduce01;

import java.io.IOException;

import mapreduce01.FenQu.LiuPartitioner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupSort {

    static String INPUT_PATH = "hdfs://master:9000/input/f.txt";
    static String OUTPUT_PATH = "hdfs://master:9000/output/groupsort";

    static class MyMapper extends Mapper<Object, Object, MyGroupTest, NullWritable> {
        NullWritable output_value = NullWritable.get();

        protected void map(Object key, Object value, Context context)
                throws IOException, InterruptedException {
            // Each input line holds two comma-separated numbers.
            String[] tokens = value.toString().split(",", 2);
            MyGroupTest output_key =
                    new MyGroupTest(Long.parseLong(tokens[0]), Long.parseLong(tokens[1]));
            context.write(output_key, output_value);
        }
    }

    static class MyReduce extends Reducer<MyGroupTest, NullWritable, LongWritable, LongWritable> {
        LongWritable output_key = new LongWritable();
        LongWritable output_value = new LongWritable();

        protected void reduce(MyGroupTest key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Keys arrive already sorted by MyGroupTest.compareTo();
            // just unpack the two columns.
            output_key.set(key.getFirstNum());
            output_value.set(key.getSecondNum());
            context.write(output_key, output_value);
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputPath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        // With a single reduce task the partitioner is effectively bypassed
        // (and LiuPartitioner's IntWritable key type would not match MyGroupTest anyway).
        job.setNumReduceTasks(1);
        job.setPartitionerClass(LiuPartitioner.class);

        job.setMapOutputKeyClass(MyGroupTest.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);

        job.waitForCompletion(true);
    }
}

MyGroupTest.java

package mapreduce01;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MyGroupTest implements WritableComparable<MyGroupTest> {

    long firstNum;
    long secondNum;

    public MyGroupTest() {}

    public MyGroupTest(long first, long second) {
        firstNum = first;
        secondNum = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(firstNum);
        out.writeLong(secondNum);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        firstNum = in.readLong();
        secondNum = in.readLong();
    }

    /* The compareTo() method below is called when keys are sorted. */
    @Override
    public int compareTo(MyGroupTest anotherKey) {
        // Compare by the first column; break ties with the second column.
        // Long.compare avoids the overflow that a plain subtraction cast to int could cause.
        int cmp = Long.compare(firstNum, anotherKey.firstNum);
        if (cmp != 0) {
            return cmp;
        }
        return Long.compare(secondNum, anotherKey.secondNum);
    }

    public long getFirstNum() { return firstNum; }

    public long getSecondNum() { return secondNum; }
}
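Since this section covers grouping as well as sorting: with a composite key like MyGroupTest, which records reach the same reduce() call is usually controlled by a grouping comparator. The class below is a hedged companion sketch, not part of the original code; it groups keys by firstNum alone and would be registered in the driver with job.setGroupingComparatorClass(MyGroupComparator.class).

package mapreduce01;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical companion class: group keys by firstNum only, so that one
// reduce() call receives every record sharing the same first column.
public class MyGroupComparator extends WritableComparator {

    protected MyGroupComparator() {
        super(MyGroupTest.class, true); // true: instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        MyGroupTest x = (MyGroupTest) a;
        MyGroupTest y = (MyGroupTest) b;
        return Long.compare(x.getFirstNum(), y.getFirstNum());
    }
}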
