MapReduce three-field sort: a closer look at secondary sort

Source: Internet
Author: User
Tags: comparison, sort, split
The previous post explained how to define a custom key and used a secondary sort (a two-field sort) as its test case, but it did not describe secondary sort itself in detail. This post does. To illustrate a misunderstanding I once had, I built a secondary sort over three fields, referred to here as the "three-field sort".
Test data:
A1,b2,c5
A4,b1,c3
A1,b2,c4
A2,b2,c4
A2,b1,c4
A4,b1,c2
Test purpose: sort the records by the first field; where the first field is equal, sort by the second field in ascending order; where the second field is also equal, sort by the third field. The expected output is:
A1 b2,c4
A1 b2,c5
A2 b1,c4
A2 b2,c4
A4 b1,c2
A4 b1,c3
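
The expected ordering can be checked outside Hadoop. The following is a minimal plain-Java sketch, not the job itself; the class name ThreeFieldSortDemo and the helper sortLines are hypothetical names introduced only for illustration, and a java.util.Comparator stands in for the key's compareTo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ThreeFieldSortDemo {
    // Compare field by field: first, then second, then third (all ascending).
    static final Comparator<String[]> BY_ALL_FIELDS =
            Comparator.<String[], String>comparing(r -> r[0])
                    .thenComparing(r -> r[1])
                    .thenComparing(r -> r[2]);

    // Splits each "f1,f2,f3" line, sorts the rows, and formats them as "f1 f2,f3".
    static List<String> sortLines(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split(","));
        }
        rows.sort(BY_ALL_FIELDS);
        List<String> out = new ArrayList<>();
        for (String[] r : rows) {
            out.add(r[0] + " " + r[1] + "," + r[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "A1,b2,c5", "A4,b1,c3", "A1,b2,c4", "A2,b2,c4", "A2,b1,c4", "A4,b1,c2");
        sortLines(input).forEach(System.out::println);
    }
}
```

The comparison only falls through to the next field when the earlier fields are equal, which is the whole rule stated in the test purpose.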
I use three fields to illustrate a problem that confused me for a long time. It comes from the following description of the secondary sort principle in MapReduce, written by an expert whose posts I studied online:
Secondary sort principle
In the map phase, the InputFormat set with Job.setInputFormatClass splits the input dataset into small chunks (splits), and the InputFormat also provides a RecordReader implementation. This example uses TextInputFormat, whose RecordReader uses the byte offset of each line as the key and the text of the line as the value. That is why the input to a custom map is <LongWritable, Text>. The map method of the custom Mapper is then called once for each <LongWritable, Text> pair. Note that the output must match the output types declared in the custom Mapper, here <IntPair, IntWritable>. The end result is a list of <IntPair, IntWritable> pairs.
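
To make the shape of the <LongWritable, Text> input concrete, here is a small plain-Java sketch. It is not Hadoop's RecordReader; it only assumes UTF-8 text with "\n" line endings, and the names OffsetRecordsDemo and toRecords are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class OffsetRecordsDemo {
    // Returns (byte offset of the line start -> line text), mimicking the shape of
    // the (offset, line) records that a line-oriented record reader hands to map().
    static Map<Long, String> toRecords(String fileContent) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus one byte for the '\n' separator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        toRecords("A1,b2,c5\nA4,b1,c3\nA1,b2,c4\n")
                .forEach((offset, line) -> System.out.println(offset + "\t" + line));
    }
}
```

The offset key is usually ignored by the map method, as it is in the mapper of this post; only the line text matters for building the custom key.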
Sorting, step one (at the time I understood this as the first sort, covering only the first field of the custom type):
At the end of the map phase, the class set with Job.setPartitionerClass partitions this list, and each partition maps to one reducer. Within each partition, the records are sorted with the key comparator class set by Job.setSortComparatorClass. As you can see, this is in itself a secondary sort. If no key comparator class is set with Job.setSortComparatorClass, the compareTo method implemented by the key is used.
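
The "partition first, then sort inside each partition" step can be sketched in plain Java. This is a simulation under stated assumptions, not framework code: partitionFor mirrors the formula used by Hadoop's default HashPartitioner, natural String order stands in for the key comparator, and the names PartitionThenSortDemo, partitionFor and partitionAndSort are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionThenSortDemo {
    // Non-negative hash of the key, modulo the number of reducers.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Step 1: assign every key to a partition. Step 2: sort each partition on its own.
    static Map<Integer, List<String>> partitionAndSort(List<String> keys, int numReducers) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String key : keys) {
            partitions.computeIfAbsent(partitionFor(key, numReducers), p -> new ArrayList<>())
                    .add(key);
        }
        for (List<String> partition : partitions.values()) {
            partition.sort(Comparator.naturalOrder()); // stand-in for the key's compareTo
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("A4,b1,c3", "A1,b2,c5", "A2,b1,c4", "A1,b2,c4");
        partitionAndSort(keys, 2).forEach((p, ks) -> System.out.println(p + " -> " + ks));
    }
}
```

Each partition is ordered internally, but nothing orders one partition relative to another; that is why the tests in this post use a single reducer.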
Sorting, step two (at the time I understood this as the second sort, covering the second field of the custom type):
In the reduce phase, the reducer receives all map outputs assigned to it and again uses the key comparator class set by Job.setSortComparatorClass to sort all the data pairs. It then begins to construct a value iterator for each key. This is where grouping comes in, using the grouping comparator class set with Job.setGroupingComparatorClass (if none is set, keys are grouped with the same comparison used for sorting, so two keys share a group only when all their fields are equal). Whenever the comparator reports two keys as equal, they belong to the same group: their values are placed in one value iterator, and the key passed along with that iterator is the first key of the group. The final step is the reducer's reduce method, whose input is every (key, value iterator) pair. Also note that the input and output types must be consistent with the declaration in the custom Reducer.
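
The grouping step described above can be sketched in plain Java as a walk over already-sorted keys. This is only a simulation: GroupingDemo, group and BY_FIRST_FIELD are hypothetical names, and the "group by first field" rule is an assumed example of a grouping comparator, not something this post's job configures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupingDemo {
    // Assumed grouping rule: two keys belong together when their first field matches.
    static final Comparator<String> BY_FIRST_FIELD =
            Comparator.comparing((String k) -> k.split(",")[0]);

    // Walks sorted keys; a new group starts whenever the grouping comparator
    // reports that the current key differs from the previous one.
    static Map<String, List<String>> group(List<String> sortedKeys, Comparator<String> grouping) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String previous = null;
        List<String> current = null;
        for (String key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key, current); // the first key of a group represents it
            }
            current.add(key.substring(key.indexOf(',') + 1)); // value = remaining fields
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> sorted = Arrays.asList(
                "A1,b2,c4", "A1,b2,c5", "A2,b1,c4", "A2,b2,c4", "A4,b1,c2", "A4,b1,c3");
        group(sorted, BY_FIRST_FIELD).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Because the keys were fully sorted before grouping, the values inside each group arrive already ordered by the remaining fields. Passing a comparator over the whole key instead yields one group per distinct key.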
Core summary:
1. At the end of the map phase the output is partitioned, normally with the class set by Job.setPartitionerClass; if none is set, partitioning uses the key's hashCode() method.
2. Within each partition, records are sorted with the key comparator class set by Job.setSortComparatorClass; if none is set, the compareTo method implemented by the key is used.
3. When the reducer has received all the data sent by the maps, it sorts all the data pairs with the key comparator class set by Job.setSortComparatorClass; if none is set, the compareTo method implemented by the key is used.
4. Immediately afterwards, the grouping comparator class set by Job.setGroupingComparatorClass groups the records, and the values of keys that compare as equal are placed in one iterator. If no grouping comparator is specified, grouping uses the compareTo method implemented by the key.
Here is the mistaken picture I had when I first thought about the data flow of a secondary sort. Suppose several maps send data to the same reducer; I imagined the data flowing like this (it does not actually work this way):
A4,B4 A3,B3
A3,B3 A4,B4
---------------------->
A2,B1 A1,B2
A1,B2 A2,B1
If the reducer receives the last block of data and the Hadoop framework then sorts only on the first field, the second field is left out of order. So why does the final result come out fully ordered?
So I ran the experiment in this post, sorting on three fields. The result is the one shown in this example, and I could only sigh at how careless my learning had been: sorting is simply a complete sort based on compareTo.
So the data going from the map side to the reduce side should look like this, with a complete sort already done:
A4,B4 A3,B3
A3,B3 A4,B4
---------------------->
A2,B1 A1,B1
A1,B2 A2,B2
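
The point can be checked with a small plain-Java sketch of merging two already-sorted map outputs. It is an illustration only: MergeSortedRunsDemo and merge are hypothetical names, and comparing each whole "field,field" line as a String is a simplification that stands in for a field-by-field compareTo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MergeSortedRunsDemo {
    // Standard two-way merge: repeatedly take the smaller head element.
    static List<String> merge(List<String> left, List<String> right, Comparator<String> cmp) {
        List<String> out = new ArrayList<>(left.size() + right.size());
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            if (cmp.compare(left.get(i), right.get(j)) <= 0) {
                out.add(left.get(i++));
            } else {
                out.add(right.get(j++));
            }
        }
        while (i < left.size()) {
            out.add(left.get(i++));
        }
        while (j < right.size()) {
            out.add(right.get(j++));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two sorted runs that share first-field values; the comparison covers both fields.
        List<String> run1 = Arrays.asList("A1,B2", "A2,B1");
        List<String> run2 = Arrays.asList("A1,B1", "A2,B2");
        System.out.println(merge(run1, run2, Comparator.naturalOrder()));
    }
}
```

Since every comparison looks at the whole key, records with the same first field still come out ordered by the later fields, whichever map produced them.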
Now that the principle of secondary sort is clear, let's implement the feature described above:

Define a custom data type: its public int compareTo(ThirdSortClass o) method is the key to the three-field sort.

Custom key:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/**
 * Custom key type. In this example its fields are exactly the fields used for sorting;
 * later examples will add fields that serve other purposes.
 */
public class ThirdSortClass implements WritableComparable<ThirdSortClass> {
	private String first;
	private String second;
	private String third;

	public ThirdSortClass() {}

	public ThirdSortClass(String first, String second, String third) {
		this.first = first;
		this.second = second;
		this.third = third;
	}

	/** Deserialization: rebuilds the custom key from the binary stream. */
	@Override
	public void readFields(DataInput input) throws IOException {
		this.first = input.readUTF();
		this.second = input.readUTF();
		this.third = input.readUTF();
	}

	/** Serialization: writes the custom key as binary. */
	@Override
	public void write(DataOutput output) throws IOException {
		output.writeUTF(first);
		output.writeUTF(second);
		output.writeUTF(third);
	}

	@Override
	public int hashCode() {
		final int prime = 31;
		int result = 1;
		result = prime * result + ((first == null) ? 0 : first.hashCode());
		result = prime * result + ((second == null) ? 0 : second.hashCode());
		result = prime * result + ((third == null) ? 0 : third.hashCode());
		return result;
	}

	@Override
	public boolean equals(Object obj) {
		if (this == obj) return true;
		if (obj == null) return false;
		if (getClass() != obj.getClass()) return false;
		ThirdSortClass other = (ThirdSortClass) obj;
		if (first == null) {
			if (other.first != null) return false;
		} else if (!first.equals(other.first)) return false;
		if (second == null) {
			if (other.second != null) return false;
		} else if (!second.equals(other.second)) return false;
		if (third == null) {
			if (other.third != null) return false;
		} else if (!third.equals(other.third)) return false;
		return true;
	}

	/**
	 * Used for sorting in the map and reduce phases and for grouping in the reduce phase.
	 * This is the key to secondary sort: the sorting behavior is implemented mainly here.
	 */
	@Override
	public int compareTo(ThirdSortClass o) {
		if (!this.first.equals(o.getFirst())) {
			return this.first.compareTo(o.getFirst());
		} else if (!this.second.equals(o.getSecond())) {
			return this.second.compareTo(o.getSecond());
		} else if (!this.third.equals(o.getThird())) {
			return this.third.compareTo(o.getThird());
		}
		return 0;
	}

	public String getFirst() { return first; }
	public void setFirst(String first) { this.first = first; }
	public String getSecond() { return second; }
	public void setSecond(String second) { this.second = second; }
	public String getThird() { return third; }
	public void setThird(String third) { this.third = third; }
}
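
The write/readFields pair can be exercised without a cluster. Below is a minimal plain-Java sketch of a serialization round trip using java.io streams; Key3 is a hypothetical stand-in that only copies the write/readFields shape of the key above (it does not implement WritableComparable), and RoundTripDemo and roundTrip are likewise illustrative names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class RoundTripDemo {
    // Hypothetical stand-in with the same three String fields as the custom key.
    static class Key3 {
        String first;
        String second;
        String third;

        void write(DataOutput out) throws IOException {
            out.writeUTF(first);
            out.writeUTF(second);
            out.writeUTF(third);
        }

        void readFields(DataInput in) throws IOException {
            // Fields must be read in exactly the order they were written.
            first = in.readUTF();
            second = in.readUTF();
            third = in.readUTF();
        }
    }

    // Serializes a key to bytes, reads it back into a fresh object, returns its fields.
    static String[] roundTrip(String a, String b, String c) {
        try {
            Key3 original = new Key3();
            original.first = a;
            original.second = b;
            original.third = c;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            Key3 copy = new Key3();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return new String[] {copy.first, copy.second, copy.third};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip("A1", "b2", "c5")));
    }
}
```

Note that writeUTF throws a NullPointerException for a null String, so a key serialized this way needs all three fields set before it is written.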

Map phase:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ThirdMapper extends Mapper<LongWritable, Text, ThirdSortClass, Text> {
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String line = value.toString().trim();
		if (line.length() > 0) {
			String[] arr = line.split(",");
			if (arr.length == 3) {
				// Key: all three fields (drives the sort). Value: the second and third fields.
				context.write(new ThirdSortClass(arr[0], arr[1], arr[2]), new Text(arr[1] + "," + arr[2]));
			}
		}
	}
}
Reduce phase:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ThirdSortReducer extends Reducer<ThirdSortClass, Text, Text, Text> {
	private Text okey = new Text();
	@Override
	protected void reduce(ThirdSortClass key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
//		for (Text val : values) {
//			context.write(new Text(key.getFirst()), val);
//		}
		okey.set(key.getFirst());
		context.write(okey, values.iterator().next());
	}
}
Driver (main function):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMain {
	public static void main(String[] args) throws Exception {
		Configuration configuration = new Configuration();
		// This constructor is deprecated in Hadoop 2.x, where Job.getInstance(configuration, name) is preferred.
		Job job = new Job(configuration, "third-sort-job");
		job.setJarByClass(JobMain.class);
		job.setMapperClass(ThirdMapper.class);
		job.setMapOutputKeyClass(ThirdSortClass.class);
		job.setMapOutputValueClass(Text.class);
		/**
		 * No Partitioner is used here: so far every test has run with a single reduce.
		 * A later post on global ordering will cover how to write jobs with multiple reduces,
		 * and a custom Partitioner example will appear when summing odd and even lines.
		 */
		// job.setPartitionerClass(ThirdSortPatitioner.class);
		job.setReducerClass(ThirdSortReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		/**
		 * To keep this post short, setGroupingComparatorClass() is not explained here either.
		 * Using a grouping comparator changes how the reduce is written; a dedicated post
		 * will cover when to use it and compare it with MapReduce's default grouping.
		 */
		// job.setGroupingComparatorClass(ThirdSortGroupingComparator.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		Path outputDir = new Path(args[1]);
		FileSystem fs = FileSystem.get(configuration);
		// Delete an existing output directory so the job can be rerun.
		if (fs.exists(outputDir)) {
			fs.delete(outputDir, true);
		}
		FileOutputFormat.setOutputPath(job, outputDir);
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
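Run result: (the screenshot from the original post was not preserved; per the experiment described above, the output is the expected result listed under the test purpose.)
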
Summary:

This post leaves one issue open: the whole sorting process relies on the key's compareTo() method, but that is not possible in every case, and the more standard approach is to implement a separate comparator. I kept it this way for brevity, to highlight a single point without letting other factors get in the way of the explanation. Where compareTo() is not enough, later posts will cover each technique in turn. The next post explains how to customize the GroupingComparatorClass and the SortComparatorClass.




