Hadoop --- MapReduce sorting, secondary sort, and total order

Source: Internet
Author: User
Tags: comparable, serialization

Learning to write your own sort and a secondary sort requires the following background: 1. Hadoop's serialization format: Writable; 2. Hadoop's key-sorting logic; 3. total order; 4. how to define your own Writable type; 5. how to implement a secondary sort.

1. Hadoop's serialization format: Writable

The first thing to understand before writing a custom MapReduce sort is the Writable family of interfaces and classes, which form Hadoop's own serialization format. Of particular interest is the subinterface WritableComparable<T>, which extends both Writable and Comparable<T>. A class implementing WritableComparable<T> is therefore not only serializable but also comparable. Comparability matters in MapReduce because there is a key-based sort phase, so any type used as a key must be Comparable<T>.

Besides WritableComparable there is another interface, RawComparator. The difference between the two is that WritableComparable requires deserializing the byte stream into objects before comparing them, whereas RawComparator compares the serialized bytes directly, without deserializing, which eliminates the overhead of creating new objects.

2. Hadoop's key-sorting logic

Hadoop sorts keys using the WritableComparable<T> implementations of its basic and other data types (the compareTo definitions of the relevant types are described in Hadoop: The Definitive Guide, 2nd edition, p. 90). The comparator for keys is chosen by the following rules:

1. If JobConf's setOutputKeyComparatorClass() was called, use the class set in mapred.output.key.comparator.class.
2. Otherwise, use the comparator already registered for the key's type.
3. Otherwise, fall back to the compareTo() of the WritableComparable interface. For example, IntWritable implements it as follows:
    public int compareTo(Object o) {
      int thisValue = this.value;
      int thatValue = ((IntWritable) o).value;
      return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
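The deserialization cost that RawComparator avoids can be illustrated without Hadoop at all. The sketch below (plain Java, my own helper names, no Hadoop dependency) compares two serialized ints directly on their big-endian bytes, the way a raw comparator for IntWritable would, instead of reconstructing objects first:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawIntCompare {

    // Serialize an int the way IntWritable.write() would: 4 big-endian bytes.
    static byte[] serialize(int v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeInt(v);
            return bytes.toByteArray();
        } catch (IOException e) {   // cannot happen for an in-memory stream
            throw new RuntimeException(e);
        }
    }

    // Compare two serialized ints byte by byte, with no deserialization.
    static int compareRaw(byte[] a, byte[] b) {
        for (int i = 0; i < 4; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff;
            if (i == 0) { x ^= 0x80; y ^= 0x80; }  // flip sign bit so byte order matches signed int order
            if (x != y) return x < y ? -1 : 1;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compareRaw(serialize(-5), serialize(3)));  // prints -1
        System.out.println(compareRaw(serialize(7), serialize(7)));   // prints 0
    }
}
```

The sign-bit flip is needed because big-endian byte order agrees with unsigned order only; flipping the top bit maps negative values below positive ones, matching compareTo.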
You can modify compareTo to implement whatever comparison you want. Although compareTo is the mechanism behind key sorting, when using Hadoop's built-in data types we don't need to think about how the sort happens: the framework calls compareTo automatically. But this sort is confined to a single map or reduce task; nothing orders keys across map tasks or across reduce tasks. That is usually fine, but some scenarios need more, for example total order.

3. Total order

The phase to focus on here is partitioning. One partition is created per reduce task, and each map output record is assigned to a specific partition. A partition may hold data for many keys, but all the data for one key goes to exactly one partition.

If there is only one reduce task, there is only one partition: all map output flows through that single partition to the single reducer, which can sort everything with compareTo (or any other comparison algorithm) and thereby produce a total order. But that strips MapReduce of its distributed-computing halo. The general idea of a scalable total order is instead to make the partitions themselves ordered: as long as the maximum key of partition 1 is less than the minimum key of partition 2, and so on, the concatenated sorted partitions form a total order. Even then one problem remains: the partitions may be unevenly loaded, so some may process far more data than others.

The core steps of total ordering are therefore sampling and partitioning. Sampling comes first, to keep the partitions reasonably even:

1) Randomly sample Math.min(10, splits.length) of the input splits, taking 10,000 samples per split, for 100,000 samples in total.
2) Sort the 100,000 samples and, for n reducers, take n-1 samples at even intervals.
3) Write these n-1 samples to a partition file (_partition.lst, a SequenceFile), with each sample as the key and a null value.
4) Put the partition file into the DistributedCache. A detailed description of total ordering can be found at http://www.iteye.com/topic/709986.

4. How to define your own Writable type

The scenario for defining your own Writable type is simple: the data types that ship with Hadoop sometimes fall short either functionally or in performance. Hadoop is still evolving and cannot cover every situation, but it provides a framework for us to implement the functionality we want ourselves. Defining your own Writable type requires:

a. Overloading the constructors
b. Implementing the set and get methods
c. Implementing the interface methods: write(), readFields(), compareTo()
d. (Optional, as for any Java object) Overriding java.lang.Object's hashCode(), equals(), and toString(). Note that the default HashPartitioner selects the partition according to hashCode(), so if you use a custom type as the key without a custom Partitioner, hashCode() must be implemented.
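Stepping back to section 3 for a moment: the split-point selection and partition lookup described there can be sketched without Hadoop. The code below is a simplified stand-in for the sampler and for what TotalOrderPartitioner does with the samples in _partition.lst, not the actual Hadoop implementation:

```java
import java.util.Arrays;

public class TotalOrderSketch {

    // From the collected samples, pick n-1 evenly spaced split points
    // for n reducers (step 2 of the sampling procedure).
    static int[] splitPoints(int[] samples, int numReducers) {
        int[] sorted = samples.clone();
        Arrays.sort(sorted);
        int[] splits = new int[numReducers - 1];
        for (int i = 1; i < numReducers; i++) {
            splits[i - 1] = sorted[i * sorted.length / numReducers];
        }
        return splits;
    }

    // Assign a key to a partition: keys below splits[0] go to partition 0,
    // keys in [splits[0], splits[1]) to partition 1, and so on. Hadoop's
    // TotalOrderPartitioner does the equivalent lookup over _partition.lst.
    static int partition(int key, int[] splits) {
        int p = Arrays.binarySearch(splits, key);
        return p >= 0 ? p + 1 : -(p + 1);
    }

    public static void main(String[] args) {
        int[] samples = {5, 93, 12, 41, 7, 66, 28, 80, 3, 54};
        int[] splits = splitPoints(samples, 3);   // 2 split points for 3 reducers
        System.out.println(Arrays.toString(splits));  // prints [12, 54]
        System.out.println(partition(10, splits));    // prints 0
    }
}
```

Because every key below 12 lands in partition 0 and every key of 54 or above lands in partition 2, the sorted outputs of the three reducers concatenate into a totally ordered result.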
For a concrete example, refer to Hadoop's basic type IntWritable:

    public class IntWritable implements WritableComparable {
      private int value;

      public IntWritable() {}

      public IntWritable(int value) { set(value); }

      /** Set the value of this IntWritable. */
      public void set(int value) { this.value = value; }

      /** Return the value of this IntWritable. */
      public int get() { return value; }

      public void readFields(DataInput in) throws IOException {
        value = in.readInt();
      }

      public void write(DataOutput out) throws IOException {
        out.writeInt(value);
      }

      /** Returns true iff <code>o</code> is an IntWritable with the same value. */
      public boolean equals(Object o) {
        if (!(o instanceof IntWritable))
          return false;
        IntWritable other = (IntWritable) o;
        return this.value == other.value;
      }

      public int hashCode() {
        return value;
      }

      /** Compares two IntWritables. */
      public int compareTo(Object o) {
        int thisValue = this.value;
        int thatValue = ((IntWritable) o).value;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
      }

      public String toString() {
        return Integer.toString(value);
      }
    }
5. How to implement a secondary sort

A secondary sort involves the following pieces:

a. Create the key data type; the key must contain both elements being sorted.
b. setPartitionerClass(Class<? extends Partitioner> theClass)
   (after Hadoop 0.20.0 the method is setPartitionerClass)
c. setOutputKeyComparatorClass(Class<? extends RawComparator> theClass)
   (after Hadoop 0.20.0 the method is setSortComparatorClass)
d. setOutputValueGroupingComparator(Class<? extends RawComparator> theClass)
   (after Hadoop 0.20.0 the method is setGroupingComparatorClass)

Using the example org.apache.hadoop.examples.SecondarySort that ships with Hadoop, let's look at exactly how the secondary sort is implemented.

SecondarySort implements the inner classes IntPair, FirstPartitioner, FirstGroupingComparator, MapClass, and Reduce, which are then wired together in the main function. Let's start with what differs from ordinary MapReduce code in the main function. The difference is these two extra settings:

    job.setPartitionerClass(FirstPartitioner.class);

sets the custom partitioning, here invoking our custom inner class FirstPartitioner.

    job.setGroupingComparatorClass(FirstGroupingComparator.class);

sets which values enter which key's iterator, here invoking the custom inner class FirstGroupingComparator.

The concrete logic is as follows:

a. Define a type IntPair as the key. IntPair holds two fields, first and second; SecondarySort sorts by first and then by second.

b. Define the partitioner class FirstPartitioner for the first stage of the sort. FirstPartitioner decides the partition from the first field of the key, so all keys sharing the same first value are sent to the same reduce task, where the second stage of the sort then takes place.

c. The key comparison class, which performs the second-stage ordering of the key, would be a comparator inheriting from WritableComparator, installed via setSortComparatorClass(). The example provides no code for this step, yet the ordering still happens: because of Hadoop's key-sorting rules (see section 2, the key-sorting logic of Hadoop), the compareTo() function we defined in IntPair is used, so calling setSortComparatorClass() is unnecessary.

d. Define the grouping class FirstGroupingComparator, which guarantees that all values whose keys share the same first field enter the same key's value iterator.
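The interplay of these pieces can be simulated without Hadoop. The sketch below uses simplified stand-ins for IntPair, FirstPartitioner, and FirstGroupingComparator (an int[2] in place of IntPair, plain methods in place of the comparator classes) to show how the shuffle sorts composite keys and how grouping decides which values share one reduce() call:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SecondarySortSketch {

    // Stand-in for FirstPartitioner: partition on the first field only,
    // mirroring the arithmetic used in Hadoop's example FirstPartitioner.
    static int partition(int[] key, int numReduces) {
        return Math.abs(key[0] * 127) % numReduces;
    }

    // Stand-in for IntPair.compareTo(): sort by first, then by second.
    static int compareKeys(int[] a, int[] b) {
        if (a[0] != b[0]) return Integer.compare(a[0], b[0]);
        return Integer.compare(a[1], b[1]);
    }

    // Stand-in for FirstGroupingComparator: keys with the same first
    // field belong to the same reduce() call's value iterator.
    static boolean sameGroup(int[] a, int[] b) {
        return a[0] == b[0];
    }

    // Sort the keys as the shuffle would, then group consecutive keys
    // the way the grouping comparator presents them to the reducer.
    static List<List<int[]>> shuffleAndGroup(int[][] keys) {
        int[][] sorted = keys.clone();
        Arrays.sort(sorted, SecondarySortSketch::compareKeys);
        List<List<int[]>> groups = new ArrayList<>();
        for (int[] k : sorted) {
            if (groups.isEmpty() || !sameGroup(groups.get(groups.size() - 1).get(0), k)) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k);
        }
        return groups;
    }

    public static void main(String[] args) {
        int[][] keys = {{2, 9}, {1, 4}, {2, 3}, {1, 8}};
        for (List<int[]> group : shuffleAndGroup(keys)) {
            StringBuilder sb = new StringBuilder("first=" + group.get(0)[0] + " seconds=");
            for (int[] k : group) sb.append(k[1]).append(' ');
            System.out.println(sb.toString().trim());
        }
    }
}
```

Running this prints one line per reduce() call: "first=1 seconds=4 8" and "first=2 seconds=3 9". Each group's values arrive already sorted by second, which is the whole point of the secondary sort.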
