Hadoop sharding, grouping, and sorting

Last Update:2018-01-28 Source: Internet

Author: User

First, a key in Hadoop must be sortable: either the key class itself implements the WritableComparable interface, or a separate comparator class is able to sort it. If the key does not implement WritableComparable and sorting is instead provided by another class (one that implements the RawComparator interface), you need to register that sort class for the key explicitly:
JobConf.setOutputKeyComparatorClass(Xxx.class);
(In the newer org.apache.hadoop.mapreduce API, the equivalent call is Job.setSortComparatorClass(Xxx.class).)
When the map output is written, it is partitioned (sharded), and the keys are sorted within each partition. Partitioning decides which reducer each record is sent to. Sorting on the map side means the reduce phase receives runs that are already ordered, so its merge sort is faster.
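The partitioning step can be illustrated without a cluster. The sketch below is plain Java, not Hadoop code: it reproduces the formula used by Hadoop's default HashPartitioner (clear the sign bit of the key's hash, then take the remainder by the number of reduce tasks). The class name PartitionSketch and the reducer count of 3 are assumptions made for illustration.

```java
class PartitionSketch {
    // Same formula as Hadoop's default HashPartitioner: clear the sign bit so the
    // result is never negative, then take the remainder by the reducer count.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3; // hypothetical number of reduce tasks
        for (int cityId = 1; cityId <= 3; cityId++) {
            // Integer.hashCode() is the int value itself, so the mapping is easy to follow
            System.out.println("cityId " + cityId + " -> reducer "
                    + partitionFor(Integer.valueOf(cityId), reducers));
        }
    }
}
```

Every record whose key has the same hash lands on the same reducer, which is why a custom key used for a join must hash (or be partitioned) on the join field only.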
After the reduce side has collected the output from the map nodes, it sorts by key again. It uses the separately provided sort class if one is set; otherwise the key must implement WritableComparable, or the cast fails with an exception.
In the reduce method we write, the values parameter is an iterator: the framework collects the V values of all k-v pairs whose keys are "the same" into one iterator. The key parameter of the reduce method is the K of the first k-v pair in that group. Whether two keys count as the same is decided by the business logic, unlike the absolute comparison of numbers such as 1=1. This process is called grouping. All k-v pairs in the same group are processed by a single call to the reduce method.
Grouping needs a grouping comparator to decide which k-v pairs belong together, and it works by comparing keys. If a separate grouping comparator is provided, it is used; otherwise the default is to compare keys with the sort comparison (the key's own compareTo method or the separate sort class), and keys that compare as equal are placed in one group. Sometimes keys differ but you still want them in one group; in that case you need to provide a separate grouping comparator, set with the JobConf.setOutputValueGroupingComparator() method.
When different keys fall into the same group, the key passed to our reduce method is the K of the first k-v pair, so the order of the keys matters. By sorting, the k-v pair you need can be placed first, which is useful for certain purposes, for example a join query.
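The interplay between the sort comparison and the grouping comparison can be simulated in plain Java. This is only a sketch of the idea, not the framework's implementation: GroupingSketch, Group, sortAndGroup and the "prefix:suffix" string keys are all hypothetical names and data chosen for illustration. Records are sorted with one comparator, and a new group starts whenever a second comparator reports that the key differs from the previous one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {
    /** Stand-in for one reduce() call: the key passed in plus every value in the group. */
    static final class Group {
        final String key;
        final List<String> values = new ArrayList<>();
        Group(String key) { this.key = key; }
    }

    // Sort records (key at index 0, value at index 1) with sortCmp, then open a new
    // group whenever groupCmp reports that the key differs from the previous key.
    static List<Group> sortAndGroup(List<String[]> records,
                                    Comparator<String> sortCmp,
                                    Comparator<String> groupCmp) {
        List<String[]> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> sortCmp.compare(a[0], b[0]));
        List<Group> groups = new ArrayList<>();
        String prevKey = null;
        for (String[] kv : sorted) {
            if (prevKey == null || groupCmp.compare(prevKey, kv[0]) != 0) {
                groups.add(new Group(kv[0])); // the first key of the group is the one reduce() receives
            }
            groups.get(groups.size() - 1).values.add(kv[1]);
            prevKey = kv[0];
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"b:1", "x"});
        records.add(new String[] {"a:2", "y"});
        records.add(new String[] {"a:1", "z"});
        Comparator<String> fullKey = Comparator.naturalOrder();
        // Group on the part before ':' only, so "a:1" and "a:2" share a group
        Comparator<String> byPrefix = Comparator.comparing(k -> k.substring(0, k.indexOf(':')));
        for (Group g : sortAndGroup(records, fullKey, byPrefix)) {
            System.out.println(g.key + " -> " + g.values);
        }
    }
}
```

With these inputs the keys "a:1" and "a:2" differ under the sort comparison but fall into one group, and the group's key is "a:1" because it sorts first.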

For example, there are two files, City.txt and Person.txt. City.txt records the city number and the city name, comma separated; Person.txt records the city number and a person's name. The goal is to produce name-city name pairs.

There are many ways to solve this problem. Here is one: arrange for the people of a city and the name of that city to land in the same group, with the city name sorted first. Then, on the reduce side, the first value is the city name and the remaining values are the names of people.

City.txt

1,gz

2,zh

3,dg

Person.txt

1,lili

2,huangq

2,chaojie

3,pengming

3,duw

Define a structure to use as the key:

class CityPerson implements WritableComparable<CityPerson> {
    int cityId;
    int flag;
    // write() and readFields() omitted
}

By convention, a city record has flag=1 and a person record has flag=0.

The sort order puts flag=1 first within the same cityId.

@Override
public int compareTo(CityPerson o) {
    if (cityId == o.cityId) {
        // Larger flag first, so the city record (flag=1) precedes the person records (flag=0)
        if (flag > o.flag) { return -1; }
        else if (flag < o.flag) { return 1; }
        return 0;
    }
    return (cityId > o.cityId) ? 1 : -1;
}

After the final sort on the reduce side, all k-v pairs are in order, and within the same cityId the flag=1 record comes first.
This CityPerson comparison cannot be used for grouping: for the same cityId, keys with different flags do not compare as 0, so they would not be placed in one group, whereas the requirement is that records with the same cityId share a group. A separate grouping comparator is therefore needed:
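public class GroupComparator implements RawComparator<CityPerson> {
    @Override
    public int compare(CityPerson o1, CityPerson o2) {
        return Integer.compare(o1.cityId, o2.cityId);
    }

    // The framework may call this byte-level overload when grouping; returning a
    // constant 0 here would then merge every key into one group.
    // Assumes CityPerson.write() emits cityId first, as a 4-byte int.
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(WritableComparator.readInt(b1, s1),
                               WritableComparator.readInt(b2, s2));
    }
}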
It compares cityId only.
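To see the whole approach work on the sample data, here is a minimal plain-Java simulation of the reduce side. It does not use Hadoop at all: JoinSketch, Record, sampleMapOutput and join are hypothetical names, and Record simply bundles the CityPerson key fields with the value (a city name or a person name). The sort puts the city record first within each cityId, grouping looks at cityId only, and the first value of each group is taken as the city name.

```java
import java.util.ArrayList;
import java.util.List;

class JoinSketch {
    /** Plain-Java stand-in for a CityPerson key together with its value. */
    static final class Record {
        final int cityId;
        final int flag;    // 1 = city record, 0 = person record
        final String name; // city name or person name
        Record(int cityId, int flag, String name) {
            this.cityId = cityId;
            this.flag = flag;
            this.name = name;
        }
    }

    // Sort order: cityId ascending, then flag descending so the city record comes first.
    static int sortCompare(Record a, Record b) {
        if (a.cityId != b.cityId) {
            return Integer.compare(a.cityId, b.cityId);
        }
        return Integer.compare(b.flag, a.flag);
    }

    // Grouping rule: only cityId matters.
    static boolean sameGroup(Record a, Record b) {
        return a.cityId == b.cityId;
    }

    // The sample data from Person.txt and City.txt, deliberately added with cities last.
    static List<Record> sampleMapOutput() {
        List<Record> out = new ArrayList<>();
        out.add(new Record(1, 0, "lili"));
        out.add(new Record(2, 0, "huangq"));
        out.add(new Record(2, 0, "chaojie"));
        out.add(new Record(3, 0, "pengming"));
        out.add(new Record(3, 0, "duw"));
        out.add(new Record(1, 1, "gz"));
        out.add(new Record(2, 1, "zh"));
        out.add(new Record(3, 1, "dg"));
        return out;
    }

    static List<String> join(List<Record> mapOutput) {
        List<Record> sorted = new ArrayList<>(mapOutput);
        sorted.sort(JoinSketch::sortCompare); // stable sort: people keep their input order
        List<String> result = new ArrayList<>();
        String cityName = null;
        Record prev = null;
        for (Record r : sorted) {
            if (prev == null || !sameGroup(prev, r)) {
                // New group: its first value is the city name, if a city record exists
                cityName = (r.flag == 1) ? r.name : null;
            }
            if (r.flag == 0 && cityName != null) {
                result.add(r.name + "-" + cityName);
            }
            prev = r;
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : join(sampleMapOutput())) {
            System.out.println(line);
        }
    }
}
```

Running main prints lili-gz, huangq-zh, chaojie-zh, pengming-dg and duw-dg, one per line. In a real job the order of people within a city is not guaranteed, since their keys compare as equal.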

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
