Hadoop partitioning, grouping, and sorting

Source: Internet
Author: User

First, keys in Hadoop must be sortable: either the key class itself implements the WritableComparable interface, or a separate comparator class can sort the key. If the key does not implement WritableComparable and sorting is instead provided by a separate utility class (one that implements the RawComparator interface), you need to set the key's sort comparator explicitly. The call below is the JobConf method of the older mapred API; in the newer mapreduce API the equivalent is Job.setSortComparatorClass():
jobConf.setOutputKeyComparatorClass(Xxx.class);
When a map task writes its output, the output is partitioned and the keys within each partition are sorted. Partitioning decides which reduce task each record is sent to. Sorting on the map side prepares for the reduce phase, because merging runs that are already sorted is faster.
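As a rough illustration of the partitioning step, the sketch below mimics hash partitioning in plain Java, with no Hadoop dependency. The class and method names are hypothetical; the formula follows what Hadoop's default HashPartitioner does, but a job may plug in a different partitioner.

```java
// Hypothetical stand-alone sketch; PartitionSketch is not a Hadoop class.
class PartitionSketch {
    // Map a key to one of numReduceTasks partitions.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // Records with the same key always land on the same reducer.
        for (int cityId : new int[] {1, 2, 3, 2, 3}) {
            System.out.println("key " + cityId + " -> reducer " + partitionFor(cityId, reducers));
        }
    }
}
```

Because the partition depends only on the key's hash, every record that shares a key is routed to the same reduce task.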
After the reduce side has collected the output of many map tasks, it sorts by key again. The sort uses the separate comparator class if one was provided; if not, the key must implement WritableComparable, otherwise the cast throws an exception.
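To see why sorted map output helps, here is a minimal sketch of merging several already-sorted runs into one sorted stream. It is plain Java and only an analogy: MergeSketch and merge are hypothetical names, and Hadoop's real merger works on serialized, partly on-disk data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-alone sketch; not Hadoop's actual merge implementation.
class MergeSketch {
    // Merge several already-sorted runs into one sorted list.
    static <K> List<K> merge(List<List<K>> runs, Comparator<K> cmp) {
        // Each heap entry is {run index, position within that run}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> cmp.compare(runs.get(a[0]).get(a[1]), runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<K> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<K> run = runs.get(top[0]);
            out.add(run.get(top[1]));
            // Advance within the run the smallest element came from.
            if (top[1] + 1 < run.size()) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three "map outputs", each already sorted by key.
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 3), Arrays.asList(2, 3), Arrays.asList(1, 2));
        System.out.println(merge(runs, Comparator.naturalOrder()));
    }
}
```

Each step only compares the current heads of the runs, so no run ever has to be re-sorted from scratch.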
In the reduce method we write, the values parameter is an iterator: the framework gathers the values of all k-v pairs whose keys are "the same" into one iterator. The key parameter of the reduce method is the key of the first k-v pair in that group. Whether two keys count as the same is decided by the business logic; it need not be an absolute equality such as 1 = 1. This process is called grouping. All k-v pairs in one group are handled by a single call to the reduce method.
Grouping needs a comparator to decide which k-v pairs belong together, and that comparator compares keys. If a separate grouping comparator is provided, it is used; otherwise the default is the key's sort comparison (the key's own compareTo method or the separate sort comparator), and keys that compare as equal are placed in one group. Sometimes the keys differ but you still want them in one group. In that case you need to provide a separate grouping comparator, set with JobConf.setOutputValueGroupingComparator() (Job.setGroupingComparatorClass() in the newer mapreduce API).
When different keys share a group, the key passed to our reduce method is the key of the first k-v pair, so the order of the keys matters a great deal. By sorting, the k-v pair you need can be placed first, which is useful for certain purposes, such as a join.
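The grouping behaviour described above can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not Hadoop code: GroupingSketch and group are hypothetical names, the records are assumed to be sorted already, and keys are simple strings such as "a:1" where only the part before the colon decides the group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch of reduce-side grouping; no Hadoop classes are used.
class GroupingSketch {
    // Walk records that are already sorted and start a new group whenever the
    // grouping comparator says the current key differs from the group's first key.
    // Each result entry holds the FIRST key of the group and all values of the group.
    static <K, V> List<Map.Entry<K, List<V>>> group(
            List<Map.Entry<K, V>> sorted, Comparator<K> grouping) {
        List<Map.Entry<K, List<V>>> groups = new ArrayList<>();
        for (Map.Entry<K, V> kv : sorted) {
            Map.Entry<K, List<V>> last = groups.isEmpty() ? null : groups.get(groups.size() - 1);
            if (last == null || grouping.compare(last.getKey(), kv.getKey()) != 0) {
                last = new SimpleEntry<K, List<V>>(kv.getKey(), new ArrayList<V>());
                groups.add(last);
            }
            last.getValue().add(kv.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> sorted = new ArrayList<>();
        sorted.add(new SimpleEntry<>("a:1", "x"));
        sorted.add(new SimpleEntry<>("a:2", "y"));
        sorted.add(new SimpleEntry<>("b:1", "z"));
        // Grouping looks only at the prefix, so "a:1" and "a:2" share a group.
        Comparator<String> byPrefix = Comparator.comparing(k -> k.substring(0, k.indexOf(':')));
        System.out.println(group(sorted, byPrefix));
    }
}
```

Here "a:1" and "a:2" are different keys, yet they end up in one group whose key is "a:1", the first key in sort order.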

For example, there are two files, City.txt and Person.txt. City.txt records a city number and a city name, separated by a comma; Person.txt records a city number and a person's name. The goal is to produce name-city name pairs.

There are many ways to solve this problem; here is one. Arrange for the people of one city and that city's name to fall into one group, with the city name first. Then, on the reduce side, the first value is the city name and the rest are the names of people.

City.txt

1,gz

2,zh

3,dg


Person.txt

1,lili

2,huangq

2,chaojie

3,pengming

3,duw


Define a structure as key:

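class CityPerson implements WritableComparable<CityPerson> {
    int cityid;  // city number
    int flag;    // 1 for a city record, 0 for a person record
    // write(), readFields() and compareTo() are omitted here
}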

By convention, a city record has flag=1 and a person record has flag=0.

The sort rule puts flag=1 first within the same cityid.

@Override
public int compareTo(CityPerson o) {
    if (cityid == o.cityid) {
        // Same city: the larger flag sorts first, so the city record precedes the persons
        if (flag > o.flag) { return -1; }
        else if (flag < o.flag) { return 1; }
        return 0;
    }
    return (cityid > o.cityid) ? 1 : -1;
}


After the final sort on the reduce side, all k-v pairs are in order, and within the same cityid the flag=1 record comes first.
This CityPerson comparison cannot be used for grouping: for the same cityid, keys with different flags compare as non-zero and so would not be placed in one group, whereas the requirement is that all keys with the same cityid form one group. A separate grouping comparator is therefore needed:
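public class GroupComparator implements RawComparator<CityPerson> {
    @Override
    public int compare(CityPerson o1, CityPerson o2) {
        if (o1.cityid == o2.cityid) {
            return 0;
        }
        return (o1.cityid > o2.cityid) ? 1 : -1;
    }
    // Some code paths (for example grouping in the newer mapreduce API) compare the
    // serialized keys through this overload, so it should not always return 0.
    // Assumption: CityPerson.write() emits cityid first, using writeInt().
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(WritableComparator.readInt(b1, s1),
                               WritableComparator.readInt(b2, s2));
    }
}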
Only cityid is compared.
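To check that the sort order and the cityid-only grouping really produce the join, the whole flow can be simulated in plain Java on the sample data. This is a sketch under stated assumptions: JoinSketch, Rec and join are hypothetical names, no Hadoop classes are involved, and the loop stands in for what the framework and the reduce method would do together.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone simulation of the reduce-side join; no Hadoop classes are used.
class JoinSketch {
    // One k-v pair: the composite key (cityid, flag) plus the value (city or person name).
    static final class Rec {
        final int cityid;
        final int flag;      // 1 = city record, 0 = person record
        final String value;
        Rec(int cityid, int flag, String value) {
            this.cityid = cityid;
            this.flag = flag;
            this.value = value;
        }
    }

    // Sort order: cityid ascending, then flag descending so the city record comes first.
    static final Comparator<Rec> SORT = (a, b) -> a.cityid != b.cityid
        ? Integer.compare(a.cityid, b.cityid)
        : Integer.compare(b.flag, a.flag);

    static List<String> join(List<Rec> records) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        List<String> out = new ArrayList<>();
        Integer groupId = null;
        String cityName = null;
        for (Rec r : sorted) {
            // Grouping compares cityid only: a change of cityid starts a new group.
            if (groupId == null || r.cityid != groupId) {
                groupId = r.cityid;
                // The first value of the group is the city name, if a city record exists.
                cityName = (r.flag == 1) ? r.value : null;
                if (r.flag == 1) {
                    continue;
                }
            }
            if (r.flag == 0 && cityName != null) {
                out.add(r.value + "-" + cityName);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> records = new ArrayList<>();
        // Person.txt rows first, City.txt rows last, to show that sorting fixes the order.
        records.add(new Rec(1, 0, "lili"));
        records.add(new Rec(2, 0, "huangq"));
        records.add(new Rec(2, 0, "chaojie"));
        records.add(new Rec(3, 0, "pengming"));
        records.add(new Rec(3, 0, "duw"));
        records.add(new Rec(1, 1, "gz"));
        records.add(new Rec(2, 1, "zh"));
        records.add(new Rec(3, 1, "dg"));
        System.out.println(join(records));
        // prints [lili-gz, huangq-zh, chaojie-zh, pengming-dg, duw-dg]
    }
}
```

The city record arrives first in each group only because flag=1 sorts ahead of flag=0; with the flags sorted the other way, the first value would be a person's name.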
