Hadoop's Secondary Sorting

Source: Internet
Author: User
Over the past few days I ran into a problem while using Hadoop in a project. Consider a key-value data set of the form id-biz object. The ids can be partitioned (for example, by a specific hash algorithm P) into a partitions. Let b be the number of reducers, each of which uses a third-party component to upload data in batches, and let c be the number of files the data is uploaded to. There are two requirements:

  • a, b, and c are all equal, so that the data of each partition ends up in the same file, uploaded by the same reducer;
  • The data uploaded by each reducer must be ordered.

At first, we thought of the following approach. To make batch upload possible in the reducer, the key passed to the reducer becomes the index calculated by the hash algorithm, so that the value in the reducer is an iterator over a set of biz objects. This allows the batch upload and commit to happen within a single reduce call. During the batch upload, you can commit every time a maximum number of records (for example, 1000) has accumulated, so that memory usage stays within a fixed bound.
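The batching idea above can be sketched in plain Java. This is only an illustration under assumptions: no Hadoop classes are used, the `Uploader` interface is a hypothetical stand-in for the third-party upload component, and the biz objects are represented as strings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Plain-Java sketch; no Hadoop classes are used.
class BatchingSketch {
    // Hypothetical stand-in for the third-party upload component.
    interface Uploader {
        void upload(List<String> batch);
    }

    // Mimics the body of one reduce call: drain the value iterator and
    // hand off a batch whenever maxBatch records have accumulated.
    // Returns the number of batches that were uploaded.
    static int uploadInBatches(Iterator<String> values, int maxBatch, Uploader uploader) {
        List<String> buffer = new ArrayList<String>();
        int batches = 0;
        while (values.hasNext()) {
            buffer.add(values.next());
            if (buffer.size() == maxBatch) {
                uploader.upload(new ArrayList<String>(buffer));
                buffer.clear();
                batches++;
            }
        }
        if (!buffer.isEmpty()) { // flush the final, partially filled batch
            uploader.upload(new ArrayList<String>(buffer));
            batches++;
        }
        return batches;
    }
}
```

At most `maxBatch` records are held in memory at any time, which is the point of committing in chunks rather than collecting the whole iterator first.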

How to ensure order?

Hadoop automatically sorts keys before the reduce phase, but what is needed here is to sort the values by id (because the key has become the index after the map phase). To sort values, use Hadoop's Secondary Sorting (see the Stack Overflow link).

The figure shows the idea: the attribute of the value that you want to sort by is moved into the key, so that the key becomes a composite key made up of two parts, the natural key (the index above) and the secondary key (the id above).
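A minimal plain-Java sketch of such a composite key follows. The names `CompositeKey` and `fromId` are hypothetical, the id is assumed to be a `long`, and "id modulo the partition count" is only a stand-in for hash algorithm P. A real Hadoop key would additionally need to be serializable (for example by implementing `WritableComparable`), which is omitted here.

```java
// Plain-Java sketch of the composite key; no Hadoop classes are used.
class CompositeKey {
    final int index; // natural key: which partition the record belongs to
    final long id;   // secondary key: the field the values should be sorted by

    CompositeKey(int index, long id) {
        this.index = index;
        this.id = id;
    }

    // Hypothetical stand-in for hash algorithm P: id modulo the partition count.
    // floorMod keeps the index non-negative even for negative ids.
    static CompositeKey fromId(long id, int partitions) {
        int index = (int) Math.floorMod(id, (long) partitions);
        return new CompositeKey(index, id);
    }

    @Override
    public String toString() {
        return "(" + index + ", " + id + ")";
    }
}
```

In this sketch the mapper would emit `CompositeKey.fromId(id, a)` as the key and the biz object as the value.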

1. Partition: only the natural key is used for partitioning, so that all data with the same index falls into the same partition;

JobConf.setPartitionerClass(...);

2. Sort: the comparator used to sort keys must compare both the natural key and the secondary key, so that keys with the same index are ordered by id. Since ids and values correspond one to one, the values are ordered as well.

JobConf.setOutputKeyComparatorClass(...);

3. Group: the grouping comparator ignores the secondary key and compares only the natural key, so that all data belonging to the same index is handled by a single reduce call in the same reducer.

JobConf.setOutputValueGroupingComparator(...);
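The three steps above can be simulated in plain Java to see how they interact. This is a sketch under assumptions, not Hadoop code: the class and method names are hypothetical, the id is assumed to be a `long`, and the partition formula simply mirrors the common "non-negative hash modulo partition count" pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of partition, sort and group; no Hadoop classes are used.
class SecondarySortSketch {
    // Composite key: index is the natural key, id is the secondary key.
    static final class Key {
        final int index;
        final long id;
        Key(int index, long id) { this.index = index; this.id = id; }
    }

    // 1. Partition: only the natural key decides the partition.
    static int partition(Key key, int numPartitions) {
        return (key.index & Integer.MAX_VALUE) % numPartitions;
    }

    // 2. Sort: order by natural key first, then by secondary key.
    static final Comparator<Key> SORT = new Comparator<Key>() {
        @Override public int compare(Key a, Key b) {
            int byIndex = Integer.compare(a.index, b.index);
            return byIndex != 0 ? byIndex : Long.compare(a.id, b.id);
        }
    };

    // 3. Group: keys with the same natural key compare as equal.
    static final Comparator<Key> GROUP = new Comparator<Key>() {
        @Override public int compare(Key a, Key b) {
            return Integer.compare(a.index, b.index);
        }
    };

    // Simulates what one reducer sees: its keys are sorted with SORT, then
    // consecutive keys that GROUP treats as equal form one reduce call.
    // Each inner list holds the ids seen by one reduce call, in order.
    static List<List<Long>> reduceCalls(List<Key> keys, int partition, int numPartitions) {
        List<Key> mine = new ArrayList<Key>();
        for (Key k : keys) {
            if (partition(k, numPartitions) == partition) mine.add(k);
        }
        Collections.sort(mine, SORT);
        List<List<Long>> calls = new ArrayList<List<Long>>();
        Key previous = null;
        for (Key k : mine) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                calls.add(new ArrayList<Long>());
            }
            calls.get(calls.size() - 1).add(k.id);
            previous = k;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(new Key(1, 30L), new Key(0, 7L),
                new Key(1, 10L), new Key(0, 2L), new Key(2, 5L));
        // With 2 partitions, indexes 0 and 2 both land in partition 0.
        System.out.println(reduceCalls(keys, 0, 2)); // [[2, 7], [5]]
        System.out.println(reduceCalls(keys, 1, 2)); // [[10, 30]]
    }
}
```

In an actual job, the same logic would live in a `Partitioner` implementation and in two comparator classes registered through the three `JobConf` setters listed above.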

In conclusion, in the reducer the input key is a composite key object containing the index and the id, and the input value is a traversable collection of objects of the original biz object type.

Afterword: That is the Secondary Sorting process, and it does solve my problem. Later, however, I found that my problem does not actually need to be solved this way:

  • The key that enters the reducer only needs to be the id, and Hadoop sorts the keys automatically;
  • The partitioning policy stays the same, except that the index is now computed inside the partitioner and used to choose the partition;
  • There is no need to specify separate grouping and sorting comparators;
  • Create a container object p with a maximum size (for example, 1000) in the reducer.

This way, since the data of each partition is processed by the same reducer, and the reduce calls within that reducer arrive in id order, you can put the data into p on each call and submit it in one batch whenever p is full.
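A rough plain-Java sketch of this simpler approach is shown below. It makes several assumptions: no Hadoop classes are used, `Uploader` is a hypothetical stand-in for the third-party component, only the ids are buffered for brevity, and the final flush in `close` is an added step that the description above does not spell out (in Hadoop it would presumably belong in the reducer's `close` or `cleanup` hook, so that a partially filled p is not lost).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of a reducer that buffers records across reduce calls.
class BufferedUploadSketch {
    // Hypothetical stand-in for the third-party upload component.
    interface Uploader {
        void upload(List<Long> batch);
    }

    private final int capacity;                       // maximum size of p
    private final Uploader uploader;
    private final List<Long> p = new ArrayList<Long>(); // the container object p

    BufferedUploadSketch(int capacity, Uploader uploader) {
        this.capacity = capacity;
        this.uploader = uploader;
    }

    // Called once per key; keys (ids) are assumed to arrive already sorted.
    void reduce(long id) {
        p.add(id);
        if (p.size() >= capacity) {
            flush();
        }
    }

    // Called once after the last reduce call to submit what is left in p.
    void close() {
        if (!p.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        uploader.upload(new ArrayList<Long>(p)); // pass a copy, then reuse p
        p.clear();
    }
}
```

Because the ids reach `reduce` in sorted order, each uploaded batch is ordered and the batches themselves are submitted in order, with no custom sorting or grouping comparator involved.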

The tests passed. Looking back, I had overcomplicated the problem at the beginning.

Unless otherwise noted, the articles here are my original work and may not be used for any commercial purpose without permission. When reposting, please keep the article intact and cite the source link "four fires nagging".
