Global Sorting in Hadoop with MapReduce


This article deals with globally ordering data by key, relying mainly on Hadoop's built-in mechanisms.

1. Partition

The partition function distributes the map output among multiple reducers; using multiple reducers is, of course, what delivers the benefit of distributed computation.
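
For reference, the sketch below mirrors the logic of Hadoop's default HashPartitioner: the destination reducer is computed from the key alone, so all records with equal keys land in the same reducer.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Mirrors Hadoop's default HashPartitioner: pick a reducer from the key's hash.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit so the modulo result is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }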

2. Idea

Since each partition is internally sorted by the shuffle, the whole output is sorted as long as the partitions are ordered relative to one another: if every key in partition i is less than every key in partition i+1, concatenating the reducers' outputs in partition order yields a globally sorted result.

3. Problem

Given this idea, the problem becomes: how do we define the partition boundaries?

Solution: Hadoop provides a sampler that helps us estimate boundaries over the whole key space, so that the data is allocated to partitions as evenly as possible.
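
A minimal driver sketch of this approach using Hadoop's own InputSampler and TotalOrderPartitioner, assuming SequenceFile input keyed and valued by Text; the sampling parameters, paths, and reducer count are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class SampledGlobalSort {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sampled-global-sort");
            job.setJarByClass(SampledGlobalSort.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setMapperClass(Mapper.class);    // identity map
            job.setReducerClass(Reducer.class);  // identity reduce
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(4);            // 4 ordered partitions

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Sample 1% of the keys (up to 10000 samples from at most 10 splits)
            // and write the estimated boundary keys to a partition file.
            InputSampler.Sampler<Text, Text> sampler =
                    new InputSampler.RandomSampler<>(0.01, 10000, 10);
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path(args[1] + "_partitions"));
            InputSampler.writePartitionFile(job, sampler);
            job.setPartitionerClass(TotalOrderPartitioner.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }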


Reference: http://stblog.baidu-tech.com/?p=397

2. An example Hadoop application: sorting large-scale data

The Hadoop platform does not provide global sorting of data out of the box, yet globally sorting data is a very common requirement in large-scale data processing. Many tasks that break a large-scale job into pieces of small data size must first globally sort the data. For example, to merge the attributes of two large datasets, one can globally sort both datasets and then decompose the work into a series of small two-way merge problems.

2.1 Using Hadoop to globally sort large-scale data

The most straightforward way to sort a large amount of data with Hadoop is to feed the entire file's content to the map phase, have the map do no processing at all and output directly to a single reducer, let Hadoop's own shuffle mechanism sort all the data, and then have the reducer write it straight out.
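
A minimal sketch of this naive approach, assuming line-oriented text input in which each whole line is the sort key; the single reducer is the point here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NaiveGlobalSort {
        // Emit each input line as the key; the shuffle sorts all keys for us.
        public static class LineAsKeyMapper
                extends Mapper<Object, Text, Text, NullWritable> {
            @Override
            protected void map(Object offset, Text line, Context ctx)
                    throws java.io.IOException, InterruptedException {
                ctx.write(line, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "naive-global-sort");
            job.setJarByClass(NaiveGlobalSort.class);
            job.setMapperClass(LineAsKeyMapper.class);
            job.setReducerClass(Reducer.class);  // identity reduce: write keys out in order
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            job.setNumReduceTasks(1);            // one reducer => one globally sorted file
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }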

However, this method is no different from sorting on a single machine; it completely fails to exploit the convenience of distributed computing across many machines. It is therefore not feasible.

Using Hadoop's divide-and-conquer computing model, we can borrow the idea of quicksort. Briefly recall how quicksort works: the basic step is to select one element of the current data as a pivot, then put everything greater than the pivot on one side and everything less than it on the other.

Imagine that we had n pivots (call them rulers here): we could divide all the data into n+1 parts, hand those n+1 parts to the reducers, let Hadoop sort each part automatically, output n+1 internally ordered files, and then concatenate those n+1 files into a single, globally sorted file.

We can thus summarize the steps for sorting large amounts of data with Hadoop:

1) Sample the data to be sorted;

2) Sort the sampled data to produce the rulers;

3) In the map phase, determine which two rulers each input record falls between, and send the record to the reducer for the corresponding interval ID (see the partitioner sketch after this list);

4) Each reducer writes the data it receives straight out; taken in interval order, the outputs form the sorted result.
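
The sketch referenced in step 3: a hand-rolled partitioner that binary-searches each key against the sorted rulers to find its interval ID. Hard-coding the rulers is an illustrative assumption; a real job would load them from the sampling step's output.

    import java.util.Arrays;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class RulerPartitioner extends Partitioner<Text, NullWritable> {
        // Hypothetical rulers, hard-coded for illustration: 3 rulers => 4 intervals.
        // A real job would read these from the output of the sampling step.
        private static final String[] RULERS = { "g", "n", "t" };

        @Override
        public int getPartition(Text key, NullWritable value, int numPartitions) {
            int idx = Arrays.binarySearch(RULERS, key.toString());
            // binarySearch returns -(insertionPoint) - 1 when the key is absent;
            // the insertion point is exactly the index of the key's interval.
            int interval = idx >= 0 ? idx : -(idx + 1);
            // Guard against a job configured with fewer reducers than intervals.
            return Math.min(interval, numPartitions - 1);
        }
    }

Because every key in interval i is smaller than every key in interval i+1, the per-reducer sorted outputs can simply be concatenated in interval order.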

As a concrete example, consider sorting a set of URLs this way, with each map output key built from the target reducer's ID plus the URL itself.

One small problem remains to deal with: how do we send data to the reducer with a specific ID? Hadoop provides a variety of partitioning algorithms, which decide which reducer a record goes to based on the key the map outputs (the reducer-side sort also depends on the key). So to send data to a particular reducer, it suffices to provide an appropriate key along with the data (in the URL example above, the reducer ID plus the URL), and the record will go where it should.
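
A hedged sketch of this "ID + URL" key trick: the map prepends the target interval's ID to the URL, and a partitioner parses the ID back out of the key. The key layout and class name are illustrative assumptions, not from the original article.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assumed key layout: "<intervalId>\t<url>", e.g. "0002\thttp://example.com".
    // With zero-padded IDs, each reducer still sees its keys in URL order.
    public class IdPrefixPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String s = key.toString();
            int id = Integer.parseInt(s.substring(0, s.indexOf('\t')));
            return id % numPartitions;
        }
    }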

2.2 Precautions

1) The rulers should be extracted as uniformly as possible; this is consistent with the emphasis that many quicksort variants place on choosing good pivots.

2) HDFS is a file system with very asymmetric read and write performance. Use its high read performance as much as possible, and reduce reliance on writing files and on shuffle operations. For example, when the data processing depends on statistics computed over the data, splitting the statistics and the processing into two rounds of MapReduce is much faster than combining statistics and processing in a single reduce (see the driver sketch below).
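
A driver-level sketch of this two-round pattern, under stated assumptions: pass 1 only measures the data (here via the built-in MAP_INPUT_RECORDS counter), and the driver hands the statistic to pass 2 through the configuration. StatsMapper, ProcessMapper, and ProcessReducer are hypothetical placeholders for job-specific classes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoPassDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Pass 1: a job whose only purpose is to measure the data.
            Job stats = Job.getInstance(conf, "pass1-statistics");
            stats.setJarByClass(TwoPassDriver.class);
            stats.setMapperClass(StatsMapper.class); // hypothetical stats mapper
            stats.setNumReduceTasks(0);              // map-only: no shuffle or reduce cost
            FileInputFormat.addInputPath(stats, new Path(args[0]));
            FileOutputFormat.setOutputPath(stats, new Path(args[1] + "/stats"));
            if (!stats.waitForCompletion(true)) System.exit(1);

            // Read the statistic back from the job's counters rather than from HDFS.
            long total = stats.getCounters()
                    .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
            conf.setLong("stats.total.records", total);

            // Pass 2: the real processing; its tasks can read the statistic
            // from the configuration.
            Job process = Job.getInstance(conf, "pass2-processing");
            process.setJarByClass(TwoPassDriver.class);
            process.setMapperClass(ProcessMapper.class);   // hypothetical mapper
            process.setReducerClass(ProcessReducer.class); // hypothetical reducer
            FileInputFormat.addInputPath(process, new Path(args[0]));
            FileOutputFormat.setOutputPath(process, new Path(args[1] + "/out"));
            System.exit(process.waitForCompletion(true) ? 0 : 1);
        }
    }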

3. Summary

Hadoop uses a data-driven computing model: combining MapReduce with HDFS, tasks run on the compute nodes where the data is stored, making full use of those nodes' storage and compute resources and greatly reducing the overhead of transferring data across the network.

Hadoop provides a convenient platform for parallel computing on clusters. Any computation model that can isolate the dependencies between portions of a dataset can be applied well on Hadoop. More large-scale data computing methods implemented with Hadoop will be introduced in later articles.

