Global Sorting in Hadoop with MapReduce


This article deals with globally ordering data by key, relying mainly on Hadoop's built-in mechanisms.

1. Partition

The partition function distributes the map output among multiple reducers; using multiple reducers is, of course, what delivers the benefit of distributed computation.
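
For reference, the sketch below mirrors the logic of Hadoop's default HashPartitioner: the destination reducer is computed from the key alone, so all records with equal keys land in the same reducer.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Mirrors Hadoop's default HashPartitioner: pick a reducer from the key's hash.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit so the modulo result is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }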

2. Idea

Since each partition is internally sorted by the shuffle, the whole output is sorted as long as the partitions are ordered relative to one another: if every key in partition i is less than every key in partition i+1, concatenating the reducers' outputs in partition order yields a globally sorted result.

3. Problem

Given this idea, the problem becomes: how do we define the partition boundaries?

Solution: Hadoop provides a sampler that helps us estimate boundaries over the whole key space, so that the data is allocated to partitions as evenly as possible.
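
A minimal driver sketch of this approach using Hadoop's own InputSampler and TotalOrderPartitioner, assuming SequenceFile input keyed and valued by Text; the sampling parameters, paths, and reducer count are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class SampledGlobalSort {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sampled-global-sort");
            job.setJarByClass(SampledGlobalSort.class);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setMapperClass(Mapper.class);    // identity map
            job.setReducerClass(Reducer.class);  // identity reduce
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(4);            // 4 ordered partitions

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Sample 1% of the keys (up to 10000 samples from at most 10 splits)
            // and write the estimated boundary keys to a partition file.
            InputSampler.Sampler<Text, Text> sampler =
                    new InputSampler.RandomSampler<>(0.01, 10000, 10);
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path(args[1] + "_partitions"));
            InputSampler.writePartitionFile(job, sampler);
            job.setPartitionerClass(TotalOrderPartitioner.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }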


Reference: http://stblog.baidu-tech.com/?p=397

2. An example Hadoop application: sorting large-scale data

The Hadoop platform does not provide global sorting of data out of the box, yet globally sorting data is a very common requirement in large-scale data processing. Many tasks that break a large-scale job into pieces of small data size must first globally sort the data. For example, to merge the attributes of two large datasets, one can globally sort both datasets and then decompose the work into a series of small two-way merge problems.

2.1 Using Hadoop to globally sort large-scale data

The most straightforward way to sort a large amount of data with Hadoop is to feed the entire file's content to the map phase, have the map do no processing at all and output directly to a single reducer, let Hadoop's own shuffle mechanism sort all the data, and then have the reducer write it straight out.
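
A minimal sketch of this naive approach, assuming line-oriented text input in which each whole line is the sort key; the single reducer is the point here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NaiveGlobalSort {
        // Emit each input line as the key; the shuffle sorts all keys for us.
        public static class LineAsKeyMapper
                extends Mapper<Object, Text, Text, NullWritable> {
            @Override
            protected void map(Object offset, Text line, Context ctx)
                    throws java.io.IOException, InterruptedException {
                ctx.write(line, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "naive-global-sort");
            job.setJarByClass(NaiveGlobalSort.class);
            job.setMapperClass(LineAsKeyMapper.class);
            job.setReducerClass(Reducer.class);  // identity reduce: write keys out in order
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            job.setNumReduceTasks(1);            // one reducer => one globally sorted file
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }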

However, this method is no different from sorting on a single machine; it completely fails to exploit the convenience of distributed computing across many machines. It is therefore not feasible.

Using Hadoop's divide-and-conquer computing model, we can borrow the idea of quicksort. Briefly recall how quicksort works: the basic step is to select one element of the current data as a pivot, then put everything greater than the pivot on one side and everything less than it on the other.

Imagine that we had n pivots (call them rulers here): we could divide all the data into n+1 parts, hand those n+1 parts to the reducers, let Hadoop sort each part automatically, output n+1 internally ordered files, and then concatenate those n+1 files into a single, globally sorted file.

We can thus summarize the steps for sorting large amounts of data with Hadoop:

1) Sample the data to be sorted;

2) Sort the sampled data to produce the rulers;

3) In the map phase, determine which two rulers each input record falls between, and send the record to the reducer for the corresponding interval ID (see the partitioner sketch after this list);

4) Each reducer writes the data it receives straight out; taken in interval order, the outputs form the sorted result.
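
The sketch referenced in step 3: a hand-rolled partitioner that binary-searches each key against the sorted rulers to find its interval ID. Hard-coding the rulers is an illustrative assumption; a real job would load them from the sampling step's output.

    import java.util.Arrays;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class RulerPartitioner extends Partitioner<Text, NullWritable> {
        // Hypothetical rulers, hard-coded for illustration: 3 rulers => 4 intervals.
        // A real job would read these from the output of the sampling step.
        private static final String[] RULERS = { "g", "n", "t" };

        @Override
        public int getPartition(Text key, NullWritable value, int numPartitions) {
            int idx = Arrays.binarySearch(RULERS, key.toString());
            // binarySearch returns -(insertionPoint) - 1 when the key is absent;
            // the insertion point is exactly the index of the key's interval.
            int interval = idx >= 0 ? idx : -(idx + 1);
            // Guard against a job configured with fewer reducers than intervals.
            return Math.min(interval, numPartitions - 1);
        }
    }

Because every key in interval i is smaller than every key in interval i+1, the per-reducer sorted outputs can simply be concatenated in interval order.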

As a concrete example, consider sorting a set of URLs this way, with each map output key built from the target reducer's ID plus the URL itself.

One small problem remains to deal with: how do we send data to the reducer with a specific ID? Hadoop provides a variety of partitioning algorithms, which decide which reducer a record goes to based on the key the map outputs (the reducer-side sort also depends on the key). So to send data to a particular reducer, it suffices to provide an appropriate key along with the data (in the URL example above, the reducer ID plus the URL), and the record will go where it should.
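
A hedged sketch of this "ID + URL" key trick: the map prepends the target interval's ID to the URL, and a partitioner parses the ID back out of the key. The key layout and class name are illustrative assumptions, not from the original article.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assumed key layout: "<intervalId>\t<url>", e.g. "0002\thttp://example.com".
    // With zero-padded IDs, each reducer still sees its keys in URL order.
    public class IdPrefixPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String s = key.toString();
            int id = Integer.parseInt(s.substring(0, s.indexOf('\t')));
            return id % numPartitions;
        }
    }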

2.2 Precautions

1) The rulers should be extracted as uniformly as possible; this is consistent with the emphasis that many quicksort variants place on choosing good pivots.

2) HDFS is a file system with very asymmetric read and write performance. Use its high read performance as much as possible, and reduce reliance on writing files and on shuffle operations. For example, when the data processing depends on statistics computed over the data, splitting the statistics and the processing into two rounds of MapReduce is much faster than combining statistics and processing in a single reduce (see the driver sketch below).
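
A driver-level sketch of this two-round pattern, under stated assumptions: pass 1 only measures the data (here via the built-in MAP_INPUT_RECORDS counter), and the driver hands the statistic to pass 2 through the configuration. StatsMapper, ProcessMapper, and ProcessReducer are hypothetical placeholders for job-specific classes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoPassDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Pass 1: a job whose only purpose is to measure the data.
            Job stats = Job.getInstance(conf, "pass1-statistics");
            stats.setJarByClass(TwoPassDriver.class);
            stats.setMapperClass(StatsMapper.class); // hypothetical stats mapper
            stats.setNumReduceTasks(0);              // map-only: no shuffle or reduce cost
            FileInputFormat.addInputPath(stats, new Path(args[0]));
            FileOutputFormat.setOutputPath(stats, new Path(args[1] + "/stats"));
            if (!stats.waitForCompletion(true)) System.exit(1);

            // Read the statistic back from the job's counters rather than from HDFS.
            long total = stats.getCounters()
                    .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
            conf.setLong("stats.total.records", total);

            // Pass 2: the real processing; its tasks can read the statistic
            // from the configuration.
            Job process = Job.getInstance(conf, "pass2-processing");
            process.setJarByClass(TwoPassDriver.class);
            process.setMapperClass(ProcessMapper.class);   // hypothetical mapper
            process.setReducerClass(ProcessReducer.class); // hypothetical reducer
            FileInputFormat.addInputPath(process, new Path(args[0]));
            FileOutputFormat.setOutputPath(process, new Path(args[1] + "/out"));
            System.exit(process.waitForCompletion(true) ? 0 : 1);
        }
    }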

3. Summary

Hadoop uses a data-driven computing model: combining MapReduce with HDFS, tasks run on the compute nodes where the data is stored, making full use of those nodes' storage and compute resources and greatly reducing the overhead of transferring data across the network.

Hadoop provides a convenient platform for parallel computing on clusters. Any computation model that can isolate the dependencies between portions of a dataset can be applied well on Hadoop. More large-scale data computing methods implemented with Hadoop will be introduced in later articles.

