A First Experience with Big Data Sorting

1. Preface

For a long time, I have asked myself: what do I actually want to learn, and will it help my current job?

There are simply too many new things in the computer industry. Chasing all of them is irrational and time-consuming, and you will most likely learn none of them deeply.

Massive data, the so-called big data, may be one of the hottest terms in the field. But do we really need it? Personally, I think that if the data you have to process has not reached the "massive" level, you should set it aside; for day-to-day work there is no real need to study it. Learning for research is another matter, which I will not go into here.

Unfortunately, some of the data on my hands (on the order of 100 million records) has grown too large for a traditional database to handle, for tasks such as sorting. This misfortune turned out to be lucky, because the problem can finally be solved with so-called "big data" technology.

Therefore, before you reach for a new technology, you had better ask yourself: "Is there actually a piece of meat that is hard to cut without such a sharp knife?"

2. Basic knowledge preparation for big data

Environment: several servers; a single host also works, it is only a matter of efficiency.

Foundation: Hadoop

Algorithms: understand the "divide and conquer" idea from classic algorithms

For a big data sorting task, we need a Hadoop runtime environment in which to run the sorting job.

3. Using terasort as a reference

In fact, the Hadoop source code already contains a very good example: terasort.

The Hadoop terasort algorithm consists of three steps: sampling -> the map tasks tag each data record -> the reduce tasks perform local sorts.

Data sampling is performed on the JobClient. First, some records are extracted from the input data and sorted, then divided into R blocks; the upper and lower bounds of each block (called "split points") are determined and saved to the distributed cache.
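As a rough illustration of this step (a minimal sketch, not terasort's actual sampler; the class name and the every-k-th-record scheme are my own assumptions), the split points can be derived like this:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Minimal sampling sketch: take every k-th key, sort the sample,
    // then pick r-1 evenly spaced values as split points for r buckets.
    public class Sampler {
        static int[] splitPoints(int[] keys, int k, int r) {
            List<Integer> sample = new ArrayList<>();
            for (int i = 0; i < keys.length; i += k) {
                sample.add(keys[i]);
            }
            Collections.sort(sample);
            int[] splits = new int[r - 1];
            for (int i = 1; i < r; i++) {
                // evenly spaced positions in the sorted sample
                splits[i - 1] = sample.get(i * sample.size() / r);
            }
            return splits;
        }
    }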

In the map stage, each map task first reads the split points from the distributed cache and builds a trie (a two-level trie whose leaf nodes store the corresponding reduce task numbers). It then starts processing its data: for each record, it looks up the number of the owning reduce task in the trie and tags the record with it.
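To make the lookup structure concrete, here is a toy two-level trie over the first two bytes of a key (an assumed simplification; the real terasort partitioner handles keys that share a prefix with a split point more carefully):

    // Toy two-level trie: a 256*256 table indexed by a key's first two
    // bytes; each cell stores the reduce task number for keys with that
    // prefix, so a lookup is a single array access per record.
    public class TwoLevelTrie {
        private final int[] table = new int[256 * 256];

        // splits: split-point keys in ascending order, one per boundary
        public TwoLevelTrie(byte[][] splits) {
            int part = 0;
            for (int prefix = 0; prefix < table.length; prefix++) {
                // skip every split point whose 2-byte prefix lies below this prefix
                while (part < splits.length && prefix > prefix2(splits[part])) {
                    part++;
                }
                table[prefix] = part;
            }
        }

        public int reduceTaskFor(byte[] key) {
            return table[prefix2(key)];
        }

        private static int prefix2(byte[] k) {
            int b0 = k.length > 0 ? (k[0] & 0xff) : 0;
            int b1 = k.length > 1 ? (k[1] & 0xff) : 0;
            return (b0 << 8) | b1;
        }
    }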

In the reduce stage, each reduce task reads its corresponding data from every map task and sorts it locally. Finally, the outputs of the reduce tasks are emitted in reduce-task-number order to form the overall result.

4. Sorting algorithm ideas

For sorting massive data, the underlying algorithmic idea stays the same; only the scale changes.

Brief process description

1. Data preparation stage: read the data to be sorted from the target HDFS file, for example:

12 24 11 60 23 46 79 1 21

2. Sampling stage: extract the sample that will define the bucket boundaries, for example:

11 24 46

3. Interval definition: the three sample points define 4 buckets: -N ~ 11, 12 ~ 24, 25 ~ 46, and 47 ~ +N, where N indicates infinity.

4. Bucket division (partition) stage: place each record to be sorted into the bucket whose interval contains it, for example:

Bucket A, -N ~ 11: 11, 1

Bucket B, 12 ~ 24: 12, 24, 23, 21

Bucket C, 25 ~ 46: 46

Bucket D, 47 ~ +N: 60, 79

Note that the number of buckets equals the number of reduce tasks.

Because the buckets are ordered by interval, once each bucket is sorted internally the buckets only need to be output in sequence to yield the fully sorted result.

5. Bucket sorting stage: sort the data within each bucket; the stage ends when every bucket is sorted. This stage is efficient only if the buckets are evenly filled, so the sampling algorithm should try to make the division as even as possible. MapReduce sorts the different buckets in parallel, and the sorting algorithm used inside each bucket can be chosen to fit actual needs. In short, the result of this stage is that the data in each bucket is ordered.

6. Merge stage: concatenate the sorted buckets in bucket order and output the result; no further sorting across buckets is needed. A single-machine sketch of the whole process follows this list.
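The sketch below walks the example data through all six stages in plain Java (class and method names are mine; in the real job, stage 4 is the map-side partitioning and stage 5 is the parallel reduce-side sorts):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class BucketSortDemo {
        public static void main(String[] args) {
            int[] data = {12, 24, 11, 60, 23, 46, 79, 1, 21};  // stage 1: data preparation
            int[] splits = {11, 24, 46};                       // stage 2: sampled split points

            // stages 3 and 4: the split points define 4 intervals; drop each
            // record into the bucket whose interval contains it
            List<List<Integer>> buckets = new ArrayList<>();
            for (int i = 0; i <= splits.length; i++) {
                buckets.add(new ArrayList<>());
            }
            for (int v : data) {
                buckets.get(bucketOf(splits, v)).add(v);
            }

            // stage 5: sort each bucket independently (reduce tasks do this in parallel)
            for (List<Integer> b : buckets) {
                Collections.sort(b);
            }

            // stage 6: merge -- concatenating the buckets in order is already sorted
            List<Integer> result = new ArrayList<>();
            for (List<Integer> b : buckets) {
                result.addAll(b);
            }
            System.out.println(result);  // [1, 11, 12, 21, 23, 24, 46, 60, 79]
        }

        // bucket index = number of split points smaller than the value
        static int bucketOf(int[] splits, int v) {
            int i = 0;
            while (i < splits.length && v > splits[i]) {
                i++;
            }
            return i;
        }
    }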

5. Sorting 100 million records with integer keys

The code for finding a record's partition:

    /**
     * @param arr the bucket split points (sample points), in ascending order
     * @param des the key to place
     * @return the bucket (partition) index for the key
     */
    private int find(Integer[] arr, Integer des) {
        int low = 0;
        int high = arr.length - 1;
        while (low <= high) {
            int middle = (low + high) / 2;
            if (des.intValue() == arr[middle]) {
                // a key equal to split point i belongs to bucket i
                return middle;
            } else if (des < arr[middle]) {
                high = middle - 1;
            } else {
                low = middle + 1;
            }
        }
        // no exact match: low is the insertion point, i.e. the number of
        // split points smaller than the key, which is the bucket index
        return low;
    }

    /**
     * Executed for each record to decide its partition.
     *
     * @return the partition number; e.g. 0 means the record goes to partition 0
     */
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        return find(splitPoints, key.get());
    }
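As a quick sanity check against the example in section 4 (hypothetical values; splitPoints here is just the three sample points):

    Integer[] splitPoints = {11, 24, 46};
    find(splitPoints, 1);   // -> 0: bucket A
    find(splitPoints, 11);  // -> 0: exact match on the first split point
    find(splitPoints, 23);  // -> 1: bucket B
    find(splitPoints, 60);  // -> 3: bucket D

In the job driver, a partitioner like this would be attached with job.setPartitionerClass(...) and the number of reduce tasks set to the number of buckets.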

Following the terasort approach, the 100 million records (about 3.4 GB of random data) were divided into 100,000 buckets and sorted.

(Figure: the first page of the sorting result.)
