1 Preface
For a long time, whenever I wanted to learn something new, I have asked myself: what exactly do I want to learn, and will it help my current job?
There are simply too many new things in the computer industry. Chasing all of them would only leave me confused; it would be irrational and time-consuming, and most likely none of it would be learned in depth.
Take massive data, the so-called big data, which may be one of the hottest terms in technology right now. But do we really need it? Personally, I think that if the data you have to process has not reached the "massive" level, you should set big data aside; there is really no need to study it for your day job. Studying it for research is another matter, which I will not go into here.
Unfortunately, some of the data on my hands (on the order of 100 million records) really is too large: a traditional database can no longer handle computations over it, such as sorting. This misfortune is actually a stroke of luck, because the problem can finally be solved by the so-called "big data" technology.
So, before you reach for a new technology, you had better ask yourself: "Is there actually a piece of meat that needs such a sharp knife to cut?"
2 Basic big data knowledge preparation
Environment: several servers; a single machine also works, it is only a matter of efficiency.
Basics: hadoop
Algorithms: an understanding of the "divide and conquer" idea from classic algorithms
For a big data sorting task, we need a hadoop runtime environment in which to run the sorting job.
3 Drawing on the terasort example
In fact, the hadoop source code ships with a very good example: terasort.
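If you just want to try it, the examples jar that ships with hadoop contains both the data generator and the sort. The jar name and path vary between hadoop versions, so treat the following as a sketch of the invocation rather than an exact recipe:

hadoop jar hadoop-mapreduce-examples-*.jar teragen 1000000 /terasort/input
hadoop jar hadoop-mapreduce-examples-*.jar terasort /terasort/input /terasort/output

Here teragen takes the number of rows to generate and an output directory, and terasort takes the generated input and an output directory.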
The hadoop terasort algorithm consists of three steps: sampling -> map tasks tag each data record -> reduce tasks perform local sorting.
Sampling is done on the JobClient side. First, a subset of the input data is extracted and sorted, then divided into R blocks; the upper and lower bounds of each block (called the "split points") are determined and saved to the distributed cache.
In the map stage, each map task first reads the split points from the distributed cache and builds a trie from them (a two-level trie whose leaf nodes store the number of the corresponding reduce task). It then starts processing the data: for each record, it looks up the trie to find the number of the reduce task the record belongs to, and saves it with the record.
In the reduce stage, each reduce task reads its corresponding data from every map task and sorts it locally; finally, the outputs of the reduce tasks, taken in order of reduce task number, form the fully sorted result.
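To make the map-stage lookup concrete, the following is a minimal single-machine sketch of the two-level trie idea, reconstructed from the description above rather than copied from hadoop's source (the class and method names, and the unsigned-byte comparison, are my own assumptions). The first two bytes of a key index into a 65536-entry table, so most records are routed to a reduce task without comparing against any split point.

// A minimal sketch of the two-level trie idea described above; not
// hadoop's actual code. Keys and split points are byte arrays of
// length >= 2, compared lexicographically as unsigned bytes.
public class TwoLevelTrieRouter {

    private final byte[][] splitPoints;       // ascending split points
    private final int[] lo = new int[65536];  // # split points with 2-byte prefix < p
    private final int[] hi = new int[65536];  // # split points with 2-byte prefix <= p

    public TwoLevelTrieRouter(byte[][] splitPoints) {
        this.splitPoints = splitPoints;
        int below = 0, upTo = 0;
        for (int p = 0; p < 65536; p++) {
            while (below < splitPoints.length && prefix(splitPoints[below]) < p) below++;
            while (upTo < splitPoints.length && prefix(splitPoints[upTo]) <= p) upTo++;
            lo[p] = below;
            hi[p] = upTo;
        }
    }

    // Reduce task number for a key: the number of split points less than it.
    public int partition(byte[] key) {
        int p = prefix(key);
        if (lo[p] == hi[p]) return lo[p];  // leaf decides without any comparison
        // Some split points share the key's 2-byte prefix: compare them fully.
        int part = lo[p];
        while (part < hi[p] && compare(splitPoints[part], key) < 0) part++;
        return part;
    }

    private static int prefix(byte[] b) {
        return ((b[0] & 0xff) << 8) | (b[1] & 0xff);
    }

    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}

Since few split points usually share a 2-byte prefix, partition() costs O(1) for almost every record.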
4 Sorting algorithm ideas
For sorting massive data, the idea of the algorithm stays the same; the principle does not change with the scale of the data.
A brief description of the process (a runnable single-machine sketch of all six stages follows this list):
1. Data preparation stage: read the data to be sorted from the target HDFS file, for example:
12 24 11 60 23 46 79 1 21
2. Sampling stage: obtain the sample points that will divide the buckets, for example:
11 24 46
3. Bucket definition stage: the sample points define 4 buckets: -N ~ 11, (11 + 1) ~ 24, (24 + 1) ~ 46, (46 + 1) ~ +N, where N denotes infinity.
4. Bucket division stage: place each record into the bucket whose interval contains it, for example:
Bucket A: -N ~ 11 -- 11, 1
Bucket B: (11 + 1) ~ 24 -- 12, 24, 23, 21
Bucket C: (24 + 1) ~ 46 -- 46
Bucket D: (46 + 1) ~ +N -- 60, 79
Note that the number of buckets is the number of reduce tasks.
Because the bucket intervals themselves are already in order, once the data inside each bucket is sorted we only need to output the buckets in sequence to obtain the fully sorted result.
5. Bucket sorting stage: sort the data inside each bucket; the stage ends when every bucket has been sorted. This stage is efficient when the buckets are evenly filled, so the sampling algorithm should try to make the division as even as possible. MapReduce sorts the different buckets in parallel, and the algorithm used inside each bucket can be chosen according to actual needs. In short, the result of this stage is that the data within each bucket is ordered.
6. Merge stage: concatenate the sorted buckets in order and output the result.
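As promised above, here is a compact, runnable single-machine sketch of the six stages (the class and variable names are mine; each list plays the role of one reduce task's bucket, and Collections.sort stands in for the per-bucket sort that a reduce task would perform):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A single-machine sketch of the six stages above; not hadoop code.
public class BucketSortSketch {

    public static void main(String[] args) {
        // 1. Data preparation stage: the records to be sorted.
        int[] data = {12, 24, 11, 60, 23, 46, 79, 1, 21};

        // 2. Sampling stage: sample points defining the bucket boundaries.
        int[] samples = {11, 24, 46};

        // 3. Bucket definition stage: N sample points define N + 1 buckets.
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i <= samples.length; i++) {
            buckets.add(new ArrayList<>());
        }

        // 4. Bucket division stage: a record's bucket index is the number
        // of sample points smaller than it (what the partitioner computes).
        for (int v : data) {
            int b = 0;
            while (b < samples.length && samples[b] < v) b++;
            buckets.get(b).add(v);
        }

        // 5. Bucket sorting stage: sort each bucket independently; in a
        // real job every reduce task does this in parallel.
        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket);
        }

        // 6. Merge stage: output the buckets in order; because the bucket
        // intervals are ordered, the concatenation is globally sorted.
        List<Integer> result = new ArrayList<>();
        for (List<Integer> bucket : buckets) {
            result.addAll(bucket);
        }
        System.out.println(result);  // [1, 11, 12, 21, 23, 24, 46, 60, 79]
    }
}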
5 Sorting 100 million records whose keys are integers
The code that looks up a key's partition, wrapped in a Partitioner skeleton (the class name is illustrative):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SplitPointPartitioner extends Partitioner<IntWritable, Text> {

    // Ascending bucket split points (sample points), loaded from the
    // distributed cache before the job runs.
    private Integer[] splitPoints;

    /**
     * @param arr ascending bucket split points (sample points)
     * @param des the key to place
     * @return the bucket (partition) index for the key
     */
    private int find(Integer[] arr, int des) {
        int low = 0;
        int high = arr.length - 1;
        while (low <= high) {
            int middle = (low + high) / 2;
            if (des == arr[middle]) {
                return middle;
            } else if (des < arr[middle]) {
                high = middle - 1;
            } else {
                low = middle + 1;
            }
        }
        // No exact match: low is the index of the first split point greater
        // than the key, which is exactly the bucket this key falls into.
        return low;
    }

    /**
     * Executed once for every record.
     * @return the partition number; e.g. 0 means this record goes to partition 0
     */
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        return find(splitPoints, key.get());
    }
}
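A quick sanity check of find() (my own snippet, not part of the original code): with the split points {11, 24, 46} from the worked example, the keys land in the buckets A-D worked out above (A = 0, B = 1, C = 2, D = 3).

Integer[] splitPoints = {11, 24, 46};
int[] keys = {12, 24, 11, 60, 23, 46, 79, 1, 21};
for (int k : keys) {
    System.out.println(k + " -> partition " + find(splitPoints, k));
}
// prints 12->1, 24->1, 11->0, 60->3, 23->1, 46->2, 79->3, 1->0, 21->1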
Imitating terasort, the 100 million records are divided into 100,000 buckets; the 100 million random records amount to about 3.4 GB.
The first page of the sorting result: