Contents
- 1. Quicksort
- 2. Heap
The top K problem is a classic problem.
The problem is stated as follows: given n integers, output the smallest k of them. For example, for the input 1, 2, 3, 4, 5, 6, 7, 8, the smallest 4 elements are 1, 2, 3, 4.
Beyond that, "top K" also commonly refers to finding the K items with the highest frequency in massive data, or the K largest numbers in massive data. For example: finding the 10 most popular query terms in a search engine, or the 10 most-downloaded songs in a music library.
When the top K problem comes up, two ideas immediately spring to mind: quicksort and the heap. Why these two? The reasons follow.
1. Quicksort
Given a pivot element, the partition step of quicksort divides an array into two parts around that element and returns the pivot's final index. How does this help with top K? The pivot index tells us exactly how many elements lie before the pivot. By comparing that count with k and recursing into the appropriate half, we can arrange the array so that its first k elements are the k smallest.
An implementation of this idea can be found at http://blog.csdn.net/ohmygirl/article/details/7846544 (using quicksort-style partitioning to select the smallest k elements of an array).
2. Heap.
A heap is in essence a complete binary tree. Heaps solve two kinds of problems well: A. Sorting: because the heap is a complete binary tree, heapsort sorts an n-element array in O(n log n) time and needs only a constant amount of extra space. B. Priority queues: by inserting new elements and re-adjusting the structure, the heap property is maintained, and each operation takes O(log n) time.
A common implementation stores the heap in an array of n elements, leaving unit 0 unused. The elements are numbered from top to bottom and from left to right. For an element numbered i:
A. If the left child exists, its number is 2*i.
B. If the right child exists, its number is 2*i + 1.
C. If the parent exists, its number is i/2 (integer division).
D. If the node is a leaf, both its children are empty; a position i falls outside the heap when i < 1 or i > n.
This design makes the heap very convenient for top K problems. First build a heap of size K (a min-heap if the largest top K is wanted, a max-heap if the smallest top K is wanted), then scan the array: compare each element with the heap root, insert qualifying elements into the heap, and re-adjust so the heap property holds. After the scan, the elements remaining in the heap are the final result. Adjusting the heap relies on two algorithms: shiftDown and shiftUp.
Take the minimum heap as an example:
The upward adjustment (shiftUp) code is as follows (with a small swap helper for exchanging two ints):

void swap(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}

/* Restore the min-heap property after appending an element at index n:
 * float the new element up while it is smaller than its parent. */
void shiftUp(int *heap, int n) {
    int i = n;
    for (;;) {
        if (i == 1) break;         /* reached the root */
        int p = i / 2;             /* parent of i      */
        if (heap[p] <= heap[i]) break;
        swap(&heap[p], &heap[i]);
        i = p;
    }
}
The downward adjustment (shiftDown) code is as follows:

/* Restore the min-heap property after replacing the root:
 * sink heap[1] down, always swapping with the smaller child. */
void shiftDown(int *heap, int n) {
    int i = 1;
    for (;;) {
        int c = 2 * i;                            /* left child            */
        if (c > n) break;                         /* i is a leaf           */
        if (c + 1 <= n && heap[c + 1] <= heap[c])
            c++;                                  /* pick the smaller child */
        if (heap[i] <= heap[c]) break;
        swap(&heap[c], &heap[i]);
        i = c;
    }
}
With these basic heap operations in place, top K has a foundation (though top K can also be solved entirely without a heap). Take the smallest top K as an example, which requires a max-heap of size K: scan the original array, push its first k elements into the heap, and adjust to maintain the heap property. For each element after the k-th, if it is smaller than the heap root, replace the root and re-adjust the heap. When the scan completes, the elements held in the heap are the final result.
Thinking further: how should the top K problem be handled for truly massive data? The heap algorithm is still feasible, but are there other approaches?
On massive data processing, July's blog post on cracking 99% of massive-data problems is recommended: http://blog.csdn.net/v_july_v/article/details/7382693
Another blog that can be referenced: http://dongxicheng.org/big-data/select-ten-from-billions/
I have been studying Hadoop recently, so my thought was: would it be more efficient to implement top K with Hadoop's MapReduce? After all, Hadoop is built for massive data, and parallel computing gives it a real advantage there.
The MapReduce idea is also very simple: for the coding, you only need to define the job class and its static inner Mapper and Reducer classes.
Below is a reprinted piece of MapReduce top K code (I have not tested it; original address: http://www.linuxidc.com/Linux/2012-05/60234.htm):
package jtlyuan.csdn;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Use MapReduce to find the largest K numbers in massive data.
public class TopKNum extends Configured implements Tool {
    public static class MapClass extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        public static final int K = 100;
        private int[] top = new int[K];

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] str = value.toString().split(",", -2);
            try {
                int temp = Integer.parseInt(str[8]);
                add(temp);
            } catch (NumberFormatException e) {
                // non-numeric fields are ignored
            }
        }

        private void add(int temp) {
            // top[] is kept sorted ascending; top[0] is the smallest of the K
            if (temp > top[0]) {
                top[0] = temp;
                int i = 0;
                for (; i < K - 1 && temp > top[i + 1]; i++) {
                    top[i] = top[i + 1];
                }
                top[i] = temp;
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (int i = 0; i < K; i++) {
                context.write(new IntWritable(top[i]), new IntWritable(top[i]));
            }
        }
    }

    public static class Reduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        public static final int K = 100;
        private int[] top = new int[K];

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable val : values) {
                add(val.get());
            }
        }

        private void add(int temp) {
            if (temp > top[0]) {
                top[0] = temp;
                int i = 0;
                for (; i < K - 1 && temp > top[i + 1]; i++) {
                    top[i] = top[i + 1];
                }
                top[i] = temp;
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (int i = 0; i < K; i++) {
                context.write(new IntWritable(top[i]), new IntWritable(top[i]));
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "TopKNum");
        job.setJarByClass(TopKNum.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MapClass.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TopKNum(), args);
        System.exit(res);
    }
}

/* Part of the result listing:
 * 306 306  307 307  309 309  313 313  320 320  346 346  348 348
 * 393 393  394 394  472 472  642 642  706 706  868 868
 */
Now we have a new approach for processing massive data: Hadoop MapReduce.
Once again, MapReduce really is a powerful tool for massive data processing~