PHP heap implementation of the TopK algorithm: an example

Source: Internet
Author: User
Tags: shuffle

A binary heap is a special kind of heap: it is either a complete binary tree or an approximately complete one. There are two kinds, the max-heap and the min-heap (also called a small top heap). In a max-heap, the key of a parent node is always greater than or equal to the key of each of its child nodes; in a min-heap, the key of a parent node is always less than or equal to the key of each of its child nodes.

Figure: a min-heap (image from the web)

A binary heap is usually stored in an array. With the root at index 0, the children of the node at index n sit at indexes 2n+1 and 2n+2. So the children of node 0 are at 1 and 2, the children of node 1 are at 3 and 4, and so on. This layout makes it easy to find a node's parent and children.
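The index arithmetic can be sketched in PHP as follows. The helper names (parent_index, left_index, right_index, is_min_heap) are hypothetical and do not come from the original article; they only illustrate the array layout described above.

```php
<?php
// Hypothetical helpers for a binary heap stored in a 0-based array.

// Index of the parent of node $i (only meaningful for $i >= 1).
function parent_index(int $i): int
{
    return intdiv($i - 1, 2);
}

// Index of the left child of node $i.
function left_index(int $i): int
{
    return 2 * $i + 1;
}

// Index of the right child of node $i.
function right_index(int $i): int
{
    return 2 * $i + 2;
}

// Check the min-heap property: every parent is <= each of its children.
function is_min_heap(array $heap): bool
{
    $n = count($heap);
    for ($i = 1; $i < $n; $i++) {
        if ($heap[parent_index($i)] > $heap[$i]) {
            return false;
        }
    }
    return true;
}

var_dump(is_min_heap([1, 3, 2, 7, 4, 5])); // bool(true)
var_dump(is_min_heap([5, 3, 2]));          // bool(false)
```

Checking each node against its parent is enough, because every parent-child pair in the array is visited exactly once.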

We won't go further into the concepts here; if anything about binary heaps is unclear, a data structures text covers them well. Below we solve the TopN problem in PHP. To make the difference visible, we first implement it with a sort and look at the result.

Implementing TopN with quicksort

// Raise the memory limit a little for this test
ini_set('memory_limit', '2024M');

// A simple quicksort that returns the array in descending order
function quick_sort(array $array)
{
    $length = count($array);
    $left_array = array();
    $right_array = array();
    if ($length <= 1) {
        return $array;
    }
    // Use the first element as the pivot
    $key = $array[0];
    for ($i = 1; $i < $length; $i++) {
        if ($array[$i] > $key) {
            $right_array[] = $array[$i];
        } else {
            $left_array[] = $array[$i];
        }
    }
    $left_array = quick_sort($left_array);
    $right_array = quick_sort($right_array);
    // Larger values first, then the pivot, then the smaller values
    return array_merge($right_array, array($key), $left_array);
}

// Build 5,000,000 distinct numbers
for ($i = 0; $i < 5000000; $i++) {
    $numArr[] = $i;
}
// Shuffle them
shuffle($numArr);

// Now find the 10 largest numbers
var_dump(time());
print_r(array_slice(quick_sort($numArr), 0, 10));
var_dump(time());


Results after running

The run prints the Top10 along with a timestamp before and after; it took about 99 s. But this is only 5 million numbers, all of which fit in memory. With a file holding 50 million or 500 million numbers, this approach runs into trouble.

Implementing TopN with a binary heap
The process is:
1. First read 10 (or 100) numbers into an array; this array holds our TopN candidates.
2. Call the min-heap function to arrange this array as a min-heap. At this point the top of the heap is the smallest element.
3. Iterate through all the remaining numbers, from a file or an array, one by one.
4. Compare each number with the top of the heap. If it is not greater than the top element, discard it; if it is greater, replace the top element with it.
5. After replacing the top element, call the min-heap function again to restore the heap, because the new minimum has to end up at the top.
6. Repeat the steps above. When the traversal finishes, the min-heap holds the N largest numbers, because it always evicts the smallest value and keeps the larger ones. Restoring the heap is also fast: it is only a local adjustment that stops as soon as the node being moved is no larger than its left and right children.
7. As for complexity: in the worst case for Top10, every number traversed replaces the top of the heap and triggers an adjustment. Each adjustment of a 10-element heap takes only a handful of comparisons and swaps (the heap is at most 3 levels deep below the root), which is far less work than a full sort. The full data set never has to be read into memory, so the whole process can be understood as a single linear pass.
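Steps 1 to 6 can be sketched compactly as follows. This is a minimal sketch that swaps in PHP's built-in SplMinHeap class in place of a hand-written heap function; the function name top_k is hypothetical and not from the original article.

```php
<?php
// Sketch of the TopN steps using the built-in SplMinHeap.
// top_k() is a hypothetical name used only for illustration.
function top_k(iterable $numbers, int $k): array
{
    if ($k <= 0) {
        return [];
    }
    $heap = new SplMinHeap();
    foreach ($numbers as $n) {
        if ($heap->count() < $k) {
            // Steps 1-2: fill the heap with the first $k numbers.
            $heap->insert($n);
        } elseif ($n > $heap->top()) {
            // Steps 4-5: evict the smallest kept value, then insert the
            // new one; SplMinHeap restores the heap order itself.
            $heap->extract();
            $heap->insert($n);
        }
        // Otherwise $n is not greater than the heap top and is discarded.
    }
    // Step 6: iterating the heap yields the kept values in ascending
    // order, so reverse them to list the largest first.
    return array_reverse(iterator_to_array($heap, false));
}

$numbers = range(0, 99);
shuffle($numbers);
print_r(top_k($numbers, 5)); // the values 99, 98, 97, 96, 95
```

The heap never grows beyond $k elements, so memory use depends on $k rather than on how many numbers are traversed.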

// Min-heap function: sift the element at $idx down to its proper place
function heap(&$arr, $idx)
{
    $left = ($idx << 1) + 1;
    $right = ($idx << 1) + 2;
    // No left child means $idx is a leaf; nothing to adjust
    if (!isset($arr[$left])) {
        return;
    }
    // Pick the smaller of the two children
    if (isset($arr[$right]) && $arr[$right] < $arr[$left]) {
        $l = $right;
    } else {
        $l = $left;
    }
    // If the parent is larger, swap it down and keep adjusting from there
    if ($arr[$idx] > $arr[$l]) {
        $tmp = $arr[$idx];
        $arr[$idx] = $arr[$l];
        $arr[$l] = $tmp;
        heap($arr, $l);
    }
}

// To stay consistent with the example above, again build 5,000,000 distinct numbers
/*
 * Of course, this data set does not have to sit in memory; it could also
 * live in a file, because we never load all of it into memory to sort it.
 */
for ($i = 0; $i < 5000000; $i++) {
    $numArr[] = $i;
}
// Shuffle them
shuffle($numArr);

// First take 10 numbers into an array
$topArr = array_slice($numArr, 0, 10);

// Get the index of the last node that has children. Building the min-heap
// starts from the last node with a left or right child and works upward
// from the bottom, node by node.
$idx = (int) floor(count($topArr) / 2) - 1;

// Build the min-heap
for ($i = $idx; $i >= 0; $i--) {
    heap($topArr, $i);
}

var_dump(time());
// Now traverse all the remaining elements
for ($i = count($topArr); $i < count($numArr); $i++) {
    // Compare each element with the top of the heap
    if ($numArr[$i] > $topArr[0]) {
        // If it is greater than the top of the heap, replace the top
        $topArr[0] = $numArr[$i];
        /*
         * Call the min-heap function again to maintain the heap, this time
         * starting at the top and working downward. Only the top element
         * was replaced; every other parent is still no larger than its
         * children. This is the local adjustment described above, not a
         * full rebuild.
         */
        heap($topArr, 0);
    }
}
var_dump(time());
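The same adjustment can also be written as a loop in place of the recursion. The sketch below assumes a 0-based array with consecutive integer keys; sift_down is a hypothetical name, and the small input array is made up for illustration.

```php
<?php
// Iterative sift-down on a min-heap stored in a 0-based array.
// sift_down() is a hypothetical name; the code above uses recursion.
function sift_down(array &$arr, int $idx): void
{
    $n = count($arr);
    while (true) {
        $left = 2 * $idx + 1;
        $right = $left + 1;
        if ($left >= $n) {
            return; // no children: nothing to adjust
        }
        // Pick the smaller of the two children.
        $smallest = ($right < $n && $arr[$right] < $arr[$left]) ? $right : $left;
        if ($arr[$idx] <= $arr[$smallest]) {
            return; // heap property already holds at this node
        }
        // Swap the parent with its smaller child and continue from there.
        [$arr[$idx], $arr[$smallest]] = [$arr[$smallest], $arr[$idx]];
        $idx = $smallest;
    }
}

// Build a min-heap bottom-up, starting at the last node that has children.
$heap = [9, 4, 7, 1, 8, 2];
for ($i = intdiv(count($heap), 2) - 1; $i >= 0; $i--) {
    sift_down($heap, $i);
}
echo $heap[0], "\n"; // prints 1: the smallest value is now at the top
```

Using count() as the bound, where the code above uses an isset() test, relies on the assumption that the array keys run from 0 to count($arr) - 1 without gaps.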


Results after running

The final result is again the Top10, but this time it took only about 1 s, and both memory use and running time meet our requirements. The biggest advantage over sorting is that the whole data set does not have to be read into memory, because we never sort it. The code above builds the 5 million elements directly in memory only for demonstration; we could just as well move them all into a file and then read and compare line by line. The core of the approach is a linear traversal compared against a small in-memory min-heap, which finally yields the TopN.
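Reading from a file line by line might look like the sketch below. It assumes a text file with one integer per line and uses PHP's built-in SplMinHeap to keep the example short; the function name top_k_from_file and the file format are assumptions, not part of the original article.

```php
<?php
// Sketch: stream numbers from a text file (assumed format: one integer per
// line) and keep only the $k largest in a small in-memory min-heap.
// top_k_from_file() is a hypothetical name used only for illustration.
function top_k_from_file(string $path, int $k): array
{
    if ($k <= 0) {
        return [];
    }
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $heap = new SplMinHeap();
    // Only one line at a time is held in memory, plus at most $k numbers.
    while (($line = fgets($handle)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        $n = (int) $line;
        if ($heap->count() < $k) {
            $heap->insert($n);
        } elseif ($n > $heap->top()) {
            $heap->extract();
            $heap->insert($n);
        }
    }
    fclose($handle);
    // The heap iterates in ascending order; reverse to list largest first.
    return array_reverse(iterator_to_array($heap, false));
}

// Usage: write a small sample file, then read it back line by line.
$path = tempnam(sys_get_temp_dir(), 'topk');
file_put_contents($path, implode("\n", [42, 7, 19, 88, 3, 61]) . "\n");
print_r(top_k_from_file($path, 3)); // the values 88, 61, 42
unlink($path);
```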

End
One last point: algorithms and data structures really do matter. A good algorithm can improve efficiency enormously.
