Php method of implementing topk-algorithm using binary heap

Source: Internet
Author: User
Tags shuffle
This article mainly introduces PHP using binary heap to implement the topk-algorithm, the article introduced in very detailed, for everyone has a certain reference learning value, the need for friends to follow the small series together to learn it.

Objective

In the past work or interview often encountered a problem, how to achieve massive topn, is in a very large result set to quickly find the largest top 10 or the first 100 numbers, while to ensure the efficiency of memory and speed, we may be the first idea is to use the sort, and then intercept the first 10 or 100, And the order is not particularly large when the amount of time without any problems, but as long as the volume is extremely large is impossible to complete the task, such as in an array or text file has hundreds of millions of numbers, so that is not all read into the memory, so the use of sorting to solve the problem is not the best, So here we use PHP to implement a small top heap to solve this problem.

Two-fork Pile

A binary heap is a special kind of heap, the binary heap is either a complete binary tree or an approximate complete binary tree, with two piles, a maximum heap and a minimum heap, and a maximum heap: The parent node's key value is always greater than or equal to the key value of any one child node; minimum heap: The key value of the parent node is always less than or equal to the key value of any one child


Small top heap-(pictures from the web)

The binary heap is generally represented by an array (see), for example, the root node in the array position is 0, the nth position of the child nodes in 2n+1 and 2n+2, so, the No. 0 position of the child nodes in the 1 and 2,1 child nodes at 3 and 4, and so on, this storage method is to find the parent node and child nodes.

The specific concept of the problem here is not more said, if the two fork heap in doubt can be in a good understanding of the data structure, the following we topn the above problem to use PHP code to achieve and solve, in order to see the difference here first with a sort of way to achieve the next look at the effect.

Using fast sorting algorithm to realize TopN

In order to test run memory size a bit ini_set (' memory_limit ', ' 2024M ');//Implement a quick Sort function quick_sort (array $array) {$length = count ($array ); $left _array = Array (); $right _array = Array (); if ($length <= 1) {  return $array;} $key = $array [0]; for ($i =1; $i < $length; $i + +) {  if ($array [$i] > $key) { c4/> $right _array[] = $array [$i];  } else{   $left _array[] = $array [$i];  }} $left _array = Quick_sort ($left _array); $right _array = Quick_sort ($right _ Array); Return Array_merge ($right _array,array ($key), $left _array); }//construct 500w distinct number for ($i =0; $i <5000000; $i + +) {$NUMARR [] = $i;} Disrupt them shuffle ($NUMARR);//Now we find Top10 the largest number Var_dump (Time ());p Rint_r (Array_slice (Quick_sort ($all), 0,10)); var_ Dump (Time ());

Results after running

Can see the above print out the results of TOP10, and output the next run time, about 99s, but this is only 500w number and all can be loaded into memory, if we have a file with 5kw or 500 million numbers, there will be some problems.

Using binary heap algorithm to realize TopN

The implementation process is:

1, first read 10 or 100 numbers into the array inside, this is our TOPN number.

2, call to generate a small top heap function, this array to generate a small top heap structure, this time the heap top must be the smallest.

3. Iterate through all the remaining numbers from a file or array.

4, each traversal of a stack with the top of the element size comparison, if less than the heap top element is discarded, if it is larger than the top element of the heap is replaced.

5, after replacing the top element with the heap, in the call to generate a small top heap function to continue to generate small top heap, because it needs to find a minimum.

6, repeat the above-mentioned steps, so that when the full traversal, we this small top heap inside is the biggest topn, because our small top heap is always ruled out the smallest left the largest, and this adjustment small top heap speed is also very fast, just relative adjustment, as long as the root node is less than the left and right nodes can be.

7, the algorithm complexity words according to TOP10 worst case, is each traverse a number, if with the heap top to replace, needs to adjust 10 times the situation, also is faster than the sorting speed, but also does not have all the content reads into the memory, can understand that is a linear traversal.

Generate a small top heap function heap (& $arr, $idx) {$left = ($idx << 1) + 1; $right = ($idx << 1) + 2; if (! $arr [$left]) { Return if ($arr [$right] && $arr [$right] < $arr [$left]) {$l = $right;}    else{$l = $left;} if ($arr [$idx] > $arr [$l]) {$tmp = $arr [$idx];   $arr [$idx] = $arr [$l];   $arr [$l] = $tmp; Heap ($arr, $l); }}//here in order to ensure consistent with the above, also constructs 500w not repeat number/* Of course, this data set is not necessarily all in memory, but also in the file, because we are not all loaded into memory to sort */for ($i =0; $i <5000000; $i + +) {$ Numarr[] = $i; }//disturb them shuffle ($NUMARR);//First remove 10 to array $toparr = Array_slice ($NUMARR, 0,10);//Get the last index position with child nodes// Because in the construction of the small top heap is from the last position of the left or right node//start from the bottom of the continuous movement of the structure (specifically to see the above figure to understand) $idx = Floor (count ($TOPARR)/2)-1;//generates a small top heap for ($i = $IDX; $i >=0; $i-) {Heap ($TOPARR, $i);} Var_dump (Time ());//You can see here that you start traversing all the remaining elements for ($i = count ($TOPARR), $i < count ($NUMARR), $i + +) {//each traversal is compared to the top element of the heap if (  $NUMARR [$i] > $TOPARR [0]) {//If it is greater than the top of the heap, replace $TOPARR [0] = $NUMARR [$i]; /* Re-call to generate the small top heap function for maintenance, but this time from the top of the heap at the index position starting from the bottom of the maintenance, because we just put the top of the heap to replace the elements and the rest of the root node is less than the left and right node in the order of this is what we said,Just the relative adjustment, not all adjustment once */Heap ($TOPARR, 0); }}var_dump (Time ());

Results after running

Can see the final result is also Top10, only the time used only about 1s, and both memory and time efficiency to meet our requirements, and with the ranking is the best thing is not to read all the data set into memory, because we do not need to sort, and the above is to demonstrate, So directly in memory constructs the 500w element, however we can transfer this all to the file, and then the line reads the comparison, because our data structure core point is the linear traversal and in the memory small top heap structure compares, finally obtains topn.

The above is the whole content of this article, I hope that everyone's study has helped.


Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.