PHP uses the binary heap to implement the TopK-algorithm-
Preface
In previous work or interviews, I often encountered a problem. How to Achieve a massive TopN is to quickly find the maximum number of top 10 or top 100 in a very large result set, at the same time, to ensure the memory and speed efficiency, we may be the first to use sorting, and then extract the first 10 or the first 100. Sorting is not a problem when the amount is not very large, however, it is impossible to complete this task as long as the volume is particularly large. For example, there are hundreds of millions of data entries in an array or text file, so that it cannot be fully read into the memory, so using sorting to solve this problem is not the best, so we will use php to implement a small top heap to solve this problem.
Binary heap
The binary heap is a special heap. The binary heap is a complete binary tree or an approximate full binary tree. The binary Heap has two types: Maximum heap and minimum heap. Maximum heap: the key value of the parent node is always greater than or equal to the key value of any subnode; the minimum heap: the key value of the parent node is always less than or equal to the key value of any subnode
Small top heap-(picture from network)
Binary heap is usually represented by arrays (see). For example, the root node is 0 in the array, and the child nodes at the nth position are 2n + 1 and 2n + 2, respectively, therefore, the sub-nodes in the 0th locations are in the 3 and 4 sub-nodes in the 1 and 2, and so on. This storage method is used to find the parent and sub-nodes.
I will not talk about the specific concept here. If you have any questions about the binary heap, you can have a good understanding of the data structure. Next we will use php code to implement and solve the topN problem, in order to see the difference, we first use the sorting method to see how it works.
Use the quick sorting algorithm to implement TopN
// Tune ini_set ('memory _ limit ', '2024m') to test the running memory; // implement a quick sorting function: quick_sort (array $ array) {$ length = count ($ array); $ left_array = array (); $ right_array = array (); if ($ length <= 1) {return $ array ;} $ key = $ array [0]; for ($ I = 1; $ I <$ length; $ I ++) {if ($ array [$ I]> $ key) {$ right_array [] = $ array [$ I];} else {$ left_array [] = $ array [$ I] ;}}$ left_array = quick_sort ($ left_array ); $ right_array = quick_sort ($ right_array); return array_merge ($ right_array, array ($ key), $ left_array);} // construct a 500w for ($ I = 0; $ I <5000000; $ I ++) {$ numArr [] = $ I;} // shuffle them ($ numArr ); // now we can find the largest number of top 10 var_dump (time (); print_r (array_slice (quick_sort ($ all),); var_dump (time ());
Results After running
We can see that the top 10 results are printed above, and the running time is about S. However, this is only million data records and all data can be loaded into the memory, if we have a file with a number of 5 kW or 0.5 billion, there will certainly be some problems.
TopN using the binary heap Algorithm
The implementation process is as follows:
1. First read 10 or 100 numbers into the array, which is our topN number.
2. Call the generate small top heap function to generate a small top Heap Structure for this array. At this time, the top heap must be the smallest.
3. traverse all the remaining numbers from a file or array in sequence.
4. Compare the size of each element that is traversed with the heap top. If it is smaller than the heap top element, discard it. If it is greater than the heap top element, replace it.
5. After replacing the element with the heap top, call the function to generate the heap top and generate the smallest heap.
6. Repeat the above 4 ~ Step 5: When all the traversal is complete, the top heap in our small heap is the largest topN, because our small top heap will always be the smallest and largest, in addition, the speed of this small top heap adjustment is also very fast, but with relative adjustment, you only need to ensure that the root node is smaller than the left and right nodes.
7. If the algorithm is complex, the worst case is to traverse a number every time. If it is replaced with the heap top, it needs to be adjusted 10 times, which is faster than sorting, in addition, instead of reading all the content into the memory, it can be understood as a linear traversal.
// Generate the function Heap (& $ arr, $ idx) {$ left = ($ idx <1) + 1; $ right = ($ idx <1) + 2; if (! $ Arr [$ left]) {return;} if ($ arr [$ right] & $ arr [$ right] <$ arr [$ left]) {$ l = $ right;} else {$ l = $ left;} if ($ arr [$ idx]> $ arr [$ l]) {$ tmp = $ arr [$ idx]; $ arr [$ idx] = $ arr [$ l]; $ arr [$ l] = $ tmp; Heap ($ arr, $ l) ;}/// to ensure the consistency with the above, we also construct a-W no-duplicated complex number/* Of course, This dataset is not necessarily fully stored in the memory or in the file, because not all of them are loaded into the memory for sorting */for ($ I = 0; $ I <5000000; $ I ++) {$ numArr [] = $ I;} // shuffle them ($ numArr); // first, retrieve 10 to the array $ topArr = array_slice ($ numArr ); // obtain the index position of the last sub-node. // when constructing a small top heap, the index position of the last left or right node starts from the bottom up. // mobile construction (see the figure above for details) $ idx = floor (count ($ topArr)/2)-1; // generate a small top heap for ($ I = $ idx; $ I >=0; $ I --) {Heap ($ topArr, $ I);} var_dump (time (); // you can see here, is to start traversing all the remaining elements for ($ I = count ($ topArr); $ I <count ($ numArr); $ I ++) {// compare the size of each traversal with the heap top element if ($ numArr [$ I]> $ topArr [0]) {// if the value is greater than the heap top element, replace $ topArr [0] = $ numArr [$ I];/* re-call to generate a small heap function for maintenance, however, this time, the index position on the top of the heap is maintained from top to bottom, because we just replaced the elements on the top of the heap, and the rest are placed in the order that the root node is smaller than the left and right nodes. This is what we mentioned above, but it is relatively adjusted, not all adjustments */Heap ($ topArr, 0) ;}} var_dump (time ());
Results After running
We can see that the final result is also the top 10, but it only takes about 1 s, and both the memory and time efficiency meet our requirements, what's better than sorting is that you don't need to read all the datasets into the memory, because we don't need to sort them, but the above is for demonstration, therefore, a million elements are directly constructed in the memory. However, we can transfer all the elements to the file, and then read and compare them in one row, because the core point of our data structure is linear traversal compared with the small top Heap Structure in the memory, the TopN is finally obtained.
Summary
The last thing we want to talk about is that the algorithm + data structure is really very important. A good algorithm can greatly improve our efficiency. Well, the above is all the content of this article. I hope the content of this article will help you in your study or work. If you have any questions, you can leave a message, thank you for your support.