A classic interview question: how do you select the n largest (or smallest) numbers from N numbers?
I have been thinking about this problem on and off for almost a year, and have discussed it with many people. From what I have heard, both Google and Microsoft have asked this question. Many people have probably heard of it or already know the answer (the so-called "heap" solution), but I still want to write down my own analysis. It may well contain holes; I am posting it for the sake of discussion. This is a rich problem and quite an interesting one, and after reading this article you may be inspired by solutions you had never considered (I have always suspected that the heap may not be the most efficient algorithm). For uniformity, the whole article uses finding the n "largest" numbers as the running example. Note: I will go into detail so that most readers can follow; please bear with me :) I wonder how many people will have the patience to read to the end!
Naive method:
First, assume everything fits in memory, that is, the N numbers can all be loaded at once and stored in an array (if they had to live in a linked list, that would be another interesting problem). Start from the simplest case: if n = 1, then without question you need N-1 comparisons to obtain the maximum; a single pass over the N numbers suffices. What about n = 2? You could simply traverse the array twice: the first pass finds the maximum max1; the second pass finds max2, but on every step you must also check that the current element's index differs from max1's index, which noticeably increases the number of comparisons. One way around this is to use max1's position as a split point, divide the array into the part before it and the part after it, traverse each part to get its maximum, and take the larger of those two as max2.
You can also solve it in a single pass. Maintain two elements max1 and max2 (max1 >= max2). Take a number from N and compare it with max1 first: if it is larger than max1 (and hence also larger than max2), then max1 becomes the new max2 and the new number becomes max1; otherwise compare it with max2 to decide whether it should replace max2. The same scheme works for n = 2, 3, 4, and so on. The time complexity of this algorithm is O(N*n). As n grows (there is no point in n exceeding N/2, since otherwise we could switch to the dual problem of finding the N-n smallest numbers), the algorithm gets less and less efficient. But when n is fairly small (how small is hard to say), this algorithm is simple and avoids overheads such as recursive calls, so its actual efficiency should be very good.
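To make the single-pass idea concrete, here is a minimal sketch for general n, keeping the current top n in a descending sorted list; the function name and layout are my own, not taken from any library.

```python
def naive_top_n(nums, n):
    """Keep the n largest numbers seen so far, sorted descending.

    Each new number is compared with max1, max2, ... in turn until a
    smaller one is found, as described above: about n/2 comparisons
    per number on average, O(N*n) in total.
    """
    top = sorted(nums[:n], reverse=True)    # top[0] is max1, top[1] is max2, ...
    for m in nums[n:]:
        i = 0
        while i < n and top[i] >= m:        # walk down max1, max2, ...
            i += 1
        if i < n:                           # m beats max(i+1): shift the rest down
            top.insert(i, m)
            top.pop()                       # the old maxn falls out
    return top
```

For example, naive_top_n([5, 1, 9, 3, 7, 8], 2) returns [9, 8].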
Heap:
What algorithm should we use when n is large? First, analyze the algorithm above: each time a new number M is taken from N, it must be compared in turn with max1, max2, max3, ..., maxn until some maxX smaller than M is found and replaced by M; the average number of comparisons is n/2. Can we get away with fewer comparisons? The most intuitive answer is the heap so favored by online articles. A heap has the following advantages: 1. it is a complete binary tree, so its depth is the minimum possible for its number of nodes, which makes maintenance cheap; 2. it can be implemented with an array, where a parent node p and its left and right children l, r satisfy the index relations l = 2*p + 1 and r = 2*p + 2; the multiplication 2*p can be done with a one-bit left shift, which is very efficient, and arrays also support efficient random access; 3. the extract operation of the heap, i.e., taking the heap top and re-maintaining the heap, costs O(log n), where n is the heap size.
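The index relations in point 2 are easy to check with a couple of lines (0-based indices; the function names are mine):

```python
def children(p):
    # left and right child indices of node p in a 0-based array heap;
    # 2*p is computed with a one-bit left shift
    return (p << 1) + 1, (p << 1) + 2

def parent(c):
    # the inverse relation, also a single shift
    return (c - 1) >> 1

# children(0) == (1, 2); parent(1) == parent(2) == 0
```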
How do we apply it to our problem? First open an array A of size n, read the first n numbers from N into A, and maintain A as a min-heap (so the smallest of the n numbers sits at the heap top A[0]). Then take the next number M (the (n+1)-th) from N and compare it with the heap top A[0]. If M <= A[0], discard M directly. Otherwise replace A[0] with M. The heap property of A may now be violated, so re-maintain the heap starting from A[0]: compare A[0] with its left and right children (note that this takes "two" comparisons per level to determine the smallest of the three; I will come back to this when discussing the loser tree). If A[0] is smaller than both children, the heap property already holds and we can stop. Otherwise, swap A[0] with the smaller child, and continue maintaining that child's subtree in the same way. Proceed like this until all of N has been traversed; the n numbers left in the heap are the n largest of the N numbers. This is basic heap-sort knowledge; the only trick is to maintain a min-heap rather than a max-heap, which takes a moment's thought. One heap maintenance costs O(log n), so the overall complexity is O(N*log n); when n is large enough, the heap is certainly more efficient than the naive method. Of course, one could also build a heap over all N numbers directly and extract the top n times, for a complexity of O(N + n*log N), which in many cases involves fewer comparisons. But there is no way to do that for online data, e.g., when the N numbers cannot be loaded into memory at once, or even arrive as a stream with N unknown.
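Here is a minimal sketch of this min-heap method, assuming the input is any iterable with at least n items; I write out the sift-down by hand (rather than calling a library such as heapq) so the two comparisons per level are visible. The function names are mine.

```python
def sift_down(a, i):
    """Restore the min-heap property of a, working top-down from index i.

    Each level costs two comparisons: one to pick the smaller child
    and one against the parent, as noted above.
    """
    n = len(a)
    while True:
        l, r = 2 * i + 1, 2 * i + 2
        smallest = i
        if l < n and a[l] < a[smallest]:
            smallest = l
        if r < n and a[r] < a[smallest]:
            smallest = r
        if smallest == i:                 # heap property holds: may stop early
            return
        a[i], a[smallest] = a[smallest], a[i]
        i = smallest

def heap_top_n(nums, n):
    """Return the n largest numbers from an iterable (works on a stream)."""
    it = iter(nums)
    a = [next(it) for _ in range(n)]      # the first n numbers
    for i in range(n // 2 - 1, -1, -1):   # build a min-heap bottom-up
        sift_down(a, i)
    for m in it:
        if m > a[0]:                      # bigger than the n-th largest so far
            a[0] = m                      # replace the heap top A[0] ...
            sift_down(a, 0)               # ... and re-maintain in O(log n)
    return a
```

For example, heap_top_n([5, 1, 9, 3, 7, 8], 2) leaves [8, 9] in the heap. Note the early exit in sift_down: maintenance may stop partway down, a point that matters in the comparison with the loser tree below.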
Loser tree:
Are there other algorithms? Let me first introduce the loser tree. Some people may not know it well; it comes from classic external sorting, where x sorted files are merged into one ordered sequence. The idea of the loser tree is, again, to reduce the number of comparisons. Briefly: the leaf nodes of a loser tree are the data nodes. They are paired off two by two (if the total number of nodes is not a power of 2, the tree can be built with a shape similar to a complete tree); each internal node records the "loser" of the match between the winners coming up from its left and right subtrees (note that it records the loser, not the winner), while the winner is passed upward for further matches, all the way to the root. If we define the winner of two numbers as the smaller one, then the root records the loser of the final match, i.e., the second smallest number among all the leaves, while the smallest number of all is recorded in a separate variable. Note that an internal node should record not just the loser's value but also which leaf it came from; if the tree is built out of linked nodes, each internal node needs a pointer to that leaf. A useful trick here is to have internal nodes record only the index of the losing leaf and fetch the value indirectly when needed (this is particularly handy in the array implementation of the loser tree, discussed later).

The key operation: after the minimum is output, the leaf it came from is replaced with a new number (or with infinity, which during a merge means that file is exhausted). Then the loser tree is re-maintained, walking upward from the updated leaf: at each internal node along the way, update the recorded "loser" where necessary and pass the winner up. Because the updated leaf occupies the position of the previous minimum, its path up to the root is exactly the previous minimum's path. And although what an internal node records is called the "loser", on this path it is in fact the smallest number of the opposite subtree; so comparing the rising winner with each recorded "loser" is enough to determine the smallest number in the whole subtree. (This is a little convoluted; if it is hard to follow, a data-structures textbook with diagrams makes it much easier to understand.)
Note: one can also build the loser tree directly over all N numbers, but with an array implementation a loser tree cannot be maintained incrementally the way a heap can: when the number of leaves changes, the whole tree must be rebuilt. So, to compare the performance of the heap and the loser tree fairly, the analysis below assumes both are built over the n candidate numbers.
In short, one maintenance of the loser tree costs log n + 1 comparisons. Unlike the heap, the loser tree is maintained bottom-up, and each level requires only a single comparison with the recorded loser; the heap is maintained top-down, and each level needs two comparisons, against the left and right children. From this angle the loser tree looks better than the heap. Note, however, that loser-tree maintenance must always run from leaf to root and cannot stop partway, whereas heap maintenance "may" stop at some intermediate level and go no further. So although each level of the loser tree costs only half the comparisons of a heap level, the heap may visit fewer levels. Which is more efficient on average? I do not know; the analysis is somewhat fiddly. If you are interested, work it out and let's discuss. But at the very least this shows that the heap may not be optimal.
Back to our problem. In a similar way, first build a loser tree over the first n numbers; the winner W is the smallest of the n. Then read a new number M from N and compare it with W: if M is no bigger than W, discard it directly; otherwise replace the value of the leaf where W resides with M, and re-maintain the loser tree. Proceed like this until all of N has been traversed; the n numbers left in the loser tree are the n largest of the N numbers. The time complexity is likewise O(N*log n).
As mentioned above, the advantages of the heap include being a complete tree, being implementable with an array, and the simple index relation between parent and children. In fact, the loser tree can be array-implemented too, which I did not realize for a long time. When I interviewed for an internship at Microsoft, my programming question was file merging; I did it with a loser tree built out of pointers, and a two-hour question took me three, which nearly killed me: maintaining the pointers is just too much trouble. David mentioned yesterday that the loser tree can be implemented with an array, on the same principle as the array implementation of the heap. The only difference is that in a heap every node is a data node, whereas in a loser tree only the leaves are data nodes. In terms of space, therefore, the loser tree needs roughly twice the space of the heap (a complete binary tree has one fewer internal node than leaf nodes).
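Below is a sketch of such an array implementation applied to our problem. I use one common numbering convention (internal nodes in tree[1..n-1], with leaf j conceptually sitting at node n+j, so a node's parent is found by halving its number); the class and function names are my own. Internal nodes store only the losing leaf's index, with values fetched indirectly, as suggested earlier.

```python
class LoserTree:
    """Array-based loser tree over n values; the winner is the smallest."""

    def __init__(self, values):
        self.n = n = len(values)
        self.leaves = list(values)
        self.tree = [0] * n            # losers in tree[1..n-1]; tree[0] unused
        winners = [0] * (2 * n)        # temporary, build-time only
        for k in range(n, 2 * n):      # nodes n..2n-1 are the leaves
            winners[k] = k - n
        for t in range(n - 1, 0, -1):  # bottom-up: record loser, pass winner up
            a, b = winners[2 * t], winners[2 * t + 1]
            if self.leaves[a] <= self.leaves[b]:
                winners[t], self.tree[t] = a, b
            else:
                winners[t], self.tree[t] = b, a
        self.winner = winners[1]       # index of the smallest value

    def replace_winner(self, value):
        """Overwrite the winning leaf and re-maintain: exactly one
        comparison per level, leaf to root, with no early exit."""
        w = self.winner
        self.leaves[w] = value
        t = (self.n + w) >> 1          # parent of leaf node n+w
        while t >= 1:
            if self.leaves[self.tree[t]] < self.leaves[w]:
                self.tree[t], w = w, self.tree[t]   # new loser stays, winner rises
            t >>= 1
        self.winner = w

def loser_tree_top_n(nums, n):
    it = iter(nums)
    lt = LoserTree([next(it) for _ in range(n)])
    for m in it:
        if m > lt.leaves[lt.winner]:   # bigger than the current minimum candidate
            lt.replace_winner(m)
    return lt.leaves
```

For example, loser_tree_top_n([5, 1, 9, 3, 7, 8], 2) returns [8, 9] (in leaf order). Note that replace_winner never exits early, in line with the discussion above.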
Let's dig into the problem a bit more. Are you asleep yet? Haha.
Quicksort-like method:
Everyone is familiar with quicksort. Its main idea: pick a "pivot" element and partition the sequence into two parts, one less than or equal to the pivot, the other greater than or equal to it, then recurse on both parts; the average time complexity is O(N*log N). This suggests an idea: if we could pick a pivot such that after partitioning the "greater" part contains exactly j = n numbers, wouldn't the task of finding the n largest of N numbers be done? Of course the pivot will rarely land that well, but we can analyze it as follows. If j > n, the n largest must all be among those j numbers, and the problem reduces to finding the n largest out of j numbers. Conversely, if j < n, those j numbers all belong to the n largest, and the remaining n - j must be dug out of the other part (all of whose elements are no bigger than the pivot), again recursively.
The average complexity of this algorithm is O(N). How about that, better than the heap's O(N*log n)?! (If n is reasonably large, it certainly is.)
It should be noted that this complexity is the average; in the worst case every partition splits the data 1 : N-2, and the time degrades to O(N^2). But we still have a trump card: there is an algorithm whose worst-case time complexity is O(N). It guarantees that every partition is reasonably balanced, keeping at least 3N/10 - 6 elements on each side. For details see Introduction to Algorithms (2nd edition), Chapter 9, "Medians and Order Statistics". A sketch of the basic randomized version follows.
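Here is a minimal sketch of the quicksort-like selection with random pivots, which achieves the O(N) average; a worst-case O(N) version would choose pivots by the median-of-medians technique instead. The function name and partition details are mine.

```python
import random

def top_n(nums, n):
    """Return the n largest elements of nums (unordered); average O(N)."""
    a = list(nums)
    lo, hi = 0, len(a)          # the remaining candidates live in a[lo:hi]
    k = n                       # how many of the largest we still need
    while k > 0 and hi - lo > k:
        # Lomuto-style partition of a[lo:hi], "greater" part first
        p = random.randrange(lo, hi)
        a[p], a[hi - 1] = a[hi - 1], a[p]
        pivot = a[hi - 1]
        j = lo                  # a[lo:j] will hold elements > pivot
        for i in range(lo, hi - 1):
            if a[i] > pivot:
                a[i], a[j] = a[j], a[i]
                j += 1
        a[j], a[hi - 1] = a[hi - 1], a[j]   # pivot lands at index j
        bigger = j - lo         # count of elements strictly greater than pivot
        if bigger >= k:
            hi = lo + bigger    # the k largest are all in the ">" part
        else:
            k -= bigger + 1     # the ">" part and the pivot are all answers;
            lo = j + 1          # dig the rest out of the "<=" part
    return a[:n]
```

The invariant is that a[:lo] is already confirmed to be in the answer and a[lo:hi] still contains the k largest we are missing; unlike quicksort, each round recurses into only one side, which is where the O(N) average comes from.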
The conclusion is that the heap may not be optimal.
The article is nearly over, but one more question: what if N is huge and lives on disk, so it cannot be loaded into memory at once? The naive method, the heap, and the loser tree introduced above still apply; just note that each round you should read a large chunk of data from disk into a memory buffer, finish processing the whole chunk, and only then read the next, reducing the number of I/O operations and thus improving efficiency. The quicksort-like method is a bit more troublesome: read in batches of, say, m numbers, select the n largest of each batch and cache them; after all the batches of N have been processed, combine the cached n-number groups and run the quicksort-like selection once more to obtain the final n largest. If too many groups pile up along the way, several caches can be merged, keeping only the n largest among them. Since the quicksort-like selection runs in O(N) time, this batch-and-merge scheme may still well beat the heap and the loser tree, though it costs a lot more memory. A sketch follows.
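A minimal sketch of that batch-and-merge scheme, reusing the top_n quickselect sketch from the previous section; the generator-based batching and the cache limit are my own framing of the idea, not a prescribed design.

```python
def top_n_in_batches(batches, n, max_groups=8):
    """Select the n largest from data arriving batch by batch.

    `batches` yields lists of numbers (e.g., successive chunks read
    from disk); top_n is the quickselect-style routine sketched above.
    """
    cached = []                            # one top-n group per batch so far
    for batch in batches:
        cached.append(top_n(batch, n))
        if len(cached) > max_groups:       # too many caches: merge them,
            merged = [x for g in cached for x in g]
            cached = [top_n(merged, n)]    # keeping only their n largest
    merged = [x for g in cached for x in g]
    return top_n(merged, n)
```

Each batch should of course be as large as memory allows, so that disk reads stay few and sequential.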
Conclusion: I have thought about this problem a great deal, but there are still places worth digging into: 1. Which of the algorithms above is actually best? This could be analyzed theoretically or compared experimentally (some people may find that boring). 2. Is there an approximate or probabilistic algorithm for this problem? I am not familiar with that area; if anyone has ideas, let's discuss. If there are errors or omissions in my analysis, please point them out; I am not afraid of embarrassment! Finally, keep in mind that time complexity is not the same as actual running time: an O(log n) algorithm with a large constant factor may well be much slower than an O(n) algorithm with a small one. The concrete values of N and n, as well as the quality of the implementation, all affect real efficiency. I once read a paper whose string-search algorithm was faster than hashing; hard to imagine, isn't it?