Find the smallest k elements
Description: Problem 5: Find the smallest k elements.
Question: Given n integers, output the smallest k of them.
For example, if you enter the 8 numbers 4, 5, 1, 6, 2, 7, 3, and 8, the smallest four numbers are 1, 2, 3, and 4.
Section 1: Various ideas, various choices
0. First, a simple observation: we need the smallest k numbers of a sequence. By conventional thinking this is easy: sort the sequence in ascending order, then output the first k numbers.
1. As for which sorting algorithm to choose, quick sort probably comes to mind first. We know that quick sort takes O(n*logn) time on average; traversing and outputting the first k elements of the sorted sequence then gives a total time complexity of O(n*logn + k) = O(n*logn).
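As a minimal sketch of this idea (the function name and the use of std::sort are my own choices for illustration, not from the original article):

```cpp
#include <algorithm>
#include <vector>

// Idea 1: sort everything, then keep only the first k elements.
// Average time: O(n*logn) for the sort plus O(k) for the copy.
std::vector<int> smallestKBySort(std::vector<int> a, int k) {
    std::sort(a.begin(), a.end());                  // O(n*logn)
    if (k >= 0 && (int)a.size() > k) a.resize(k);   // keep the first k numbers
    return a;
}
```

For instance, smallestKBySort({4, 5, 1, 6, 2, 7, 3, 8}, 4) would return {1, 2, 3, 4}.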
2. Let's think a little further. The problem does not require the k smallest numbers to be ordered, nor does it require the remaining n-k numbers to be ordered. Given that, why sort all n numbers at all?
At this point we think of selection or exchange sort: traverse the n numbers, first saving the first k of them into an array of size k, and use selection or exchange sort to find the maximum of those k numbers, kmax (kmax denotes the largest element of the k-element array); this takes O(k) (as you know, one search pass of insertion or selection sort takes O(k) time). Then traverse the remaining n-k numbers, comparing each number x with kmax. If x < kmax, x replaces kmax, and we search the k-element array again for its new maximum kmax (thanks to kk791159796 for the correction); if x >= kmax, the array is not updated. In this way each step costs O(k) when the array is updated and O(0) when it is not, so the total time complexity works out to n * O(k) = O(n*k).
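A sketch of idea 2, assuming the input has at least k elements (the function and variable names are mine):

```cpp
#include <vector>

// Idea 2: keep an array of k candidates; scan the rest of the input and
// replace the current maximum candidate whenever a smaller number appears.
std::vector<int> smallestKByArray(const std::vector<int>& a, int k) {
    std::vector<int> best(a.begin(), a.begin() + k);   // the first k numbers
    // Find kmax, the largest of the k candidates: O(k).
    int maxPos = 0;
    for (int j = 1; j < k; ++j)
        if (best[j] > best[maxPos]) maxPos = j;
    for (int i = k; i < (int)a.size(); ++i) {
        if (a[i] < best[maxPos]) {          // x < kmax: replace it and re-scan, O(k)
            best[maxPos] = a[i];
            maxPos = 0;
            for (int j = 1; j < k; ++j)
                if (best[j] > best[maxPos]) maxPos = j;
        }                                   // x >= kmax: nothing to do, O(0)
    }
    return best;                            // overall O(n*k)
}
```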
3. Of course, a better solution is to maintain a max-heap of k elements. The principle is the same as in solution 2 above: a max-heap of capacity k stores the first k numbers traversed, assumed for now to be the smallest k numbers, with k1 <= k2 <= ... <= kmax (kmax denotes the largest element in the max-heap). Continue traversing the sequence; for each element x, compare it with the heap-top element kmax: if x < kmax, update the heap (which costs logk); otherwise leave the heap unchanged. The total time is O(k + (n-k)*logk) = O(n*logk). This method benefits from the fact that finding and replacing the maximum in a heap costs only logk (otherwise, as in idea 2 above, using a plain array also finds the smallest k elements, but in O(n*k)).
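A sketch of idea 3 using std::priority_queue, which is a max-heap by default, so its top element plays the role of kmax (the helper name is mine):

```cpp
#include <queue>
#include <vector>

// Idea 3: a max-heap of capacity k holds the smallest k numbers seen so far.
std::vector<int> smallestKByMaxHeap(const std::vector<int>& a, int k) {
    std::priority_queue<int> heap;          // heap.top() is kmax
    for (int x : a) {
        if ((int)heap.size() < k) {
            heap.push(x);                   // store the first k numbers
        } else if (x < heap.top()) {        // x < kmax: update the heap, O(logk)
            heap.pop();
            heap.push(x);
        }                                   // otherwise leave the heap unchanged
    }
    std::vector<int> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;                          // total O(n*logk); result is in descending order
}
```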
4. According to solution 2 on page 141 of The Beauty of Programming, and similar to the partition step of quick sort, store the N numbers in an array S, then pick a number X from the array as the pivot (randomly selecting the pivot achieves linear expected time O(N), which is discussed in Section 2), and partition the array into two parts Sa and Sb with Sa <= X <= Sb. If k is no larger than the number of elements in Sa, return the k smallest elements of Sa; otherwise, return all of Sa plus the k-|Sa| smallest elements of Sb. Proceeding like this, this SELECT algorithm, which uses a quick-sort-like partition, quickly finds the smallest k elements, and it can achieve O(N) complexity even in the worst case. It is worth mentioning, however, that to guarantee worst-case O(N), the SELECT algorithm chooses the median (of medians) of the array as the pivot element rather than a randomly selected one.
5. RANDOMIZED-SELECT: each time, randomly select an element of the sequence as the pivot, find the k-th smallest element in O(n) expected time, then traverse and output the k elements that are no larger than it. The total time complexity is linear expected time: O(n + k) = O(n) (when k is relatively small).
OK, in Section 2 I will give the complete pseudo code of RANDOMIZED-SELECT(A, p, r, i). Before that, one point needs to be made clear: ordinary quick sort, as we know, always pivots on a fixed position (the first or last element), so each recursive partition can be unbalanced, giving an average time complexity of O(n*logn); RANDOMIZED-SELECT differs from ordinary quick sort in that each recursion picks an element of the current subsequence, anywhere between its first and last positions, uniformly at random as the pivot.
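Ahead of the full pseudo code in Section 2, here is a minimal sketch of the partition-based selection behind ideas 4 and 5, using a random pivot (the function names are mine):

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

// Randomized partition around a randomly chosen pivot.
// Afterwards a[p..q-1] <= a[q] <= a[q+1..r]; returns q.
int randomizedPartition(std::vector<int>& a, int p, int r) {
    std::swap(a[r], a[p + std::rand() % (r - p + 1)]);  // move a random pivot to the end
    int pivot = a[r], i = p - 1;
    for (int j = p; j < r; ++j)
        if (a[j] <= pivot) std::swap(a[++i], a[j]);
    std::swap(a[i + 1], a[r]);
    return i + 1;
}

// Rearranges a[p..r] so that its k smallest elements occupy a[p..p+k-1]
// (in no particular order). Linear expected time.
void smallestKBySelect(std::vector<int>& a, int p, int r, int k) {
    if (k <= 0 || p >= r) return;
    int q = randomizedPartition(a, p, r);
    int leftSize = q - p;                        // size of Sa, excluding the pivot
    if (k <= leftSize)
        smallestKBySelect(a, p, q - 1, k);       // answer lies entirely in Sa
    else if (k > leftSize + 1)
        smallestKBySelect(a, q + 1, r, k - leftSize - 1);  // take Sa, the pivot, and part of Sb
    // if k == leftSize + 1, the pivot completes the answer: nothing more to do
}
```

Calling smallestKBySelect(a, 0, n-1, k) leaves the smallest k numbers in a[0..k-1]; one extra O(k) pass outputs them, matching the O(n + k) bound above.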
6. Linear-time sorting, i.e., counting sort. Although its time complexity can reach O(n), it is too restrictive (the values must be integers within a small, known range) and is not commonly used.
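For completeness, a sketch of idea 6 under the assumption that every value lies in [0, maxValue] for some small, known maxValue (names are mine):

```cpp
#include <vector>

// Idea 6: counting sort, usable only when all values are integers in [0, maxValue].
std::vector<int> smallestKByCounting(const std::vector<int>& a, int k, int maxValue) {
    std::vector<int> count(maxValue + 1, 0);
    for (int x : a) ++count[x];                 // tally every value, O(n)
    std::vector<int> result;
    for (int v = 0; v <= maxValue && (int)result.size() < k; ++v)
        for (int c = 0; c < count[v] && (int)result.size() < k; ++c)
            result.push_back(v);                // emit the k smallest values in order
    return result;                              // O(n + maxValue)
}
```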
7. Updated: huaye502 pointed out in a comment on this article: "You can build a min-heap over the array and then take the first k values out of this priority queue. Complexity: O(n) + k*O(logn)." What huaye502 means is building a min-heap over the entire array of n elements; building the heap takes O(n) (Section 6.3 of Chapter 6 of Introduction to Algorithms shows that a min-heap can be built from an unordered array in linear time), and then the smallest k numbers are extracted from the heap. The total time complexity is O(n + k*logn).
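A sketch of idea 7 using the standard heap algorithms; std::make_heap with std::greater gives a min-heap (the function name is mine):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Idea 7: heapify all n elements into a min-heap in O(n),
// then pop the minimum k times at O(logn) each: O(n + k*logn) total.
std::vector<int> smallestKByMinHeap(std::vector<int> a, int k) {
    std::make_heap(a.begin(), a.end(), std::greater<int>());    // build min-heap, O(n)
    std::vector<int> result;
    for (int i = 0; i < k && !a.empty(); ++i) {
        std::pop_heap(a.begin(), a.end(), std::greater<int>()); // minimum moves to the back
        result.push_back(a.back());
        a.pop_back();
    }
    return result;                                              // smallest k, in ascending order
}
```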
Continuing with idea 7 above: is idea 7's O(n + k*logn) smaller than idea 3's O(n*logk)? In other words, does O(n + k*logn) < O(n*logk) hold? For a rough mathematical proof, see the first figure below. We can settle the question this way: treat k as a constant, let n tend to infinity, and evaluate the limit T of (n*logk)/(n + k*logn). If T > 1, then O(n*logk) > O(n + k*logn), i.e., O(n + k*logn) < O(n*logk). Although this runs against our usual intuition, it turns out to be true: the limit is T = logk > 1. That is, the complexity of taking the first k numbers after building a min-heap of n elements is lower than that of finding the smallest k numbers with a max-heap of k elements. However, the crucial point is that building a min-heap of n elements requires O(n) space, whereas building a max-heap of k elements requires only O(k) space. Therefore, we generally choose to build a max-heap of k elements to solve the problem of finding the smallest k numbers.
It can also be roughly proved as gbb21 described: to prove k + n*logk - n - k*logn > 0, it is equivalent to prove (logk - 1)*n - k*logn + k > 0. As n -> +inf (n tends to positive infinity), the terms k*logn and k become negligible compared with n, so the left-hand side behaves like (logk - 1)*n, and the inequality holds as long as logk - 1 > 0. This proves the claim. That is, O(k + n*logk) > O(n + k*logn), i.e., O(n + k*logn) < O(n*logk), consistent with the conclusion above.
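The limit argument can be written out explicitly as follows (this is my own reformulation of the argument above; log denotes the same base used throughout, and k is a constant with logk > 1):

```latex
T = \lim_{n \to \infty} \frac{n\,\log k}{\,n + k\,\log n\,}
  = \lim_{n \to \infty} \frac{\log k}{\,1 + k\,\frac{\log n}{n}\,}
  = \log k > 1 \qquad (\text{whenever } \log k > 1),
```

so, asymptotically, O(n + k*logn) grows more slowly than O(n*logk) for any fixed k with logk > 1.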
As a matter of fact, the actual running times of the max-heap and min-heap approaches do not differ much; they are of the same order of magnitude. I later wrote a program to test this: find the smallest k numbers among one million data items, using both implementations. The first uses the conventional approach, building a max-heap of k elements and finding the smallest k numbers by comparison; the second builds a min-heap of all n elements and then takes the first k numbers out of it. Comparing the two, the running times are nearly identical. The result is shown in the second figure below.
8. @lingyun310: similar to idea 7 above, the difference being that after the min-heap over the original array is built in O(n), the minimum is extracted k times, but each time the element moved to the heap top only needs to sift down at most k steps, which is enough (whereas idea 7 above requires logn per extraction, so k extractions cost k*logn, while idea 8 needs only k^2). The complexity of this method is O(n + k^2). @July: I doubt this O(n + k^2) complexity. As far as I know, such an operation on a heap of n elements costs logn, so lingyun310's method should reasonably have the same complexity as idea 7 above, namely O(n + k*logn), not O(n + k*k). OK, I leave this here for later verification. 06.02.
Updated:
After discussing with several friends, it has been confirmed that idea 8 above, as described by lingyun310, is indeed feasible. Next, let me explain his method in detail.
We know that in a min-heap of n elements, we can first take out the heap-top element to obtain the 1st smallest element, and then move the last element in the heap (a relatively large element) up to the heap top to become the new heap-top element (after the heap-top element is removed, as shown in the first figure below, the last element at the bottom of the heap is sent to the top; as for why the last element, rather than a child of the original top element, is sent up to become the new heap-top element, see the relevant books for the specific reasons).
At this point the heap property has been violated, so the heap must be adjusted. How? By what is usually called sift-down (shiftdown): move the new heap-top element down step by step (since the new top element came from the bottom of the heap, it is relatively large, and in a min-heap large elements naturally sink toward the bottom). How many steps does it need to sink? As lingyun310 says, sinking k times is enough.
After sifting down k times, the heap-top element is the 2nd smallest element we are looking for. Then take out this 2nd smallest element (the heap-top element), move the last element in the heap to the top again, and sift down k-1 times (the number of sift-down steps decreases each round: k-2, k-3, ..., and the algorithm stops when it reaches 0). Repeating this for k-1 more rounds, the heap-top elements taken out are exactly the smallest k numbers we are looking for. Although after the algorithm stops the structure is no longer a min-heap, the k smallest elements already satisfy the requirement of the problem, i.e., the smallest k numbers have been found, and we do not care about the rest.
Here is another example that may be easier to understand. Imagine a bucket containing many bubbles; the general trend is that the bubbles grow larger from top to bottom, though not strictly at every step (which matches the min-heap property). OK, now take out the top bubble, which must be the smallest of all the bubbles in the bucket; then move the bubble at the very bottom (not necessarily the largest one) to the top. This violates the general top-to-bottom increasing trend, so how far should this large bubble sink? Sink it k times; after sinking k times, the top bubble is once again the smallest of the remaining bubbles. Then move the bottom bubble to the top again and let it sink k-1 times, and so on, round after round, until the smallest k numbers have all been obtained.
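A sketch of idea 8 as described above, transcribed directly into code (the function names are mine; whether the bounded sift-down is sufficient rests entirely on lingyun310's argument discussed above):

```cpp
#include <utility>
#include <vector>

// Bounded sift-down: restore the min-heap property below index i,
// but descend at most maxSteps levels (the bound idea 8 relies on).
static void siftDownLimited(std::vector<int>& h, int size, int i, int maxSteps) {
    while (maxSteps-- > 0) {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && h[l] < h[smallest]) smallest = l;
        if (r < size && h[r] < h[smallest]) smallest = r;
        if (smallest == i) break;               // heap property already holds here
        std::swap(h[i], h[smallest]);
        i = smallest;
    }
}

// Idea 8: heapify all n elements in O(n), then extract the minimum k times,
// sifting the replacement element down at most k, k-1, ... levels.
std::vector<int> smallestKByBoundedSift(std::vector<int> h, int k) {
    int size = (int)h.size();
    for (int i = size / 2 - 1; i >= 0; --i)
        siftDownLimited(h, size, i, size);      // unbounded sift while building the heap
    std::vector<int> result;
    for (int i = 0; i < k && size > 0; ++i) {
        result.push_back(h[0]);                 // current minimum
        h[0] = h[--size];                       // move the last element to the top
        siftDownLimited(h, size, 0, k - i);     // sink at most k - i levels
    }
    return result;
}
```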