[Repost] A Summary of Methods for Finding the Largest K Numbers

Source: Internet
Author: User
Tags: ranges, vmin

Http://www.cnblogs.com/zhjp11/archive/2010/02/26/1674227.html

Today, while reading about algorithm analysis, I came across the problem of finding the largest K numbers in a pile of data.

The problem statement: given a group of N numbers, find the largest K of them. This is a selection problem. Of course, there are many ways to solve it; I searched online, and the following methods are recommended.

The so-called "Kth (top K) largest number problem" refers to finding, in an unordered array S of length n (n >= K), the Kth largest number (or the top K numbers in descending order).

Solution 1: Sort the unordered array in descending order, then take the first K elements. Total time complexity: O(n*logn + k).
Solution 2: Use selection sort or exchange sort; after K selection passes, the Kth largest number (and with it the largest K numbers) is obtained. Total time complexity: O(n*k).
Solution 3: Use the quicksort idea: randomly pick an element x from array S and partition the array into two parts, Sa and Sb, where the elements of Sa are greater than or equal to x and the elements of Sb are less than x. There are two cases:
1. If the number of elements in Sa is smaller than K, recursively find the largest K - |Sa| numbers in Sb;
2. If the number of elements in Sa is greater than or equal to K, recursively find the largest K numbers in Sa. The average time complexity is about O(n).
Solution 4: Binary-search a value x over [Smin, Smax] and count how many elements of the array are greater than x; when exactly k-1 elements are greater than x, x is the Kth largest number. The average time complexity is O(n*logn).
Solution 5: Build a max-heap over the original numbers using the O(4*n) construction, then pop the top K times. Time complexity: O(4*n + k*logn).
Solution 6: Maintain a min-heap of size K. Compare each element of the array with the heap top: if the heap top is larger, do nothing; otherwise, pop the heap top and insert the current element. Time complexity: O(n*logk).
Solution 7: Use a hash table to record how many times each element Si appears, then apply the counting-sort idea: scan linearly from large to small, and the element preceded by k-1 numbers is the Kth largest. Average time complexity: O(n).

Note:
1. In the STL, nth_element can be used to obtain the nth element (as determined by a predicate); it adopts the idea of solution 3. partial_sort can be used to partially sort a range and obtain the first K elements (as determined by a predicate); it adopts the idea of solution 5. A usage sketch follows this note.
2. Finding the median is actually a special case of the Kth largest number problem.
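
To make the STL note concrete, here is a small C++ sketch of my own (not from the original post) that uses std::nth_element and std::partial_sort with std::greater to get the Kth largest element and the top K elements:

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> v = {7, 3, 9, 1, 5, 8, 2, 6, 4};
        const std::size_t k = 3;

        // nth_element (solution 3's idea): afterwards v[k-1] is the
        // Kth largest element, and everything before it is >= it.
        std::nth_element(v.begin(), v.begin() + (k - 1), v.end(),
                         std::greater<int>());
        std::cout << "3rd largest: " << v[k - 1] << "\n";  // prints 7

        // partial_sort (heap-based, solution 5's idea): sorts only
        // the first k positions, here in descending order.
        std::partial_sort(v.begin(), v.begin() + k, v.end(),
                          std::greater<int>());
        for (std::size_t i = 0; i < k; ++i)
            std::cout << v[i] << ' ';                      // 9 8 7
        std::cout << "\n";
    }

std::nth_element runs in linear time on average, which matches the complexity claimed for solution 3.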
Exercises following section 2.5 of The Beauty of Programming:
1. What if we need to find the largest K distinct floating-point numbers among N numbers? For example, the three largest distinct values in the floating-point array (1.5, 1.5, 2.5, 3.5, 3.5, 1.5, 3.5, 3.5, 2.5).
A: The above solutions still apply. Note that comparing floating-point numbers differs from comparing integers, and the way the hash key is computed changes slightly.
2. What if we need the numbers ranked from K to M (0 < K <= M <= N)?
A: If you treat the problem as finding the top M numbers and then taking the M - K + 1 of them ranked K through M, the preceding solutions apply. For top-K-style problems, solution 5 or solution 7 is best, since their overall complexity is low.
3. In a search engine, every web page has an "authority" weight, such as PageRank. Suppose we need to find the K pages with the highest weight, and page weights are constantly being updated. How should the algorithm be changed to support fast incremental updates and return the top K pages at any time?
Hint: heap sort? Update the heap whenever a page's weight changes. Is there a better way?
A: To achieve fast updates, solution 5 applies. Using a mapped (indexed) binary heap, each update operation can be done in O(logn).

4. In practical applications there is also the issue of "accuracy": we may not need the top K elements in the strict sense, and small errors near the boundary are acceptable. When a user enters a query, the relevance of each document D to the query is measured by a weight f(query, D). The search engine must return the K pages with the highest relevance weight. With 10 results per page, a user will not care about the "accuracy" of results beyond the 1,000th page, so a slight error is acceptable; for example, returning the 10,001st most relevant page in place of the 9,999th is fine. How can the algorithm be improved to run faster and more efficiently? Also, the number of web pages may be too large for a single machine to hold; what should be done then?

Hint: merge sort? If every machine returns its own K most relevant documents, then the union of all machines' top-K sets must contain the K most relevant documents of the complete set. Since the boundary need not be exact, suppose each machine returns only its best K' documents: how should K' be chosen so that the K documents we return achieve 90% accuracy (more than 90% of them are truly among the global top K), or so that the worst-ranked of the K documents we return ranks no lower than 110% * K in the whole set?
A: As the hint suggests, each machine can return its K' most relevant documents, and the global K most relevant documents are then obtained with the merge-sort idea. In the best case, the top K documents are spread evenly across all machines, and each machine needs only K' = K/n (where n is the total number of machines); in the worst case, all of the global top K documents sit on a single machine, and K' must be close to K. I think it is better to maintain a heap on each machine and merge by comparing the heap-top elements.

5. Continuing from question 4: for each document D, different keywords q1, q2, ..., qm have relevance weights f(D, q1), f(D, q2), ..., f(D, qm). If we have already obtained the K documents most relevant to keyword qi, and keyword qj is known to be similar to qi, so that each document's weights for the two keywords are close, does that help in finding the K documents most relevant to qj?

A: It certainly helps. When searching for the K documents most relevant to qj, we can first search among the documents already found relevant to qi, and then fall back to searching all documents globally.

 

For a more detailed solution, see http://www.binghe.org/2011/05/find-kth-largest-number-in-disorder-array/

Finding the largest K numbers in an unordered integer array

Write a program that finds the largest K numbers in an array and outputs their positions.

[Solution 1]

First assume the number of elements is small, say a few thousand. Then we can simply sort them; quicksort and heapsort are both good choices, with average time complexity O(N * log2N). Then take the first K elements, which costs O(K). Total time complexity: O(N * log2N) + O(K) = O(N * log2N).

Note that when K = 1 this algorithm is still O(N * log2N), yet obviously the result can be obtained with only N-1 comparisons. The algorithm above sorts the entire array, while the problem only asks for the largest K numbers: it requires neither the first K numbers to be in order nor the remaining N-K numbers to be in order.

How can we avoid sorting the remaining N-K numbers? We need partial-sorting algorithms, and selection sort and exchange sort are good choices: sort just the first K of the N numbers in descending order. The complexity is O(N * K).

Which is better, O(N * log2N) or O(N * K)?
That depends on the size of K, which is something you need to clarify with the interviewer. If K is small (K <= log2N), partial sorting is the better choice.
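
To make the partial-sorting idea concrete, here is a minimal C++ sketch of my own (the function name topKSelect is hypothetical): K passes of selection sort, each moving the next-largest remaining element to the front, so only the first K positions end up sorted.

    #include <utility>
    #include <vector>

    // Partial selection sort: afterwards v[0..k-1] holds the largest
    // k elements in descending order. Complexity O(n * k).
    void topKSelect(std::vector<int>& v, std::size_t k) {
        const std::size_t n = v.size();
        for (std::size_t i = 0; i < k && i < n; ++i) {
            std::size_t maxPos = i;
            for (std::size_t j = i + 1; j < n; ++j)   // find the largest
                if (v[j] > v[maxPos]) maxPos = j;     // remaining element
            std::swap(v[i], v[maxPos]);               // move it to slot i
        }
    }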

In the next solution, we avoid sorting even the first K numbers, for better performance.

[Solution 2]

Recall that each step of quicksort divides the data to be sorted into two groups, one of which contains only numbers larger than those in the other; the same operation is then applied to each group, and so on.

For this problem, suppose the N numbers are stored in array S. We randomly pick an element x from S and partition the array into two parts, Sa and Sb: the elements of Sa are greater than or equal to x, and the elements of Sb are less than x.

There are two possibilities:

1. The number of elements in Sa is less than K. Then the elements of Sa, together with the largest K - |Sa| elements of Sb (|Sa| denotes the number of elements in Sa), are the largest K numbers of array S.

2. The number of elements in Sa is greater than or equal to K. Then we return the largest K elements of Sa.

In this way, the problem keeps decomposing into smaller subproblems, and the average time complexity is O(N * log2K). The pseudocode is as follows:

Kbig(S, k):
    if k <= 0:
        return []                        // return an empty array
    if length(S) <= k:
        return S
    (Sa, Sb) = Partition(S)
    return Kbig(Sa, k).append(Kbig(Sb, k - length(Sa)))

Partition(S):
    Sa = []                              // initialize as empty arrays
    Sb = []
    // Pick a random pivot to avoid degradation on adversarial data;
    // alternatively, shuffle the whole array as preprocessing:
    // Swap(S[1], S[random() % length(S)])
    p = S[1]
    for i in [2 : length(S)]:
        S[i] > p ? Sa.append(S[i]) : Sb.append(S[i])
    // Append p to the smaller group; this avoids an empty partition
    // and keeps the split more even, improving efficiency.
    length(Sa) < length(Sb) ? Sa.append(p) : Sb.append(p)
    return (Sa, Sb)
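
As a runnable counterpart, here is a C++ translation of my own (a sketch that mirrors the pseudocode above; kbig is a hypothetical name):

    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Largest k elements of s (in no particular order), using the
    // partition idea above. Average time complexity O(n * log2 k).
    std::vector<int> kbig(std::vector<int> s, std::size_t k) {
        if (k == 0) return {};
        if (s.size() <= k) return s;

        // Move a random element to the front to serve as the pivot.
        std::swap(s[0], s[std::rand() % s.size()]);
        const int p = s[0];
        std::vector<int> sa, sb;
        for (std::size_t i = 1; i < s.size(); ++i)
            (s[i] > p ? sa : sb).push_back(s[i]);
        // Put the pivot in the smaller group so neither side is empty.
        (sa.size() < sb.size() ? sa : sb).push_back(p);

        if (sa.size() >= k) return kbig(sa, k);          // top k all in sa
        std::vector<int> result = sa;                    // keep all of sa,
        std::vector<int> rest = kbig(sb, k - sa.size()); // plus sb's best
        result.insert(result.end(), rest.begin(), rest.end());
        return result;
    }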

[Solution 3]

Finding the largest K of N numbers is essentially finding the smallest of those K numbers, that is, the Kth largest number. We can use a binary-search strategy to find it. For any given value p, the number of elements not less than p can be counted in O(N) time. If the maximum of the N numbers is Vmax and the minimum is Vmin, then the Kth largest number must lie in the interval [Vmin, Vmax], and we can binary-search this interval for it. The pseudocode is as follows:

while (Vmax - Vmin > delta)
{
    Vmid = Vmin + (Vmax - Vmin) * 0.5;
    if (f(arr, N, Vmid) >= K)
        Vmin = Vmid;
    else
        Vmax = Vmid;
}

In the pseudocode, f(arr, N, Vmid) returns the number of elements of arr[0, ..., N-1] that are greater than or equal to Vmid.

Here delta is chosen to be smaller than the minimum difference between any two unequal elements among the N numbers; if all elements are integers, delta can be set to 0.5. When the loop finishes, the interval [Vmin, Vmax] contains only a single element value (possibly with multiple equal elements).
This element is the Kth largest number.
The time complexity of the whole algorithm is O(N * log2(|Vmax - Vmin| / delta)).
Because delta depends on the minimum gap between any two unequal elements, the time complexity is related to the data distribution; when the data is evenly distributed, the time complexity is O(N * log2(N)).
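
Assembled into runnable C++ (my own sketch; countGE stands in for the f of the pseudocode):

    #include <algorithm>
    #include <vector>

    // Number of elements in arr that are >= v. O(n) per call.
    static std::size_t countGE(const std::vector<double>& arr, double v) {
        std::size_t c = 0;
        for (double x : arr)
            if (x >= v) ++c;
        return c;
    }

    // Kth largest element via binary search over the value range.
    // delta must be smaller than the minimum gap between any two
    // unequal elements (0.5 works when all values are integers).
    double kthLargest(const std::vector<double>& arr, std::size_t k,
                      double delta) {
        double vmin = *std::min_element(arr.begin(), arr.end());
        double vmax = *std::max_element(arr.begin(), arr.end());
        while (vmax - vmin > delta) {
            double vmid = vmin + (vmax - vmin) * 0.5;
            if (countGE(arr, vmid) >= k)
                vmin = vmid;   // at least k elements >= vmid: go higher
            else
                vmax = vmid;   // fewer than k: the answer is below vmid
        }
        // Snap the narrowed interval back to an actual array element;
        // by the choice of delta, all elements left inside it are equal.
        double best = vmin;
        for (double x : arr)
            if (x >= vmin && x <= vmax) best = x;
        return best;
    }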

For integers, the algorithm can be viewed from another angle. Suppose all the integers lie in [0, 2^m - 1], i.e., every integer can be represented with m binary bits (numbered 0, 1, ..., m-1 from low to high). We first examine the highest bit, bit m-1, and split the N integers into two groups according to whether that bit is 1 or 0; that is, into the intervals [0, 2^(m-1) - 1] and [2^(m-1), 2^m - 1]. Integers in the first interval have the bit 0, and those in the second have it 1. Let A be the number of integers whose bit is 1: if A >= K, continue looking for the largest K among the integers whose bit is 1; otherwise, look for the largest K - A among the integers whose bit is 0. Then consider bit m-2, and so on. The idea is essentially the same as the binary search over values above.
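
A sketch of this bit-by-bit partition in C++, assuming non-negative 32-bit integers (my own illustration; kthLargestByBits is a hypothetical name):

    #include <vector>

    // Kth largest among non-negative 32-bit integers, examining bits
    // from the highest down, per the partition idea above. O(32 * n).
    int kthLargestByBits(std::vector<int> nums, std::size_t k) {
        for (int bit = 31; bit >= 0; --bit) {
            std::vector<int> ones, zeros;
            for (int x : nums)
                ((x >> bit) & 1 ? ones : zeros).push_back(x);
            if (ones.size() >= k) {
                nums = ones;            // the answer has this bit set
            } else {
                k -= ones.size();       // skip the whole ones group
                nums = zeros;           // the answer has this bit clear
            }
        }
        return nums.front();            // all remaining values are equal
    }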

Both of the above methods only traverse the whole set, counting how many elements are greater than or equal to some value; no random access is needed. So if the data does not all fit in memory, the file can be traversed instead: after counting, update the search interval, traverse the file again, and write the elements that fall within the new interval to a new file, so that later iterations no longer need to traverse all the elements. Each iteration requires two file traversals, and in the worst case the total number of traversals is 2 * log2(|Vmax - Vmin| / delta). The number of elements shrinks after every interval update, and once all remaining elements fit in memory, file reads and writes are no longer needed.

Incidentally, finding the Kth number among N is a classic problem, and linear-time algorithms for it exist in theory (the median-of-medians selection algorithm, for example). However, the constant factor of the linear algorithm is fairly large, so it sometimes performs poorly in practice.

[Solution 4]

We now have three solutions, but all three share one trait: the data must be accessed multiple times. The next question: what if N is very large, say 10 billion? (This is the follow-up interviewers usually ask.) The data can then no longer be fully loaded into memory (though whether 1 TB of memory will someday be cheap is hard to say), so we want to traverse the data as few times as possible.

Assume N > K. Taking the largest K of the first K numbers is a degenerate case: all K numbers are the largest K so far. What happens when we consider the (K+1)th number X? Let Y be the smallest of the current largest K numbers. If X is smaller than Y, the largest K numbers remain unchanged; if X is larger than Y, then the largest K numbers should drop Y and include X. If an array is used to store the current largest K numbers, then each time a new number X arrives, the array is scanned to find its minimum Y, and either Y is replaced by X or the array is left unchanged. This method costs O(N * K).

Better still, a min-heap of capacity K can store the current largest K numbers. The heap top is the smallest of the largest K numbers. Each time a new number X is considered: if X is smaller than the heap top Y, the heap need not change, because X is smaller than all of the current largest K numbers; if X is larger than the heap top, replace Y with X. After X replaces the heap top Y, X may violate the min-heap property (every node is no smaller than its parent), so the heap must be updated to restore it. The time complexity of this update is O(log2K).

Figure 2-1 shows such a heap, represented by an array H: for each element H[i], the parent node is H[(i-1)/2] and the child nodes are H[2*i+1] and H[2*i+2]. For each new number X considered, the pseudocode of the update operation is as follows:

if (X > H[0])
{
    H[0] = X;
    p = 0;
    while (p < K)
    {
        q = 2 * p + 1;
        if (q >= K)
            break;
        if ((q < K - 1) && (H[q + 1] < H[q]))
            q = q + 1;
        if (H[q] < H[p])
        {
            t = H[p];
            H[p] = H[q];
            H[q] = t;
            p = q;
        }
        else
            break;
    }
}

Thus the algorithm needs to scan all the data only once, with time complexity O(N * log2K). This is in fact part of the heapsort algorithm. As for space, since the data is scanned only once, we need only store a heap of capacity K, and in most cases the heap fits entirely in memory. If K is itself very large, we can first find the largest K' elements, then the (K'+1)th through (2*K')th largest elements, and so on (choosing K' so that a heap of capacity K' fits in memory); this requires scanning all the data ceil(K/K') times.
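
In practice the same single-pass scheme can be written with std::priority_queue; the following is a sketch of my own, not code from the original article:

    #include <functional>
    #include <queue>
    #include <vector>

    // Largest k elements in one pass, using a min-heap of capacity k.
    // Time O(n * log2 k), space O(k).
    std::vector<int> topK(const std::vector<int>& data, std::size_t k) {
        // std::greater<int> turns the (max-by-default) priority_queue
        // into the min-heap the method requires.
        std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
        for (int x : data) {
            if (heap.size() < k) {
                heap.push(x);
            } else if (x > heap.top()) {   // x beats the smallest of the
                heap.pop();                // current largest-k candidates
                heap.push(x);
            }
        }
        std::vector<int> result;           // drained in ascending order
        while (!heap.empty()) {
            result.push_back(heap.top());
            heap.pop();
        }
        return result;
    }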

[Solution 5]

The quicksort-based method above is linear only on average; is there an algorithm with guaranteed linear time? Can counting sort or radix sort be adapted into a more efficient algorithm? The answer is yes, but the applicability of such algorithms is limited.

If all N numbers are positive integers whose range is not too large, we can allocate an array of counters, record the number of occurrences of each integer, and then collect the largest K values by walking from large to small. For example, if all integers lie in [0, MAXN), use an array count[MAXN] to record the occurrences (count[i] is the number of times integer i appears). A single scan builds the count array; the Kth largest element is then found as follows:

for (sumCount = 0, v = MAXN - 1; v >= 0; v--)
{
    sumCount += count[v];
    if (sumCount >= K)
        break;
}
return v;

In the extreme case where the N integers are all distinct, we may even need only a single bit per integer to record whether it is present.
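
A self-contained C++ version of the counting approach (my own sketch; maxn is an assumed exclusive upper bound on the values):

    #include <vector>

    // Kth largest among non-negative integers smaller than maxn, via
    // counting. One O(n) scan plus an O(maxn) walk; O(maxn) space.
    int kthLargestByCounting(const std::vector<int>& data, std::size_t k,
                             int maxn) {
        std::vector<std::size_t> count(maxn, 0);
        for (int x : data)
            ++count[x];                    // one scan builds the counts
        std::size_t sumCount = 0;
        for (int v = maxn - 1; v >= 0; --v) {
            sumCount += count[v];          // elements >= v seen so far
            if (sumCount >= k)
                return v;                  // the kth largest value
        }
        return -1;                         // k exceeds the data size
    }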

