[Reprinted] A Summary of Methods for Finding the K-th Largest Number

This article is reprinted from: http://www.cnblogs.com/zhjp11/archive/2010/02/26/1674227.html

While reading about algorithm analysis today, I came across the problem of finding the K-th largest value in a collection of data.

The problem statement is: given a group of N numbers, determine the K-th largest. This is the selection problem. There are of course many ways to solve it. After searching online, I recommend the methods below.

The so-called "K-th largest (or top K) number problem" means finding the K-th largest number (or the K largest numbers) in an unordered array S of length n (n >= K).

Solution 1: Sort the unordered array in descending order, then take the first K elements. The total time complexity is O(n * log n + K).
Solution 2: Use selection sort or exchange (bubble) sort. After K passes, the K-th largest number is in place. The total time complexity is O(n * K).
Solution 3: Use the idea behind quicksort. Pick a random element X from array S and split the array into two parts, Sa and Sb. Every element in Sa is greater than or equal to X, and every element in Sb is less than X. There are two cases:
1. If Sa has fewer than K elements, the (K - |Sa|)-th largest element of Sb is the K-th largest number;
2. If Sa has K or more elements, return the K-th largest number in Sa. The time complexity is approximately O(n) on average.
Solution 4: Binary search the value range [Smin, Smax] for a value X such that X occurs in the array and exactly K-1 elements of the array are larger than X (with duplicates: fewer than K elements are larger than X and at least K are greater than or equal to X). That X is the K-th largest number. The average time complexity is O(n * log n).
Solution 5: Build a max-heap from the original array using the O(4 * n) method (that is, in linear time), then pop K times. Time complexity: O(4 * n + K * log n).
Solution 6: Maintain a min-heap of size K. Compare each element of the array with the heap top. If the heap top is larger, ignore the element. Otherwise, pop the heap top and insert the current value into the heap. Time complexity: O(n * log K).
Solution 7: Use a hash table to store how many times each element Si occurs in the array, then apply the idea of counting sort: scan linearly from the largest value to the smallest, and the value that has K-1 numbers before it is the K-th largest. The average time complexity is O(n).
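
As an illustration, Solution 3 can be sketched in Python roughly as follows. This is a minimal sketch rather than a tuned implementation. The function name kth_largest is an assumption made here, and the sketch departs from the description in one respect: elements equal to the pivot X are counted separately instead of being placed in Sa, so the loop still makes progress when the array contains many duplicates.

```python
import random

def kth_largest(s, k):
    """Return the k-th largest element of s (k = 1 means the maximum)."""
    if not 1 <= k <= len(s):
        raise ValueError("k must satisfy 1 <= k <= len(s)")
    items = list(s)  # work on a copy so the caller's data is untouched
    while True:
        x = random.choice(items)                      # random pivot X
        greater = [v for v in items if v > x]         # strictly larger than X
        equal_count = sum(1 for v in items if v == x)
        if k <= len(greater):
            # The answer lies among the elements larger than X.
            items = greater
        elif k <= len(greater) + equal_count:
            # X itself is the k-th largest.
            return x
        else:
            # Skip everything >= X and continue in the smaller part (Sb).
            k -= len(greater) + equal_count
            items = [v for v in items if v < x]

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
print(kth_largest(data, 4))
```

Each pass discards part of the array, which is where the average O(n) cost comes from. An unlucky sequence of pivots can still lead to quadratic time.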

Notes:
1. In the STL, nth_element places at position n the element that would be there if the range were sorted (the order is determined by the predicate), which gives the n-th largest number and uses the idea of Solution 3. partial_sort sorts part of a range to obtain the first K numbers (again in the order determined by the predicate) and uses the idea of Solution 5.
2. Finding the median is a special case of the K-th largest number problem.
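
For readers working outside C++, the heap-based approach of Solution 6 and the median remark in note 2 can be sketched with Python's standard heapq module. The function names k_largest and kth_largest_via_heap are assumptions made for this sketch. Treating the median as the (n // 2 + 1)-th largest element yields the lower of the two middle values when n is even, which is only one of several possible conventions.

```python
import heapq
import statistics

def k_largest(values, k):
    """Return the k largest values in descending order, using a size-k min-heap."""
    if k <= 0:
        return []
    heap = []  # heap[0] is the smallest of the k largest values seen so far
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the current heap top: pop the top and insert v.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

def kth_largest_via_heap(values, k):
    """The k-th largest value is the last of the k largest."""
    return k_largest(values, k)[-1]

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(k_largest(data, 3))                   # same result as heapq.nlargest(3, data)
median = kth_largest_via_heap(data, len(data) // 2 + 1)
print(median == statistics.median_low(data))
```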
Exercises from The Beauty of Programming, Section 2.5:
1. What if we need to find the K largest distinct floating-point numbers among N numbers? For example, the three largest distinct floating-point numbers in an array containing 10 floating-point numbers (1.5, 1.5, 2.5, 3.5, 3.5, 1.5, 3.5, 3.5, 2.5).
Answer: All of the solutions above apply. Note that comparing floating-point numbers differs from comparing integers, and the hash key has to be computed slightly differently.
2. What if we need to find the K-th through M-th largest numbers (0 < K <= M <= N)?
Answer: If the problem is viewed as M-K+1 separate K-th largest problems, all of the solutions above apply. But for problems similar to top K, it is best to use Solution 5 or Solution 7, which have a lower overall complexity.
3. In a search engine, every web page on the network has an "authority" weight, such as PageRank. If we need to find the K web pages with the highest weight while the weights are constantly being updated, how should the algorithm be changed to support fast (incremental) updates and return the K highest-weighted pages promptly?
Hint: Heap sort? Update the heap whenever the weight of a web page changes. Is there a better way?
Answer: To support fast updates, use Solution 5. With a mapped binary heap (a binary heap that also keeps a map from each element to its position in the heap), an update can be done in O(log n).
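
One way to read the answer to exercise 3 is sketched below: a binary max-heap that also keeps a dictionary from each page to its index in the heap array, so that a changed weight can be re-sifted from its current position. The class name IndexedMaxHeap and its methods are assumptions made for this sketch. The top_k method walks the heap from the root with a small auxiliary heap so that the main heap is left unchanged.

```python
import heapq

class IndexedMaxHeap:
    """Binary max-heap plus a position map, allowing O(log n) weight updates."""

    def __init__(self):
        self.heap = []  # entries are [weight, page]
        self.pos = {}   # page -> index of its entry in self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i][0] <= self.heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.heap[child][0] > self.heap[largest][0]:
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def update(self, page, weight):
        """Insert a page or change its weight, then restore heap order."""
        if page in self.pos:
            i = self.pos[page]
            self.heap[i][0] = weight
        else:
            i = len(self.heap)
            self.heap.append([weight, page])
            self.pos[page] = i
        self._sift_up(i)
        self._sift_down(self.pos[page])

    def top_k(self, k):
        """Return the k heaviest (page, weight) pairs without modifying the heap."""
        if not self.heap or k <= 0:
            return []
        result = []
        frontier = [(-self.heap[0][0], 0)]  # (negated weight, heap index)
        while frontier and len(result) < k:
            _, i = heapq.heappop(frontier)
            weight, page = self.heap[i]
            result.append((page, weight))
            for child in (2 * i + 1, 2 * i + 2):
                if child < len(self.heap):
                    heapq.heappush(frontier, (-self.heap[child][0], child))
        return result

pages = IndexedMaxHeap()
for name, weight in [("a", 0.3), ("b", 0.9), ("c", 0.5), ("d", 0.1)]:
    pages.update(name, weight)
pages.update("d", 1.0)   # an incremental weight change
print(pages.top_k(2))
```

With this layout each update costs O(log n), and top_k costs O(K * log K) because it only visits heap nodes near the root.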

4. In practical applications there is also the issue of "precision". We may not need the K largest elements in a strict sense, and some error at the boundary is acceptable. When a user enters a query, each document d has a relevance weight f(query, d) with respect to that query. The search engine must return the K web pages with the highest relevance weight. If each results page shows 10 web pages, users will not care about the "precision" of results beyond page 1,000, so a slight error is acceptable. For example, we could return the web page ranked 10,001st in relevance rather than the one ranked 9,999th. In this case, how can the algorithm be improved to make it faster and more efficient? The number of web pages may also be too large for a single machine to hold. What should be done then?

Hint: Merge sort? If every machine returns its K most relevant documents, the union of those results must contain the K most relevant documents of the complete set. Because the boundary does not need to be exact, suppose each machine returns only its best K' documents. How should K' be chosen so that one of the following holds: the 90% * K most relevant documents we return are exactly right; or the K documents finally returned are more than 90% accurate (more than 90% of them really do rank in the top K of the complete set); or the worst-ranked of the K documents finally returned ranks no lower than 110% * K in the complete set?
Answer: As the hint says, each machine can return its K' most relevant documents, and the K most relevant documents are then obtained using the idea of merge sort. In the best case the K most relevant documents are evenly distributed across the machines, and each machine only needs K' = K/n (n is the total number of machines). In the worst case all of the K most relevant documents are on a single machine, and K' must be approximately K. I think it is better to maintain a heap on each machine and then merge sort the elements at the tops of the heaps.
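
The merging step described in this answer might look roughly like the sketch below. It models each machine as an in-memory list of relevance scores, which is an assumption made purely for illustration. The names local_top, merged_top_k, and recall are also hypothetical. heapq.nlargest produces each machine's K' best scores in descending order, and heapq.merge with reverse=True merges those sorted lists lazily.

```python
import heapq
from itertools import islice

def local_top(shard, k_prime):
    """What one machine returns: its k_prime best scores, largest first."""
    return heapq.nlargest(k_prime, shard)

def merged_top_k(shards, k, k_prime):
    """Merge the per-machine results and keep the first k."""
    partials = [local_top(shard, k_prime) for shard in shards]
    merged = heapq.merge(*partials, reverse=True)  # inputs are sorted descending
    return list(islice(merged, k))

def recall(approx, exact):
    """Fraction of the exact top k that the approximate answer contains."""
    return len(set(approx) & set(exact)) / len(exact)

shards = [[9, 1, 4], [8, 7, 6], [5, 3, 2]]   # three machines, toy scores
exact = merged_top_k(shards, k=3, k_prime=3)  # K' = K is always exact
approx = merged_top_k(shards, k=3, k_prime=1) # K' = K/n, exact only if evenly spread
print(exact, approx, recall(approx, exact))
```

In this toy data two of the top three scores sit on the same machine, so K' = K/n misses one of them. This is the situation between the best and worst cases described in the answer.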

5. As in exercise 4, each document d has relevance weights f(d, q1), f(d, q2), ..., f(d, qm) for the different keywords q1, q2, ..., qm. Suppose that for keyword qi we have already found the K most relevant documents, and that keyword qj is known to be similar to qi, so that a document's weights for the two keywords are close. Does this help in finding the K documents most relevant to qj?

Answer: It certainly helps. When searching for the K documents most relevant to keyword qj, first search among the documents that are relevant to keywords similar to qj (such as qi), and then search among all documents globally.
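
One possible reading of this answer, sketched under several assumptions: the documents already known to be relevant to qi are scored against qj first, so that the size-K min-heap starts with a high threshold, and the remaining documents are then scanned globally. The function top_k_with_seed, the score callback, and the integer document ids are all hypothetical. The sketch counts heap modifications to show the effect of seeding, and the final result is the same with or without the seed.

```python
import heapq

def top_k_with_seed(docs, score, k, seed_docs=()):
    """Return (top-k docs by score, number of heap modifications).

    seed_docs are scored first, e.g. the known top-k documents for a
    similar keyword. Doc ids are assumed comparable so ties can be broken.
    """
    heap = []       # min-heap of (score, doc)
    seen = set()
    updates = 0
    for doc in list(seed_docs) + list(docs):
        if doc in seen:
            continue
        seen.add(doc)
        s = score(doc)
        if len(heap) < k:
            heapq.heappush(heap, (s, doc))
            updates += 1
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc))
            updates += 1
    ranked = [doc for _, doc in sorted(heap, reverse=True)]
    return ranked, updates

docs = list(range(100))
score_qj = lambda d: d            # toy relevance to qj (assumption)
top_for_qi = [99, 98, 97, 95, 96] # pretend these were found for qi

print(top_k_with_seed(docs, score_qj, 5))
print(top_k_with_seed(docs, score_qj, 5, seed_docs=top_for_qi))
```

Every document is still scored once in this sketch. The saving comes from fewer heap operations, and a real system with score upper bounds could go further and skip documents entirely.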
