Top K Selector

Source: Internet
Author: User

Reposted from http://cache.baiducontent.com/

The final question of Tango's last written test was a top K algorithm problem, and it explicitly required the run time to be optimized as much as possible. At the time I had no particularly good idea, so I sorted the data and took the first K items. That solution clearly felt suboptimal, so I decided to explore the problem further.

The top K problem is a very common scenario on today's Internet, for example ranking popular keywords in a search engine or listing the best-selling goods on an e-commerce website. Internet data sets are very large, while the result set is generally much smaller than the original data set.

The most straightforward solution to the top K problem is to sort the entire data set and then take the first K items as the result set (this is the solution I came up with during the test). Because the entire data set is sorted, the best time complexity this algorithm can achieve with a comparison sort is O(n * log n).
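A minimal sketch of this sort-then-slice approach, assuming the data is a list of comparable values and that "top K" means the K largest. The function name `top_k_by_sorting` is hypothetical and not from the original article.

```python
def top_k_by_sorting(data, k):
    """Return the k largest values, largest first, by sorting everything.

    Assumption: `data` is an iterable of mutually comparable values.
    Sorting dominates the cost, so this runs in O(n * log n).
    """
    # sorted() builds a new list, so the caller's data is left untouched.
    return sorted(data, reverse=True)[:k]


if __name__ == "__main__":
    print(top_k_by_sorting([5, 1, 9, 3, 7], 2))
```

The whole data set ends up ordered even though only the first K positions are ever read, which is the wasted work the following approaches try to avoid.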

Obviously, since only the top K items need to be retrieved, the order of the remaining data can be ignored. This leads to a second algorithm: loop K times, and in each iteration find the maximum among the data that has not yet been selected. The time complexity of this algorithm is O(k * n). Since the result set is much smaller than the original data set, this is a comparatively faster algorithm.
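The K-pass scan could look roughly like the sketch below. It assumes a list of comparable values and moves each round's maximum to the front of a working copy, much like a selection sort stopped after K rounds. The name `top_k_by_repeated_scan` is hypothetical.

```python
def top_k_by_repeated_scan(data, k):
    """Return the k largest values, largest first, using k linear scans.

    Assumption: `data` is an iterable of mutually comparable values.
    Each of the k rounds scans the unselected part, giving O(k * n).
    """
    items = list(data)                  # work on a copy
    k = max(0, min(k, len(items)))      # clamp k to a valid range
    for i in range(k):
        best = i
        # Scan everything that has not been selected yet.
        for j in range(i + 1, len(items)):
            if items[j] > items[best]:
                best = j
        # Move this round's maximum into position i.
        items[i], items[best] = items[best], items[i]
    return items[:k]


if __name__ == "__main__":
    print(top_k_by_repeated_scan([5, 1, 9, 3, 7], 2))
```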

In the second algorithm, however, every iteration requires a full traversal of the data set to find the maximum, which wastes part of the time. After consulting some references, I found that heap sort remedies this weakness well: once the heap is built, its tree hierarchy holds partial ordering information about the data, so the maximum (or minimum) value is simply the first element of the heap (the root node of the tree). After the root is taken, the first node is swapped with the last node (to preserve the complete binary tree property of the heap), the heap shrinks by one element, and a local adjustment restores the heap ordering. For the top K problem, the whole process consists of initializing the heap, which costs O(n), followed by K rounds of take-swap-adjust, each of which costs O(log n). The time complexity of the algorithm is therefore reduced to O(n + k * log n), which is effectively O(n) when k is much smaller than n.
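The heap procedure described above can be sketched as follows, assuming a list of comparable values stored as an array-based binary max-heap. The names `_sift_down` and `top_k_by_heap` are hypothetical, and this is only one way to arrange the take-swap-adjust steps.

```python
def _sift_down(heap, root, size):
    """Restore the max-heap property below `root` within heap[:size]."""
    while True:
        child = 2 * root + 1                      # left child index
        if child >= size:
            return                                # root is a leaf
        if child + 1 < size and heap[child + 1] > heap[child]:
            child += 1                            # pick the larger child
        if heap[root] >= heap[child]:
            return                                # ordering already holds
        heap[root], heap[child] = heap[child], heap[root]
        root = child


def top_k_by_heap(data, k):
    """Return the k largest values, largest first, in O(n + k * log n)."""
    heap = list(data)                             # work on a copy
    size = len(heap)
    k = max(0, min(k, size))                      # clamp k to a valid range
    # Initialize the heap bottom-up: O(n) in total.
    for root in range(size // 2 - 1, -1, -1):
        _sift_down(heap, root, size)
    result = []
    for _ in range(k):
        result.append(heap[0])                    # take the current maximum
        size -= 1
        heap[0], heap[size] = heap[size], heap[0] # swap root with last node
        _sift_down(heap, 0, size)                 # local adjustment: O(log n)
    return result


if __name__ == "__main__":
    print(top_k_by_heap([5, 1, 9, 3, 7], 2))
```

Python's standard library also provides `heapq.nlargest(k, data)`, which returns the same values but, to my understanding, works differently by keeping a small heap of K candidates while scanning the data once.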

Finally, I tested the scenario of taking the 20 largest integers from a data set of 5 million integers on a local computer. The three algorithms took the following times: algorithm one (quicksort): 570 ms; algorithm two (linear search): 170 ms; algorithm three (heap search): 16 ms.
