Detailed analysis of Top K Algorithms

Source: Internet
Author: User
Document directory
  • Problem description:
  • Problem Analysis:
  • Step 1: Query statistics
  • Step 2: Find the Top 10
  • Conclusion:
Problem description:

This is a Baidu interview question found on the Internet:
A search engine logs every query string that users submit. Each query string is 1 to 255 bytes long. Suppose there are currently 10 million records and the query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct query strings after deduplication. The more often a query string is repeated, the more users searched for it, and the more popular it is. Find the 10 most popular query strings, using no more than 1 GB of memory.

Problem Analysis:

[Analysis]: To find the most popular queries, first count how many times each query appears, then pick the top 10 from those counts. We can therefore design the algorithm in two steps. The algorithms for each step are as follows:

Step 1: Query statistics

Algorithm 1: Direct sorting

The first algorithm that comes to mind is sorting. First sort all the queries in the log, then traverse the sorted queries and count how many times each one appears. But the problem states a clear requirement: memory use cannot exceed 1 GB. There are 10 million records of up to 255 bytes each, which would occupy about 2.55 GB of memory, so sorting everything in memory does not meet the requirement.
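The memory estimate can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes every record takes the maximum length of 255 bytes and ignores any per-object overhead, and the constant names are made up for illustration.

```python
# Rough memory estimate for holding every log record in RAM at once.
# Assumption: each record takes the maximum 255 bytes; runtime overhead
# (object headers, pointers) is ignored, so the real figure is higher.
NUM_RECORDS = 10_000_000
MAX_RECORD_BYTES = 255
MEMORY_LIMIT_BYTES = 10**9  # the 1 GB limit from the problem statement

total_bytes = NUM_RECORDS * MAX_RECORD_BYTES
total_gb = total_bytes / 10**9  # decimal gigabytes

print(f"raw data: about {total_gb:.2f} GB")
print("fits in memory:", total_bytes <= MEMORY_LIMIT_BYTES)
```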

Recall from the data structures course: when the data is too large to fit in memory, we can sort it with external sorting. Here we can use merge sort, because it has a good time complexity of O(n log n).

After sorting, we traverse the sorted query file, count the number of times each query appears, and write the counts to a new file.

Overall, sorting takes O(n log n) and the traversal takes O(n), so the total time complexity of this algorithm is O(n log n).
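The external-sort approach could be sketched roughly as follows. This is an illustration under stated assumptions, not the original author's implementation: the function name `count_with_external_sort` and the `chunk_size` parameter are hypothetical, queries are assumed to contain no newline characters, and the number of sorted runs is assumed small enough to keep all run files open at once.

```python
import heapq
import itertools
import os
import tempfile


def count_with_external_sort(queries, chunk_size=100_000):
    """Yield (query, count) pairs using sorted runs on disk and a k-way merge.

    `queries` is any iterable of strings without embedded newlines.
    `chunk_size` (a hypothetical tuning knob) caps how many strings are
    held in memory at once.
    """
    run_paths = []
    try:
        # Phase 1: cut the input into chunks, sort each, write it as a run.
        it = iter(queries)
        while True:
            chunk = list(itertools.islice(it, chunk_size))
            if not chunk:
                break
            # Sort the newline-terminated lines so that the order on disk
            # matches the order heapq.merge will compare by.
            lines = sorted(q + "\n" for q in chunk)
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                f.writelines(lines)
            run_paths.append(path)

        # Phase 2: merge the sorted runs; equal queries are now adjacent,
        # so a single traversal is enough to count them.
        files = [open(p, encoding="utf-8") for p in run_paths]
        try:
            merged = heapq.merge(*files)
            for line, group in itertools.groupby(merged):
                yield line.rstrip("\n"), sum(1 for _ in group)
        finally:
            for f in files:
                f.close()
    finally:
        for p in run_paths:
            os.remove(p)
```

For example, `dict(count_with_external_sort(["b", "a", "b"], chunk_size=2))` gives `{"a": 1, "b": 2}`. Every record is written to disk once and read back once, which is the extra I/O cost this approach pays.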

Algorithm 2: Hash Table Method

In the previous method, we used sorting to count the number of times each query appears, with a time complexity of O(N log N). Is there a better way to store the data that also has a lower time complexity?
The problem states that although there are 10 million queries, the high repetition means there are actually only 3 million distinct queries, each at most 255 bytes (about 765 MB of raw string data), so we can consider keeping them all in memory. Now we only need a suitable data structure. A hash table is the natural first choice here, because lookups in a hash table are very fast, almost O(1).
Our algorithm is then: maintain a hash table with the query string as the key and the number of occurrences of that query as the value. Read one query at a time. If the string is not in the table, add it and set its value to 1. If the string is already in the table, add one to its count. In the end, we have processed the massive data set in O(N) time.
Compared with algorithm 1, this method improves the time complexity from O(N log N) to O(N). The gain is not only in time complexity: this method reads the data file only once, whereas algorithm 1 performs a large number of I/O operations. Algorithm 2 is therefore also more practical to engineer than algorithm 1.
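A minimal sketch of the hash table method, using a Python `dict` as the hash table. The function name `count_queries` is illustrative; the input is assumed to be any iterable that yields one query string per log record.

```python
def count_queries(queries):
    """Count occurrences of each query in a single pass over the input."""
    counts = {}  # key: query string, value: number of occurrences
    for query in queries:
        if query in counts:
            counts[query] += 1   # seen before: add one to its count
        else:
            counts[query] = 1    # first occurrence: insert with count 1
    return counts


sample_log = ["weather", "news", "weather", "maps", "weather", "news"]
print(count_queries(sample_log))
```

The standard library's `collections.Counter` does the same counting in one call; the explicit loop is shown only because it mirrors the description step by step.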

Step 2: Find the Top 10

Algorithm 1: Sorting

I won't go into the details of sorting algorithms here. Note only that sorting takes O(N log N) time. In this problem there are 3 million records, which fit within 1 GB of memory, so they can be sorted in memory.
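As a rough sketch, the full-sort approach amounts to sorting every (query, count) pair and keeping the first K. The function name `top_k_by_full_sort` is illustrative, and `counts` is assumed to be a dictionary mapping each query to its number of occurrences.

```python
def top_k_by_full_sort(counts, k=10):
    """Return the k most frequent (query, count) pairs by sorting everything."""
    # Sorting all N distinct queries costs O(N log N), even though only
    # the first k entries of the result are kept.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ordered[:k]


sample_counts = {"a": 5, "b": 1, "c": 9, "d": 3, "e": 7}
print(top_k_by_full_sort(sample_counts, k=3))
```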

Algorithm 2: Partial sorting

The problem only asks for the Top 10, so we do not need to sort all the queries. We only need to maintain an array of size 10: initialize it with the first 10 queries, sorted by count from largest to smallest, then traverse the 3 million records. Compare each record with the last (smallest) query in the array. If its count is smaller, continue the traversal; otherwise, remove the last entry from the array and insert the current query at its sorted position. Once all the data has been traversed, the 10 queries in the array are the Top 10 we are looking for.
It is not difficult to see that the worst-case time complexity of this algorithm is O(N * K), where K is the number of top entries wanted (10 here).
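The partial-sorting idea might look like the sketch below. The function name `top_k_partial_sort` is illustrative, and `counts` is assumed to map each query to its count. Instead of pre-filling the array with the first K queries, this version simply lets the array grow until it holds K entries, which comes to the same thing.

```python
def top_k_partial_sort(counts, k=10):
    """Keep a small array sorted by count (largest first) while scanning."""
    if k <= 0:
        return []
    top = []  # at most k (query, count) pairs, largest count first
    for query, count in counts.items():
        if len(top) == k and count <= top[-1][1]:
            continue  # not larger than the current k-th entry: skip it
        # Sequential comparison to find the insertion position: O(k).
        pos = 0
        while pos < len(top) and top[pos][1] >= count:
            pos += 1
        top.insert(pos, (query, count))
        if len(top) > k:
            top.pop()  # eliminate the entry that fell out of the top k
    return top


sample_counts = {"a": 5, "b": 1, "c": 9, "d": 3, "e": 7}
print(top_k_partial_sort(sample_counts, k=3))
```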

Algorithm 3: Heap

In algorithm 2, we improved the time complexity from O(N log N) to O(N * K), which is a big improvement. But is there a better way?
Analysis: in algorithm 2, each insertion costs O(K), because the element is inserted into a linear list using sequential comparison. Note, however, that the array is ordered, so we can use binary search to find the insertion position, which reduces the search to O(log K). The problem that remains is data movement: inserting into the array still requires shifting the elements after the insertion point. Even so, this is an improvement over algorithm 2.
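A sketch of the binary-search variant, using Python's `bisect` module. The function name `top_k_binary_insert` is illustrative. Storing negated counts is an implementation convenience chosen here so that the ascending order `bisect` expects corresponds to largest count first.

```python
import bisect


def top_k_binary_insert(counts, k=10):
    """Like the sorted-array method, but locate the slot by binary search."""
    if k <= 0:
        return []
    top = []  # (-count, query) tuples in ascending order: largest count first
    for query, count in counts.items():
        if len(top) == k and -count >= top[-1][0]:
            continue  # count is not larger than the current k-th entry
        # insort finds the position in O(log k) comparisons, but the list
        # insertion itself still shifts up to k elements.
        bisect.insort(top, (-count, query))
        if len(top) > k:
            top.pop()
    return [(query, -neg_count) for neg_count, query in top]


sample_counts = {"a": 5, "b": 1, "c": 9, "d": 3, "e": 7}
print(top_k_binary_insert(sample_counts, k=3))
```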
Based on the above analysis, is there a data structure that supports both fast search and fast element movement? The answer is yes: the heap.
With a heap, we can search, adjust, and move elements in logarithmic time. So our algorithm can be improved as follows: maintain a min-heap of size K (10 in this problem), then traverse the 3 million queries, comparing each with the root element...
In this way, the final time complexity is reduced to O(N log K), a great improvement over algorithm 2.
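A minimal sketch of the heap method with Python's `heapq` module. The text does not spell out what happens after the comparison with the root; the rule used here (replace the root when the current count is larger) is the usual one for a min-heap Top K and is an assumption about the intended algorithm. The function name `top_k_heap` is illustrative.

```python
import heapq


def top_k_heap(counts, k=10):
    """Select the k most frequent queries with a min-heap of size k."""
    if k <= 0:
        return []
    heap = []  # (count, query) pairs; heap[0] holds the smallest kept count
    for query, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, query))     # still filling: O(log k)
        elif count > heap[0][0]:
            # Larger than the root: replace the root and re-heapify, O(log k).
            heapq.heapreplace(heap, (count, query))
    # Sort the k survivors so the most frequent query comes first.
    return [(query, count) for count, query in sorted(heap, reverse=True)]


sample_counts = {"a": 5, "b": 1, "c": 9, "d": 3, "e": 7}
print(top_k_heap(sample_counts, k=3))
```

In practice, `heapq.nlargest(k, counts.items(), key=lambda item: item[1])` performs the same selection in one call.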

Conclusion:

This completes our algorithm. Combining the best choices for steps 1 and 2 (the hash table and the heap), the final time complexity is O(N) + O(N' log K), where N is the 10 million records and N' is the 3 million distinct queries. If you have a better algorithm, please share it in the discussion below.
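To get a feel for the difference, the two combinations can be compared with a back-of-the-envelope count of basic operations. This is only a rough sketch: it takes N as the 10 million records and N' as the 3 million distinct queries, uses base-2 logarithms, and ignores constant factors and I/O, so the numbers indicate scale rather than running time.

```python
import math

N = 10_000_000          # log records processed in step 1
N_DISTINCT = 3_000_000  # distinct queries processed in step 2 (N')
K = 10

# Sorting in both steps: N log N to count, then N' log N' to rank.
sort_based = N * math.log2(N) + N_DISTINCT * math.log2(N_DISTINCT)

# Hash table to count, then a size-K heap to select: N + N' log K.
hash_and_heap = N + N_DISTINCT * math.log2(K)

print(f"sorting in both steps: ~{sort_based:.2e} operations")
print(f"hash table + heap:     ~{hash_and_heap:.2e} operations")
```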

 

From http://blog.redfox66.com/redfox66/blog/post/2010/09/23/top-k-algoriyhm-analysis.aspx
