Using a Heap to Implement the Top K Algorithm (JS Implementation)

This article explains in detail how to implement the Top K algorithm with a heap in JavaScript. If you are interested, the details are as follows:

Application scenarios:

The search engine records every query string used in searches in log files. Each query string is between 1 and 255 bytes long.
Suppose there are currently 10 million records. These query strings repeat heavily: although the total is 10 million, there are no more than 3 million distinct queries. The more often a query string repeats, the more users issued it, that is, the more popular it is. Count the 10 most popular query strings, using no more than 1 GB of memory.

Required knowledge:
What is a hash table?
A hash table (also known as a hash map) is a data structure that is accessed directly based on a key.

That is, it maps the key to a location in a table and accesses the record stored there, which speeds up lookups. The mapping function is called the hash function, and the array that stores the records is called the hash table.

The hash table approach is actually quite simple: convert the key into an integer with a fixed hash function, take the remainder of that integer divided by the array length, and use the remainder as the array subscript at which the value is stored.
When querying, the hash function is applied to the key again to obtain the array subscript, and the value is read from that slot; this exploits the O(1) positioning of arrays to locate data.
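To make this concrete, here is a minimal sketch of such a hash table in JavaScript (the names SimpleHashTable, hash, set, and get are illustrative, and collisions are handled with simple chaining, which the description above glosses over):

// Minimal hash-table sketch: hash the key to an integer, take the
// remainder of the array length as the subscript, and chain colliding
// entries inside a bucket.
function SimpleHashTable(size) {
  this.buckets = new Array(size);
  this.size = size;
}

// A fixed hash function: fold the character codes into a 32-bit integer.
SimpleHashTable.prototype.hash = function (key) {
  var h = 0;
  for (var i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) | 0;
  }
  return Math.abs(h) % this.size; // remainder => array subscript
};

SimpleHashTable.prototype.set = function (key, value) {
  var idx = this.hash(key);
  var bucket = this.buckets[idx] || (this.buckets[idx] = []);
  for (var i = 0; i < bucket.length; i++) {
    if (bucket[i][0] === key) { bucket[i][1] = value; return; }
  }
  bucket.push([key, value]);
};

SimpleHashTable.prototype.get = function (key) {
  var bucket = this.buckets[this.hash(key)] || [];
  for (var i = 0; i < bucket.length; i++) {
    if (bucket[i][0] === key) { return bucket[i][1]; }
  }
  return undefined;
};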
Problem Analysis:

To count the most popular queries, you must first count the number of times each Query appears, and then find the Top 10 based on the statistical results. Therefore, we can design the algorithm in two steps based on this idea.

That is, there are two steps to solve this problem:

Step 1: Query statistics (count the number of times each Query appears)
Query statistics can be done in either of the following two ways:
1) Direct sorting (for log files this is often done on the command line, e.g. cat file | format key | sort | uniq -c | sort -nr | head -n 10)
The first algorithm that comes to mind is sorting: first sort all the queries in the log, then traverse the sorted queries and count how many times each one appears.

But the question explicitly requires that memory not exceed 1 GB. With 10 million records of up to 255 bytes each, the data can occupy about 2.4 GB of memory, so this condition is not met.

Recall the data structures course: when the data volume is too large to fit in memory, we can use external sorting. Merge sort works well here, with a time complexity of O(N log N).

After sorting, we traverse the sorted query file, count the number of times each query appears, and write the results back to a file.

Putting it together: sorting costs O(N log N) and the traversal costs O(N), so the overall time complexity of this algorithm is O(N + N log N) = O(N log N).
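As a rough sketch of that counting pass, assuming for brevity that the sorted queries fit in an array (in the real scenario they would be streamed from the sorted file), and with countSorted being an illustrative name:

// Count occurrences by scanning sorted queries: equal strings are
// adjacent after sorting, so one linear pass yields (query, count) pairs.
function countSorted(sortedQueries) {
  var counts = [];
  var i = 0;
  while (i < sortedQueries.length) {
    var q = sortedQueries[i], n = 0;
    while (i < sortedQueries.length && sortedQueries[i] === q) { i++; n++; }
    counts.push([q, n]);
  }
  return counts;
}

// e.g. countSorted(["a", "a", "b", "c", "c", "c"]) => [["a", 2], ["b", 1], ["c", 3]]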

2) Hash table (this method is very well suited to counting string occurrences)
In the first method we used sorting to count how many times each query appears, at a cost of O(N log N). Is there a better way to store the data that gives a lower time complexity?

The question states that although there are 10 million queries, there are only 3 million distinct ones, each at most 255 bytes, so they all fit in memory (about 0.75 GB). All we need is a suitable data structure, and a hash table is the natural choice here: its lookups run in roughly O(1) time.

Our algorithm is then:

Maintain a hash table whose key is the query string and whose value is the query's occurrence count. Read one query at a time: if the string is not in the table, add it with a value of 1; if it is already in the table, increment its count by one. In this way we process the massive data set in O(N) time.
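In modern JavaScript this bookkeeping takes only a few lines using the built-in Map as the hash table (countQueries is an illustrative name):

// One O(N) pass: look each query up; insert with count 1 if absent,
// otherwise increment its count.
function countQueries(queries) {
  var table = new Map();
  for (var i = 0; i < queries.length; i++) {
    var q = queries[i];
    table.set(q, (table.get(q) || 0) + 1);
  }
  return table;
}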

Compared with algorithm 1, this method improves the time complexity by an order of magnitude, to O(N), and the gain is not only in time complexity: it needs just a single I/O pass over the data file, whereas algorithm 1 performs a large number of I/O operations. Algorithm 2 is therefore more practical than algorithm 1 in engineering terms.

Step 2: Find the Top 10 (the ten most frequent queries)
Algorithm 1: Normal sorting (we only need the top 10, so sorting everything is redundant)
There is no need to go into the details of sorting algorithms here; just note that a comparison sort costs O(N log N), and with 3 million records the data fits within the 1 GB memory limit.

Algorithm 2: Partial sorting
The question only asks for the Top 10, so we do not need to sort all the queries. We only need to maintain an array of size 10: initialize it with 10 queries sorted by count from largest to smallest, then traverse the 3 million records. Each record read is compared with the last (smallest) query in the array. If its count is smaller, continue traversing; otherwise the last element of the array is evicted and the current query is inserted at the proper position to keep the array ordered. After all the data has been traversed, the 10 queries in the array are the Top 10 we are looking for; a sketch follows after the complexity note below.

The worst-case time complexity of this algorithm is N * K, where K is the number of top elements we want (10 here).
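Here is the promised sketch of algorithm 2, assuming the statistics are available as [query, count] pairs and that there are at least K of them (topKBySortedArray is an illustrative name):

// Keep a size-K array sorted by count from largest to smallest. Each
// record is compared with the smallest kept entry (the last one); if it
// is larger, it evicts that entry and is shifted into place.
function topKBySortedArray(pairs, k) {
  var top = pairs.slice(0, k).sort(function (a, b) { return b[1] - a[1]; });
  for (var i = k; i < pairs.length; i++) {
    if (pairs[i][1] <= top[k - 1][1]) { continue; } // not bigger: keep going
    top[k - 1] = pairs[i];                          // evict the smallest
    for (var j = k - 1; j > 0 && top[j][1] > top[j - 1][1]; j--) {
      var t = top[j]; top[j] = top[j - 1]; top[j - 1] = t; // shift into place
    }
  }
  return top;
}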

Algorithm 3: Heap
In algorithm 2, we have optimized the time complexity from NlogN to N * K. I have to say this is a big improvement, but is there any better way?

Analysis: in algorithm 2, each record may cost O(K) operations, because elements are inserted into a linear table by sequential comparison. But notice that the array is kept ordered, so we can locate the insertion point with binary search instead, reducing the comparison cost to O(log K). The remaining problem is data movement: elements still have to be shifted to make room. Even so, this refinement improves on algorithm 2; a sketch follows below.
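The promised sketch of the binary-search refinement: the insertion point is found in O(log K), but splice still shifts up to K elements, which is exactly the data-movement cost noted above (insertPosition and insertIntoTop are illustrative names):

// Binary search over counts sorted from largest to smallest: returns the
// index where a new count belongs so the order is preserved.
function insertPosition(top, count) {
  var lo = 0, hi = top.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (top[mid][1] >= count) { lo = mid + 1; } else { hi = mid; }
  }
  return lo;
}

// Replace the smallest kept entry: O(log K) search plus O(K) moves.
function insertIntoTop(top, pair) {
  var pos = insertPosition(top, pair[1]);
  top.splice(pos, 0, pair); // shift elements right to make room
  top.pop();                // drop the now-smallest entry
}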

Based on this analysis, is there a data structure that can both search quickly and move elements quickly?

The answer is yes, that is, heap.
With a heap we can find, adjust, and move elements in logarithmic time. So the algorithm becomes: maintain a min-heap of size K (10 in this question), traverse the 3 million queries, and compare each one with the root element.

The idea is the same as in the two algorithms above, but in algorithm 3 the array is replaced by a min-heap, so the cost of locating the target element drops from O(K) to O(log K).
With the heap data structure, the final time complexity of algorithm 3 is N * log K, a substantial improvement over algorithm 2; a sketch follows below.
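Here is the promised sketch of algorithm 3, again assuming [query, count] pairs with at least K of them (siftDown and topKByMinHeap are illustrative names; this is a generic size-K min-heap, separate from the article's own code below):

// Restore the min-heap property by sifting the value at index i down.
function siftDown(heap, i) {
  var n = heap.length;
  while (true) {
    var smallest = i, l = 2 * i + 1, r = 2 * i + 2;
    if (l < n && heap[l][1] < heap[smallest][1]) { smallest = l; }
    if (r < n && heap[r][1] < heap[smallest][1]) { smallest = r; }
    if (smallest === i) { return; }
    var t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t;
    i = smallest;
  }
}

// Keep the K largest counts: the heap root is the smallest kept count,
// so any record larger than the root replaces it in O(log K).
function topKByMinHeap(pairs, k) {
  var heap = pairs.slice(0, k);
  for (var i = (k >> 1) - 1; i >= 0; i--) { siftDown(heap, i); } // heapify
  for (var j = k; j < pairs.length; j++) {
    if (pairs[j][1] > heap[0][1]) {
      heap[0] = pairs[j];
      siftDown(heap, 0);
    }
  }
  return heap.sort(function (a, b) { return b[1] - a[1]; }); // largest first
}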

At this point the algorithm is complete. Step 1 uses a hash table to count the occurrences of each query in O(N); step 2 uses a heap to find the Top 10 in N' * O(log K). The final time complexity is therefore O(N) + N' * O(log K), where N is 10 million and N' is 3 million.

How can we use a heap to implement the Top K algorithm in JS?

1. Use a heapify-style pass to bubble the extreme element to the root of the array. A single bottom-up pass costs O(N).

function top(arr, comp) {
  if (arr.length === 0) { return; }
  // One bottom-up pass: compare each parent with its children
  // (0-based: the children of i are 2i + 1 and 2i + 2) and swap the
  // "winning" child up, so the extreme element ends up at arr[0].
  for (var i = (arr.length >> 1) - 1; i >= 0; i--) {
    var l = 2 * i + 1, r = 2 * i + 2;
    if (l < arr.length && comp(arr[i], arr[l])) { exch(arr, i, l); }
    if (r < arr.length && comp(arr[i], arr[r])) { exch(arr, i, r); }
  }
  return arr[0];
}

function exch(arr, i, j) {
  var t = arr[i];
  arr[i] = arr[j];
  arr[j] = t;
}

2. Call the pass K times, removing the root each time. Since each pass costs O(N), the overall time complexity is O(K * N).

function topK(arr, n, comp) {
  if (!arr || arr.length === 0 || n <= 0 || n > arr.length) {
    return -1;
  }
  var ret = [];
  for (var i = 0; i < n; i++) {
    var max = top(arr, comp); // bubble the current extreme to arr[0]
    ret.push(max);
    arr.splice(0, 1);         // remove it before the next pass
  }
  return ret;
}

3. Test

var ret = topK([16, 22, 91, 0, 51, 44, 23], 3, function (a, b) { return a < b; });
console.log(ret); // [91, 51, 44]

The above is the Top K algorithm implemented with a heap. I hope it helps you in your studies.
