How to find the 10 most frequent query terms among 1 billion queries

Source: Internet
Author: User
1. Problem description

A common problem in large-scale data processing is finding the K items that occur most frequently, or the K largest values, in a massive data set. Such problems are usually called "top K" problems. Examples include finding the 10 most popular query terms in a search engine, or the 10 most downloaded songs in a music library.

2. Current solution

For top K problems, the usual good approach is "divide and conquer + trie/hash + min-heap": first split the data set into several smaller data sets by hashing, then use a trie or a hash table to count the frequency of each query term in every small data set, then use a min-heap to find the K most frequent terms in each small data set, and finally select the overall top K from all the per-set top K results.
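The pipeline above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the whole input is an in-memory list, a `Counter` stands in for the trie/hash table, and the function names `top_k_with_min_heap` and `top_k_divide_and_conquer` are hypothetical.

```python
import heapq
from collections import Counter

def top_k_with_min_heap(counts, k):
    """Return the k largest (count, term) pairs using a min-heap of size k."""
    heap = []  # heap[0] is always the smallest pair currently kept
    for term, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, term))
        elif (count, term) > heap[0]:
            # The new pair beats the weakest kept pair, so swap it in.
            heapq.heapreplace(heap, (count, term))
    return sorted(heap, reverse=True)

def top_k_divide_and_conquer(terms, k, num_partitions):
    # Step 1: hash-partition, so every occurrence of a term lands in one partition.
    partitions = [[] for _ in range(num_partitions)]
    for term in terms:
        partitions[hash(term) % num_partitions].append(term)
    # Step 2: count each partition and keep only its local top k.
    candidates = {}
    for part in partitions:
        for count, term in top_k_with_min_heap(Counter(part), k):
            candidates[term] = count
    # Step 3: the global top k must be among the local winners.
    return top_k_with_min_heap(candidates, k)
```

Because each term is confined to a single partition, its local count is also its global count, which is why merging the local top K lists gives the right answer.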

In practice, the best solution is the one that best fits the actual requirements. There may be enough memory to load all the data and process it in one pass, or the machine may have multiple cores, so that the whole data set can be processed with multiple threads.

This article describes a solution for each of several application scenarios.

3. Solutions

3.1 Single machine + single core + sufficient memory

Assuming an average of 8 bytes per query term, 1 billion query terms need about 10^9 * 8 bytes = 8 GB of memory. If that much memory is available, sort the query terms directly in memory, then scan the sorted sequence once to find the 10 most frequent terms. This method is simple, fast, and practical. Alternatively, first use a HashMap to count the frequency of each term, then pick the 10 terms with the highest counts.
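Both in-memory variants are short in Python. The sketch below assumes the terms are already loaded into a list, and the function names are hypothetical. Note that the 8 GB estimate covers only the raw bytes; Python string objects carry extra per-object overhead.

```python
import heapq
from collections import Counter
from itertools import groupby

def top_k_by_sorting(terms, k=10):
    """Sort so equal terms become adjacent, then count each run in one pass."""
    runs = ((sum(1 for _ in group), term) for term, group in groupby(sorted(terms)))
    return heapq.nlargest(k, runs)  # list of (count, term), largest first

def top_k_by_hashmap(terms, k=10):
    """Count with a hash map (Counter is a dict subclass), then select the top k."""
    return Counter(terms).most_common(k)  # list of (term, count), largest first
```

Sorting costs O(n log n) comparisons, while the hash map variant counts in a single pass and only needs memory proportional to the number of distinct terms.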

3.2 Single machine + multiple cores + sufficient memory

In this case, hash the data in memory into n partitions and hand each partition to one thread. Each thread processes its partition as in section 3.1, and a final thread merges the results.

This approach has a bottleneck that can significantly hurt efficiency: data skew. Threads may finish at different speeds, fast threads have to wait for slow ones, and the overall speed is determined by the slowest thread. The workaround is to divide the data into c*n partitions (c > 1). Each thread takes another partition as soon as it finishes its current one, until all data has been processed, and a single thread then merges the results.
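A minimal sketch of the c*n scheme, assuming Python's standard `concurrent.futures` thread pool: the pool's internal work queue gives the "take the next partition when done" behavior. The names `local_top_k` and `parallel_top_k` are hypothetical. In CPython the global interpreter lock limits the CPU parallelism of threads, so `ProcessPoolExecutor` may be the better fit for real workloads; the structure stays the same.

```python
import heapq
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def local_top_k(partition, k):
    # Same logic as the single-core case: count, then keep the k best.
    return Counter(partition).most_common(k)

def parallel_top_k(terms, k=10, n_threads=4, c=4):
    # Use c*n partitions so a thread that finishes early can pick up more work.
    n_parts = c * n_threads
    partitions = [[] for _ in range(n_parts)]
    for term in terms:
        partitions[hash(term) % n_parts].append(term)
    # Each of the n worker threads pulls the next waiting partition when it is free.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda part: local_top_k(part, k), partitions))
    # Final merge in the calling thread.
    candidates = [pair for result in results for pair in result]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])
```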

3.3 Single machine + single core + limited memory

In this case, the original data file has to be split into small files. For example, use hash(x) % M to split the original file into M small files. If a small file is still larger than memory, keep splitting it by hashing until every small file fits in memory. Each small file can then be loaded into memory and processed in turn with the method from section 3.1.
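The splitting step might look like the sketch below. It assumes one query term per line in a UTF-8 text file, uses `zlib.crc32` as the hash function, and writes hypothetical `bucket_<i>.txt` files; none of these details come from the original article.

```python
import heapq
import os
import zlib
from collections import Counter

def split_by_hash(src_path, out_dir, m):
    """Write every term to one of m bucket files chosen by crc32(term) % m."""
    paths = [os.path.join(out_dir, f"bucket_{i}.txt") for i in range(m)]
    buckets = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                term = line.rstrip("\n")
                if term:
                    i = zlib.crc32(term.encode("utf-8")) % m
                    buckets[i].write(term + "\n")
    finally:
        for f in buckets:
            f.close()
    return paths

def top_k_from_files(paths, k=10):
    """Process one bucket at a time, keeping only a running top k in memory."""
    best = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            counts = Counter(line.rstrip("\n") for line in f)
        best = heapq.nlargest(k, best + counts.most_common(k), key=lambda p: p[1])
    return best
```

If a bucket is still too large, the second split needs a different hash function or modulus, since applying the same hash(x) % M again would send every line of that bucket to the same file.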

3.4 Multiple machines + limited memory

In this case, to make good use of the resources of several machines, distribute the data across them and let each machine process its local data with the strategy from section 3.3. Data can be distributed with a hash + socket scheme.
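The hash half of that scheme can be sketched as a routing function; the socket transport is left out. One assumption worth stating: the hash must give the same value on every machine. Python's built-in `hash()` for strings is randomized per process by default, so this sketch uses an MD5 digest instead. The name `machine_for` is hypothetical.

```python
import hashlib

def machine_for(term, num_machines):
    """Map a term to a machine index that is identical on every sender."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_machines
```

A sender would compute `machine_for(term, num_machines)` for each term and write the term to the socket connected to that machine, so all occurrences of a term are counted in one place.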

From a practical point of view, the schemes in sections 3.1 to 3.4 are not feasible, because in large-scale data processing the primary concerns are the scalability and fault tolerance of the algorithm, not raw efficiency. The algorithm should scale well: as the data volume grows further (which is inevitable as the business develops), it should achieve near-linear scaling without changes to its framework. The algorithm should also be fault-tolerant: if processing of a file fails, the work should be handed over automatically to another thread, rather than starting over from scratch.

The top K problem is well suited to the MapReduce framework: the user only needs to write one map function and two reduce functions and submit them to Hadoop (with Mapchain and Reducechain). The map function uses hashing to send records with the same hash value to the same reduce task. The first reduce function uses a HashMap to count how often each term occurs. The second reduce function computes the top K over the output of all reduce tasks.
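The data flow of the three functions can be simulated in plain Python. This is not Hadoop code and does not use any Hadoop API; it is a single-process sketch with hypothetical function names, meant only to show what each stage receives and emits.

```python
import heapq
from collections import defaultdict

def map_phase(terms, num_reducers):
    """Map: emit (term, 1) and route it to a reduce task by hash value."""
    shuffled = defaultdict(list)
    for term in terms:
        shuffled[hash(term) % num_reducers].append((term, 1))
    return shuffled

def reduce_count(pairs, k):
    """First reduce: count each term with a hash map, emit the local top k."""
    counts = defaultdict(int)
    for term, one in pairs:
        counts[term] += one
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

def reduce_merge(local_results, k):
    """Second reduce: select the global top k from all first-stage outputs."""
    merged = [kv for result in local_results for kv in result]
    return heapq.nlargest(k, merged, key=lambda kv: kv[1])

def top_k_mapreduce(terms, k=10, num_reducers=4):
    shuffled = map_phase(terms, num_reducers)
    local = [reduce_count(pairs, k) for pairs in shuffled.values()]
    return reduce_merge(local, k)
```

In a real cluster the framework performs the shuffle, runs the first-stage reduce tasks in parallel, and reruns failed tasks, which is where the scalability and fault tolerance come from.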

4. Summary

The top K problem is very common. Companies generally do not write a dedicated program for it; instead they submit the job to their core data processing platform. The platform may be less efficient than a hand-written program, but it offers good scalability and fault tolerance, which matter most to an enterprise.

5. Resources

"Ten massive-data-processing interview questions and a summary of ten methods": http://blog.csdn.net/v_JULY_v/archive/2011/03/26/6279498.aspx

Original article. If you repost it, please credit: reposted from Dong's blog

Link to this article: http://dongxicheng.org/big-data/select-ten-from-billions/
