Massive Data Processing: Hash Partitioning


Problem:

Find the 10 most visited IP addresses in a log file.

Similar variants include:

1. Find the 10 most popular search terms in a search engine's query log;

2. Find the 10 words with the highest frequency in a large file;

3. In a web proxy's access log, find the 10 most visited URLs;

4. Sort a search engine's query records by frequency;

5. Given massive data, find the element with the highest frequency.

 

These problems usually stipulate that the data cannot fit entirely in memory: for example, the data set is 100 GB or 300 GB while only 1 GB of memory is available. The purpose of this constraint is to rule out simply loading everything and counting it in a single pass.

The solution is divide and conquer: the divide step hashes the key (or element) and uses the hash value to split the data into subsets, which are then processed one at a time.

The reason for choosing a hash is that it makes each subproblem solvable on its own in the conquer step: identical elements always hash to the same subset, so a subset's counts are complete.

Take variant 3 above as an example. First hash the data into subsets; the hash function should be chosen so that every subset fits in memory. For URLs, you can apply a cryptographic hash function such as SHA-1 to each URL and take the first K bits of the digest as the subset index. Then load each subset into memory and count frequencies, for example with a (key, count) hash_map, iterating over the subset's elements to find the most frequent elements in that subset.
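As a rough illustration, here is a minimal Python sketch of these two phases. The file naming scheme, the choice of K = 8 partition bits (256 partition files), and SHA-1 are assumptions made for the example, not requirements of the technique.

```python
import hashlib
from collections import Counter

K_BITS = 8                       # assumed: 2**8 = 256 partition files
NUM_PARTS = 1 << K_BITS

def partition_index(url: str) -> int:
    """Take the first K bits of SHA-1(url) as the partition index."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - K_BITS)

def split_into_partitions(log_path: str) -> None:
    """Phase 1: stream the big log once, appending each URL to its partition file."""
    parts = [open(f"part_{i:03d}.txt", "w") for i in range(NUM_PARTS)]
    try:
        with open(log_path) as log:
            for line in log:
                url = line.strip()
                if url:
                    parts[partition_index(url)].write(url + "\n")
    finally:
        for f in parts:
            f.close()

def top10_of_partition(part_path: str):
    """Phase 2: a single partition fits in memory, so count it with a hash map."""
    counts = Counter()
    with open(part_path) as part:
        for line in part:
            counts[line.strip()] += 1
    return counts.most_common(10)    # this partition's top-10 candidates
```

Because identical URLs always land in the same partition, the counts computed inside a partition are the exact global counts for those URLs.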

The overall 10 most frequent elements must then appear among the top 10 of their own subsets. Suppose the data was split into 3,000 subsets: take the top 10 of each subset (3,000 x 10 = 30,000 candidates in total) and filter them through a heap of size 10 (a min-heap keyed on count) to obtain the final top 10, much like a tournament. The reason for keeping the top 10 of every subset, rather than just each subset's single most frequent element, is that all 10 winners may come from the same subset.
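A minimal sketch of this tournament-style merge, assuming each subset's top-10 candidates have already been computed as (url, count) pairs by a step like the one above:

```python
import heapq

def merge_top10(per_subset_top10):
    """per_subset_top10: iterable of lists of (url, count) pairs,
    the top-10 candidates from each subset."""
    heap = []                          # min-heap of (count, url), size <= 10
    for candidates in per_subset_top10:
        for url, count in candidates:
            if len(heap) < 10:
                heapq.heappush(heap, (count, url))
            elif count > heap[0][0]:   # beats the current 10th place
                heapq.heapreplace(heap, (count, url))
    return sorted(heap, reverse=True)  # highest counts first
```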


The above processing is essentially a typical MapReduce model, with hashing serving as the way the input data is split up.


A similar problem:

Given two files A and B, each containing 5 billion URL records, and 4 GB of available memory, how can we find the URLs that appear in both files?

Again, use hashing to divide and conquer. According to a hash value (such as the first few bits of MD5(URL)), split the two files into many small files a1..an and b1..bn. Then, for each pair of small files with the same hash value, ai and bi, read both into memory and process them as needed, for example by building two sets (based on a binary tree, a hash table, or a Bloom filter) and computing their intersection; see the earlier post at http://www.cnblogs.com/qsort/archive/2011/05/06/2039201.html.
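Here is a rough sketch of that scheme, assuming both files are split with the same MD5-based rule so that matching partitions can be intersected pairwise. The file names, the partition count, and the plain in-memory set are illustrative assumptions.

```python
import hashlib

K_BITS = 10                          # assumed: 1024 partitions per file
NUM_PARTS = 1 << K_BITS

def partition_index(url: str) -> int:
    """Use the first K bits of MD5(url) so both files split identically."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - K_BITS)

def intersect_pair(a_path: str, b_path: str, out) -> None:
    """Both partition files fit in memory, so an in-memory set suffices."""
    with open(a_path) as fa:
        urls_in_a = {line.strip() for line in fa}
    with open(b_path) as fb:
        for line in fb:
            url = line.strip()
            if url in urls_in_a:
                out.write(url + "\n")

def find_common_urls() -> None:
    """A URL common to both files gets the same partition index in each,
    so only matching pairs (a_i, b_i) ever need to be compared."""
    with open("common_urls.txt", "w") as out:
        for i in range(NUM_PARTS):
            intersect_pair(f"a_{i:04d}.txt", f"b_{i:04d}.txt", out)
```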

 

Both problems above use hash-based partitioning. The key to a partitioning algorithm is choosing an appropriate partitioning policy; the benefit of hash partitioning is that identical elements are guaranteed to end up in the same subset, which makes it well suited to scenarios such as frequency counting.

