Massive Data Processing: Hash Partitioning


Problem:

Find the 10 most visited IP addresses in a log file.

Similar variants include:

1. Find the 10 most popular search terms in a search engine's query log;

2. Find the 10 words with the highest frequency in a large file;

3. In a web proxy's access log, find the 10 most visited URLs;

4. Sort a search engine's query records by frequency;

5. Given massive data, find the element with the highest frequency.

 

These problems usually stipulate that the data cannot fit entirely in memory: for example, the data set is 100 GB or 300 GB while only 1 GB of memory is available. The purpose of this constraint is to rule out simply loading everything and counting it in a single pass.

The solution is divide and conquer: the divide step hashes the key (or element) and uses the hash value to split the data into subsets, which are then processed one at a time.

The reason for choosing a hash is that it makes each subproblem solvable on its own in the conquer step: identical elements always hash to the same subset, so a subset's counts are complete.

Take variant 3 above as an example. First hash the data into subsets; the hash function should be chosen so that every subset fits in memory. For URLs, you can apply a cryptographic hash function such as SHA-1 to each URL and take the first K bits of the digest as the subset index. Then load each subset into memory and count frequencies, for example with a (key, count) hash_map, iterating over the subset's elements to find the most frequent elements in that subset.
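As a rough illustration, here is a minimal Python sketch of these two phases. The file naming scheme, the choice of K = 8 partition bits (256 partition files), and SHA-1 are assumptions made for the example, not requirements of the technique.

```python
import hashlib
from collections import Counter

K_BITS = 8                       # assumed: 2**8 = 256 partition files
NUM_PARTS = 1 << K_BITS

def partition_index(url: str) -> int:
    """Take the first K bits of SHA-1(url) as the partition index."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - K_BITS)

def split_into_partitions(log_path: str) -> None:
    """Phase 1: stream the big log once, appending each URL to its partition file."""
    parts = [open(f"part_{i:03d}.txt", "w") for i in range(NUM_PARTS)]
    try:
        with open(log_path) as log:
            for line in log:
                url = line.strip()
                if url:
                    parts[partition_index(url)].write(url + "\n")
    finally:
        for f in parts:
            f.close()

def top10_of_partition(part_path: str):
    """Phase 2: a single partition fits in memory, so count it with a hash map."""
    counts = Counter()
    with open(part_path) as part:
        for line in part:
            counts[line.strip()] += 1
    return counts.most_common(10)    # this partition's top-10 candidates
```

Because identical URLs always land in the same partition, the counts computed inside a partition are the exact global counts for those URLs.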

The overall 10 most frequent elements must then appear among the top 10 of their own subsets. Suppose the data was split into 3,000 subsets: take the top 10 of each subset (3,000 x 10 = 30,000 candidates in total) and filter them through a heap of size 10 (a min-heap keyed on count) to obtain the final top 10, much like a tournament. The reason for keeping the top 10 of every subset, rather than just each subset's single most frequent element, is that all 10 winners may come from the same subset.
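A minimal sketch of this tournament-style merge, assuming each subset's top-10 candidates have already been computed as (url, count) pairs by a step like the one above:

```python
import heapq

def merge_top10(per_subset_top10):
    """per_subset_top10: iterable of lists of (url, count) pairs,
    the top-10 candidates from each subset."""
    heap = []                          # min-heap of (count, url), size <= 10
    for candidates in per_subset_top10:
        for url, count in candidates:
            if len(heap) < 10:
                heapq.heappush(heap, (count, url))
            elif count > heap[0][0]:   # beats the current 10th place
                heapq.heapreplace(heap, (count, url))
    return sorted(heap, reverse=True)  # highest counts first
```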


The above processing is essentially a typical MapReduce model, with hashing serving as the way the input data is split up.


A similar problem:

Given two files A and B, each containing 5 billion URL records, and 4 GB of available memory, how can we find the URLs that appear in both files?

Again, use hashing to divide and conquer. According to a hash value (such as the first few bits of MD5(URL)), split the two files into many small files a1..an and b1..bn. Then, for each pair of small files with the same hash value, ai and bi, read both into memory and process them as needed, for example by building two sets (based on a binary tree, a hash table, or a Bloom filter) and computing their intersection; see the earlier post at http://www.cnblogs.com/qsort/archive/2011/05/06/2039201.html.
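Here is a rough sketch of that scheme, assuming both files are split with the same MD5-based rule so that matching partitions can be intersected pairwise. The file names, the partition count, and the plain in-memory set are illustrative assumptions.

```python
import hashlib

K_BITS = 10                          # assumed: 1024 partitions per file
NUM_PARTS = 1 << K_BITS

def partition_index(url: str) -> int:
    """Use the first K bits of MD5(url) so both files split identically."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - K_BITS)

def intersect_pair(a_path: str, b_path: str, out) -> None:
    """Both partition files fit in memory, so an in-memory set suffices."""
    with open(a_path) as fa:
        urls_in_a = {line.strip() for line in fa}
    with open(b_path) as fb:
        for line in fb:
            url = line.strip()
            if url in urls_in_a:
                out.write(url + "\n")

def find_common_urls() -> None:
    """A URL common to both files gets the same partition index in each,
    so only matching pairs (a_i, b_i) ever need to be compared."""
    with open("common_urls.txt", "w") as out:
        for i in range(NUM_PARTS):
            intersect_pair(f"a_{i:04d}.txt", f"b_{i:04d}.txt", out)
```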

 

Both problems above use hash-based partitioning. The key to a partitioning algorithm is choosing an appropriate partitioning policy; the benefit of hash partitioning is that identical elements are guaranteed to end up in the same subset, which makes it well suited to scenarios such as frequency counting.

