Classification of massive data processing methods

Source: Internet
Author: User
Tags: repetition

From a massive data set, find the most popular (highest-frequency) items, or the top 100 items. Typically the data size is hundreds of GB while the memory limit is 1 GB; the computation must be completed under that constraint.

Application Scenarios:
(1) Given massive log data, extract the IP that visited Baidu the most times on a given day;
(2) The search engine logs every search string a user submits. Assume there are currently 10 million records (these query strings are highly repetitive: although the total is 10 million, after removing duplicates there are no more than 3 million). The more often a query string repeats, the more users queried it and the more popular it is. Count the 10 most popular query strings, using no more than 1 GB of memory.
Solution: The main constraint is memory. If memory were sufficient, a simple sort would do; since it is not, we process the data in batches. How should the batches be formed? The condition to satisfy is that log entries for the same IP end up in the same file, so that no two files contain the same entries. For example, split the log into 1000 small files by hash(IP) % 1000, then use a HashMap within each small file for frequency statistics, and finally extract the most frequent entries from each of the 1000 small files and sort them with an ordinary sort.
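A minimal Python sketch of this partition-then-count approach (not the author's original code); the file names and the assumption that the IP is the first whitespace-separated field of each log line are placeholders:

```python
# Sketch: partition a large log by hash(IP) % 1000, then count per small file.
import os
from collections import Counter

NUM_PARTS = 1000  # note: this opens 1000 files at once; raise the fd limit if needed

def partition_by_ip(log_path, out_dir):
    """Write each log line to the partition chosen by hash(ip) % NUM_PARTS."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"part_{i}.txt"), "w") for i in range(NUM_PARTS)]
    with open(log_path) as f:
        for line in f:
            ip = line.split()[0]              # assumption: IP is the first field
            outs[hash(ip) % NUM_PARTS].write(line)
    for o in outs:
        o.close()

def most_frequent_ip(out_dir):
    """Count IPs inside each small file and keep the overall winner."""
    best_ip, best_count = None, 0
    for i in range(NUM_PARTS):
        counts = Counter()
        with open(os.path.join(out_dir, f"part_{i}.txt")) as f:
            for line in f:
                counts[line.split()[0]] += 1
        if counts:
            ip, cnt = counts.most_common(1)[0]
            if cnt > best_count:
                best_ip, best_count = ip, cnt
    return best_ip, best_count
```

Because every occurrence of a given IP lands in the same small file, the per-file counts are exact, and only one small file's counter has to fit in memory at a time.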

There are 10 files of 1 GB each; every line of each file stores a user query, and queries may repeat across files. Sort the queries by their frequency.

Solution: A typical top-k problem. Template: first count frequencies with a HashMap (which also deduplicates), then maintain a heap of size k; that yields the top k.
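The template can be sketched in a few lines of Python; the query file names are assumptions, and heapq.nlargest serves as the size-k heap here:

```python
# Sketch: count query frequencies with a dict, then keep the top k via a heap.
import heapq
from collections import Counter

def top_k_queries(paths, k=10):
    counts = Counter()
    for path in paths:                        # first pass: frequency statistics
        with open(path) as f:
            for line in f:
                counts[line.rstrip("\n")] += 1
    # heapq.nlargest maintains a heap of size k internally: O(n log k)
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

# Example usage (placeholder file names):
# print(top_k_queries([f"query_{i}.txt" for i in range(10)], k=10))
```

If even the deduplicated counter does not fit in memory, combine this with the hash-partitioning step from the previous problem and run the template per partition.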

Given two files a and b, each storing 5 billion URLs, with each URL occupying 64 bytes and a memory limit of 4 GB, how do you find the URLs common to a and b?

Solution idea: You can estimate the size of each file as 5 billion × 64 bytes ≈ 320 GB, far larger than the 4 GB memory limit, so a file cannot be fully loaded into memory. Consider a divide-and-conquer approach.
Traverse file a, compute hash(URL) % 1000 for each URL, and store the URL into one of 1000 small files (a0, a1, ..., a999) according to the result. Each small file is then roughly 300 MB.
Traverse file b and store its URLs into 1000 small files (b0, b1, ..., b999) in the same way. After this processing, any URLs that are identical must lie in corresponding small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding small files cannot share a URL. We then only need to find the common URLs within each of the 1000 pairs of small files.
To find the common URLs of one pair of small files, store the URLs of one small file in a hash set, then traverse each URL of the other small file and check whether it is in that hash set; if so, it is a common URL and can be written to the output file.
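A minimal Python sketch of the two steps (partition, then intersect pair by pair); the file names and the use of Python's built-in hash are illustrative assumptions (in practice a hash that is stable across runs, e.g. from hashlib, would be used):

```python
# Sketch: hash-partition two URL files, then intersect matching partitions.
import os

NUM_PARTS = 1000

def partition(src, prefix, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"{prefix}{i}"), "w") for i in range(NUM_PARTS)]
    with open(src) as f:
        for url in f:
            outs[hash(url) % NUM_PARTS].write(url)
    for o in outs:
        o.close()

def common_urls(out_dir, result_path):
    with open(result_path, "w") as result:
        for i in range(NUM_PARTS):
            with open(os.path.join(out_dir, f"a{i}")) as fa:
                seen = {line.rstrip("\n") for line in fa}   # one small file in memory
            with open(os.path.join(out_dir, f"b{i}")) as fb:
                for url in fb:
                    if url.rstrip("\n") in seen:            # present in both a_i and b_i
                        result.write(url)

# partition("a", "a", "parts"); partition("b", "b", "parts")
# common_urls("parts", "common_urls.txt")
```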

Find the integers that appear only once among 250 million integers; note that memory cannot hold all 250 million integers.

Scenario 1: Use a 2-bit bitmap (allocate 2 bits per number: 00 means not seen, 01 means seen once, 10 means seen more than once, 11 is unused). The total memory is 2^32 × 2 bits = 1 GB, which is acceptable. Scan the 250 million integers and update the corresponding bits: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, output the integers whose bits are 01.
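A minimal Python sketch of the 2-bit bitmap, assuming 32-bit unsigned values; note that the full-size bitmap really does occupy 1 GB, so shrink UNIVERSE when experimenting:

```python
# Sketch: 2 bits per value (00 unseen, 01 seen once, 10 seen more than once).
UNIVERSE = 2 ** 32            # full 32-bit range -> 1 GB bitmap; reduce for tests

def unique_integers(numbers):
    bitmap = bytearray(UNIVERSE // 4)          # 2 bits per value, 4 values per byte
    for n in numbers:
        byte, shift = n // 4, (n % 4) * 2
        state = (bitmap[byte] >> shift) & 0b11
        if state == 0b00:                      # 00 -> 01 (seen once)
            bitmap[byte] |= 0b01 << shift
        elif state == 0b01:                    # 01 -> 10 (seen more than once)
            bitmap[byte] ^= 0b11 << shift
        # state 10 stays 10
    # second pass: report the values whose state is 01
    return [n for n in range(UNIVERSE)
            if (bitmap[n // 4] >> ((n % 4) * 2)) & 0b11 == 0b01]
```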
Scenario 2: Alternatively, as in the first problem, divide the data into small files. Find the non-repeating integers within each small file and sort them, then merge the results, taking care to remove duplicate elements.

Given 4 billion non-repeating unsigned int integers, unsorted, and then a number, how do you quickly determine whether that number is among the 4 billion?

The bitmap method suits this case. Take the largest element max in the collection and create a new bit array of length max + 1, then scan the original array and, for each number encountered, set the corresponding position in the new array to 1; for example, on encountering 5, set the sixth element of the new array to 1. The next time 5 is encountered, the sixth element is found to be 1 already, which shows the value duplicates earlier data. Checking whether a given number is present is then just a matter of testing its bit. This method trades memory for speed.
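A minimal Python sketch of such a bitmap used for membership testing (illustrative, not the original author's code):

```python
# Sketch: one bit per possible value; set the bit on insert, test it on lookup.
class Bitmap:
    def __init__(self, max_value):
        self.bits = bytearray(max_value // 8 + 1)   # one bit per value in [0, max_value]

    def add(self, n):
        self.bits[n // 8] |= 1 << (n % 8)           # mark position n

    def __contains__(self, n):
        return bool(self.bits[n // 8] & (1 << (n % 8)))

# For the full unsigned-int range the bitmap is 2^32 bits = 512 MB.
# bm = Bitmap(2 ** 32 - 1)
# for x in numbers: bm.add(x)
# print(12345 in bm)
```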
Method summary: when memory is not enough, divide and conquer; to decide how to divide, use a hash; to get the top k, just use a heap.

