Massive data processing



1. Given a massive log file, extract the IP address that visited Baidu the most times in one day.

An IPv4 address is 32 bits, so there are at most 2^32 distinct IPs; at 4 bytes each, that is 2^32 * 4 B = 16 GB in total. In general these distinct IPs cannot all fit in memory, so maintaining a single heap over all of them is not feasible.

Idea: split the large file into small files, process each small file separately, and then combine the results.

How do we split the large file into multiple small files of roughly equal size? If the IPs are evenly distributed, we can use hashing: take each IP's hash modulo the number of small files. If the IP distribution is uneven (for example, a large fraction of the IPs are identical), the large file can instead be split into equal-size pieces in order of appearance, possibly combined with a hash. The concrete steps are as follows:

(1) Partition the huge IP log into 1024 small files according to hash(IP) % 1024. Assuming the IPs are evenly distributed, each small file holds at most 2^32 / 1024 = 4M distinct IPs (about 16 MB);

(2) For each small file, build a HashMap with the IP as the key and its occurrence count as the value, while recording the IP that currently has the highest count;

(3) This yields the most frequent IP in each of the 1024 small files; then an ordinary sort (or a single pass) over these 1024 candidates gives the overall most frequent IP.
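
A minimal Python sketch of this partition-and-count approach; the log path, bucket directory, and one-IP-per-line format are assumptions for illustration:

```python
import os
from collections import Counter

NUM_BUCKETS = 1024

def find_top_ip(log_path, bucket_dir):
    """Hash-partition the IPs into 1024 bucket files, count per bucket, keep the global best."""
    os.makedirs(bucket_dir, exist_ok=True)
    # Step 1: scatter, so every occurrence of a given IP lands in the same bucket file.
    # (A real run may need to raise the open-file limit or batch the writes.)
    buckets = [open(os.path.join(bucket_dir, f"ip_{i}"), "w") for i in range(NUM_BUCKETS)]
    with open(log_path) as log:
        for line in log:
            ip = line.strip()
            if ip:
                buckets[hash(ip) % NUM_BUCKETS].write(ip + "\n")
    for f in buckets:
        f.close()
    # Steps 2-3: count inside each bucket, then take the best of the per-bucket winners.
    best_ip, best_count = None, 0
    for i in range(NUM_BUCKETS):
        with open(os.path.join(bucket_dir, f"ip_{i}")) as f:
            counts = Counter(line.strip() for line in f)
        if counts:
            ip, cnt = counts.most_common(1)[0]
            if cnt > best_count:
                best_ip, best_count = ip, cnt
    return best_ip, best_count
```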

2. A search engine records in a log file every query string a user submits; each query string is 1 to 255 bytes long.

Assume there are currently 10 million records. These query strings are highly repetitive: although the total is 10 million, there are no more than 3 million after removing duplicates. The more often a query string repeats, the more users issued it and the more popular it is. Count the 10 hottest query strings, using no more than 1 GB of memory.

1M = 2^20 = 2^10 * 2^10 ≈ 10^3 * 10^3 = 10^6, and 1G ≈ 10^9. So for rough estimates in massive data processing we can take 1M = 10^6 and 1G = 10^9.

10 million * 255 B = 10^7 * 255 B ≈ 2.55 GB, which exceeds the 1 GB memory limit.

(1) Split into small files: partition the large file into 1000 small files according to hash(query string) % 1000. Assuming the query strings distribute evenly, each small file is about 2.5 MB.

(2) For each small file, build a HashMap with the query string as the key and its occurrence count as the value. Once the HashMap is built, use a size-10 heap (a min-heap ordered by count) to find the top-10 query strings of that small file.

(3) Merge the top-10 lists from all the small files and compute the final overall top 10.

In step (2), because each small file is so small, you could also build a heap over all of its entries and heap-sort to extract the top 10 directly.
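
A Python sketch of steps (2) and (3), assuming the 1000 bucket files from step (1) already exist with one query string per line; heapq.nlargest maintains the size-k heap internally:

```python
import heapq
from collections import Counter

def top_queries(bucket_paths, k=10):
    """Per-bucket counting plus a size-k heap, then a merge of the per-bucket winners."""
    candidates = []  # (count, query) pairs: each bucket's top k
    for path in bucket_paths:
        with open(path, encoding="utf-8") as f:
            counts = Counter(line.rstrip("\n") for line in f if line.strip())
        # Size-k heap over this bucket's counts.
        candidates.extend(heapq.nlargest(k, ((c, q) for q, c in counts.items())))
    # Merge the per-bucket winners into the final overall top k.
    return heapq.nlargest(k, candidates)
```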

3. Given two files a and b, each storing 5 billion URLs, with each URL taking 64 bytes and a memory limit of 4 GB, find the URLs common to files a and b.

Each file can be estimated at 5G * 64 B = 320 GB, far larger than the 4 GB memory limit, so neither file can be loaded fully into memory. Consider a divide-and-conquer approach based on hashing.

(1) Traverse file a, compute hash(URL) % 1000 for each URL, and store the URL into one of 1000 small files (a0, a1, ..., a999) according to that value. Each small file is then roughly 320 MB.

(2) Traverse file b and, using the same hash, store its URLs into 1000 small files (denoted b0, b1, ..., b999).

(3) After this partitioning, any URL common to both files must fall into a corresponding pair of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding pairs cannot contain the same URL. So it suffices to find the common URLs within each of the 1000 pairs of small files.

(4) To find the common URLs in each pair of small files, load all URLs of one small file into a hash set, then traverse each URL of the other small file and check whether it is in that hash set; if it is, it is a common URL and can be written to an output file.
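
A small Python sketch of step (4) for one pair of corresponding buckets, assuming the partitioned files from steps (1) and (2) hold one URL per line:

```python
def common_urls(a_bucket, b_bucket, out_path):
    """Hash-set intersection for one pair of corresponding small files (a_i, b_i)."""
    with open(a_bucket) as fa:
        seen = {line.strip() for line in fa if line.strip()}  # URLs from the a-side bucket
    with open(b_bucket) as fb, open(out_path, "w") as out:
        for line in fb:
            url = line.strip()
            if url in seen:            # URL also present in the a-side bucket
                out.write(url + "\n")
```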

4. Find the integers that occur only once among 250 million integers, noting that memory cannot hold all 250 million integers.

Scenario 1: Use a 2-bit bitmap. Allocate 2 bits per possible number: 00 means not seen, 01 means seen once, 10 means seen more than once, and 11 is unused. The total memory required is 2^32 * 2 bits = 1 GB, which is acceptable. Then scan the 250 million integers and update each number's 2-bit entry: 00 becomes 01, 01 becomes 10, and 10 stays 10. After all the data has been scanned, walk the bitmap and output every integer whose entry is 01.
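
A Python sketch of the 2-bit bitmap in Scenario 1; the 1 GB bytearray and the streaming input are assumptions for illustration:

```python
def find_unique(numbers):
    """2-bit-per-value bitmap: 00 = unseen, 01 = seen once, 10 = seen more than once."""
    bitmap = bytearray(2 ** 32 // 4)               # 2 bits per value -> 1 GB of memory
    for n in numbers:                              # n is an unsigned 32-bit integer
        byte_idx, slot = divmod(n, 4)              # 4 two-bit entries per byte
        state = (bitmap[byte_idx] >> (slot * 2)) & 0b11
        if state < 2:                              # 00 -> 01, 01 -> 10, 10 stays 10
            bitmap[byte_idx] = (bitmap[byte_idx] & ~(0b11 << (slot * 2))) | ((state + 1) << (slot * 2))
    # Emit every value whose entry is 01 (seen exactly once).
    for n in range(2 ** 32):
        byte_idx, slot = divmod(n, 4)
        if (bitmap[byte_idx] >> (slot * 2)) & 0b11 == 0b01:
            yield n
```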

Scenario 2: Partition the data into small files (for example by hashing). Within each small file, find the integers that occur only once and sort them. Then merge the per-file results, taking care to remove duplicate elements during the merge.

5. Given 4 billion unordered, non-repeating unsigned int integers, and then one more number, how do you quickly determine whether that number is among the 4 billion?

Scenario 1: Allocate 512 MB of memory and let each bit represent one unsigned int value (2^32 bits = 512 MB). Read in the 4 billion numbers and set the corresponding bits. Then read the query number and check whether its bit is 1: 1 means it is present, 0 means it is not.
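
A minimal Python sketch of this bitmap test; the input stream and query value are placeholders:

```python
def build_bitmap(numbers):
    """512 MB bitmap: one bit per possible unsigned 32-bit value."""
    bitmap = bytearray(2 ** 32 // 8)           # 2^32 bits = 512 MB
    for n in numbers:
        bitmap[n >> 3] |= 1 << (n & 7)         # set the bit for value n
    return bitmap

def contains(bitmap, query):
    """Return True if the query number was among the numbers read in."""
    return bool(bitmap[query >> 3] & (1 << (query & 7)))
```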

Scenario 2: This problem is described well in Programming Pearls:

Since 2^32 is greater than 4 billion, a given number may or may not be among them. Represent each of the 4 billion numbers as a 32-bit binary value, and assume the 4 billion numbers start out in one file.

Then divide the 4 billion numbers into two categories:

(1) The highest bit is 0

(2) The highest bit is 1

Write the two categories to two files, one holding at most 2 billion numbers and the other at least 2 billion (this mirrors a binary search);

Compare the highest bit of the number you are looking for, go into the matching file to continue the search, and divide that file into two categories again:

(1) The second highest bit is 0

(2) The second highest bit is 1

Write the two categories to two files, one holding at most 1 billion numbers and the other at least 1 billion;

Compare with the second highest bit of the number you are looking for and then go to the appropriate file to find it.

.......

And so on; the search takes O(log n) rounds (at most 32 splits, one per bit of a 32-bit number).
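
A Python sketch of the same bit-by-bit narrowing, done on in-memory lists for brevity; the actual approach would write each half to a file at every level instead of keeping it in memory:

```python
def present(numbers, target, bit=31):
    """Narrow the candidate set one bit at a time, from the highest bit down."""
    candidates = list(numbers)
    while candidates and bit >= 0:
        mask = 1 << bit
        # Keep only the half whose current bit matches the target's bit,
        # i.e. "go into the matching file" at this level.
        candidates = [n for n in candidates if (n & mask) == (target & mask)]
        bit -= 1
    # If anything survives all 32 bits, it equals the target.
    return bool(candidates)
```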

6. A 4 GB file stores QQ numbers; pick out the top N most repeated ones.

Split the large file into small files and divide and conquer (using hashing plus a heap).

(1) Partition the large file into 1024 small files according to hash(QQ) % 2^10. Assuming the QQ numbers are evenly distributed, each small file is about 4 MB.

(2) For each small file, build a HashMap with the QQ number as the key and its occurrence count as the value, then use a size-N heap (ordered by count) to find the top N of that file; finally merge the per-file results to obtain the overall top N.

