Basic Algorithms (4): Methods for Processing Massive Data

Source: Internet
Author: User
Tags: repetition

Divide and conquer + hash map

1. Given a massive amount of log data, extract the IP address that visited Baidu the most times on a given day.

The first step is to take the IP addresses from that day's Baidu-access log and write them out to one large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. We can use a mapping method, such as taking each IP modulo 1000, which splits the single large file into 1000 small files. Then, for each small file, find the IP with the highest frequency and record that frequency (hash_map works well for the frequency counting). Finally, among the 1000 per-file winners, pick the IP with the highest frequency; that is the answer.

Or, as described below (by "Eagle of the Snow Field"):
Algorithm idea: divide and conquer + hash

1. There are at most 2^32 (about 4G) distinct IP addresses, so they cannot all be loaded into memory and processed at once;
2. Consider the idea of "divide and conquer": according to the value of hash(IP) % 1024, distribute the huge number of IP log entries into 1024 small files. Each small file then contains at most 4M (about four million) distinct IP addresses;

----Explanation: Because the IPs are assigned to files by a hash of the IP, all occurrences of the same IP address are guaranteed to land in the same file. This rules out the case where an IP is not the most frequent in any single file yet is the most frequent overall, so the global winner must be one of the per-file winners.


3. For each small file, build a hash map with the IP as the key and its number of occurrences as the value, and keep track of the most frequent IP in that file;
4. This yields the most frequent IP of each of the 1024 small files; an ordinary comparison over these 1024 candidates then gives the overall most frequent IP. A sketch of the whole pipeline follows.
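A minimal C++ sketch of this pipeline, under the assumption that the log has already been reduced to one dotted IP string per line in a file named ip.log (the file names and the 1024 fan-out are just the values used above):

    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <unordered_map>

    int main() {
        // Phase 1: partition the big log into 1024 small files by hash(IP) % 1024,
        // so every occurrence of a given IP lands in the same small file.
        // (Opening 1024 files at once may require raising the OS open-file limit.)
        std::ifstream big("ip.log");
        std::ofstream parts[1024];
        for (int i = 0; i < 1024; ++i)
            parts[i].open("part_" + std::to_string(i) + ".txt");
        std::string ip;
        std::hash<std::string> h;
        while (std::getline(big, ip))
            parts[h(ip) % 1024] << ip << '\n';
        for (auto& p : parts) p.close();

        // Phase 2: count frequencies inside each small file with a hash map,
        // keeping only the most frequent IP seen so far across all files.
        std::string bestIp;
        long long bestCount = 0;
        for (int i = 0; i < 1024; ++i) {
            std::ifstream part("part_" + std::to_string(i) + ".txt");
            std::unordered_map<std::string, long long> freq;
            while (std::getline(part, ip)) ++freq[ip];
            for (const auto& kv : freq)
                if (kv.second > bestCount) { bestCount = kv.second; bestIp = kv.first; }
        }
        std::printf("most frequent IP: %s (%lld hits)\n", bestIp.c_str(), bestCount);
        return 0;
    }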

2. A search engine records in its log files every query string that users search for; each query string is 1 to 255 bytes long. Suppose there are currently 10 million records (the query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct ones; the more a query string repeats, the more users searched for it and the more popular it is). Find the 10 most popular query strings, using no more than 1 GB of memory.

To solve this problem, first count how many times each query appears, and then use those statistics to find the top 10. The algorithm therefore falls naturally into the following two steps:

Step one: query statistics

①: Direct sorting
The first algorithm that comes to mind is sorting: sort all the queries in the log, then traverse the sorted queries and count how many times each one appears. But the problem explicitly requires that memory not exceed 1 GB, and 10 million records at 255 bytes each occupy about 2.375 GB, so in-memory sorting does not satisfy the requirement.

When memory is insufficient we can sort externally; merge sort is a good choice here because it has a good time complexity of O(N log N). After sorting, we traverse the already-ordered query file, count the occurrences of each query, and write the counts back to a file.

②: Hash table

Although there are 10 million queries, the repetition is so high that there are really only 3 million distinct queries, each at most 255 bytes, so we can consider keeping them all in memory (3,000,000 * 255 bytes is roughly 765 MB, well under 1 GB). All that is needed is a suitable data structure, and here a hash table is definitely the first choice, because hash table lookups are very fast, with nearly O(1) time complexity. The algorithm: maintain a hash table whose key is the query string and whose value is that query's count. Each time a query is read, if the string is not in the table, add it with a value of 1; if it is already in the table, increment its count. In this way the whole mass of data is processed in O(N) time.
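A minimal sketch of this counting step, assuming the queries arrive one per line on standard input (the input source is an assumption; in the original problem they would come from the log file):

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, long long> count;   // query string -> occurrences
        std::string query;
        while (std::getline(std::cin, query))
            ++count[query];   // creates the entry with 0 on first sight, then increments
        std::cout << count.size() << " distinct queries counted\n";
        return 0;
    }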

Step two: find the top 10 queries

①: Direct sorting finds the largest 10; the complexity of this algorithm is O(N log N), and 3 million entries fit comfortably in 1 GB of memory.

②: Partial sorting. Maintain an array of 10 queries; while traversing the counts, compare each query with the smallest of the 10 (binary search can locate the insertion position) and evict the smallest one whenever a larger count is found. The worst-case time complexity is O(N*K), where K is the number of top elements wanted.

③: Heap. From ② we can see that we need a structure that can find and replace the current minimum quickly while moving as little data as possible; that data structure is the heap. Build a min-heap of size K, compare each element with the root as we traverse, and maintain the heap property along the way. This reduces the time complexity to O(N log K), where K is the number of top elements (10 here).

Animated Demo: http://www.benfrederickson.com/2013/10/10/heap-visualization.html
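A sketch of this heap step, assuming the counts from step one are already in an unordered_map (the helper name topK and the toy data in main are made up for illustration):

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Return the k most frequent queries using a size-k min-heap:
    // the root is always the smallest of the current top k.
    std::vector<std::pair<long long, std::string>>
    topK(const std::unordered_map<std::string, long long>& count, std::size_t k) {
        std::priority_queue<std::pair<long long, std::string>,
                            std::vector<std::pair<long long, std::string>>,
                            std::greater<>> heap;
        for (const auto& kv : count) {
            heap.emplace(kv.second, kv.first);
            if (heap.size() > k) heap.pop();       // evict the smallest, keep k items
        }
        std::vector<std::pair<long long, std::string>> result;
        while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
        return result;                             // ascending by frequency
    }

    int main() {
        std::unordered_map<std::string, long long> count = {
            {"weather", 7}, {"news", 12}, {"maps", 3}, {"mail", 9}};  // toy data
        for (const auto& [c, q] : topK(count, 2))
            std::cout << q << ": " << c << '\n';
        return 0;
    }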

3. There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.

This is similar to the previous problems.

Scenario: Read the file sequentially; for each word x, compute hash(x) % 5000 and, according to that value, write the word to one of 5000 small files (call them x0, x1, ..., x4999). Each file is then about 200 KB. If any file exceeds the 1 MB limit, keep splitting it in the same way until every resulting small file is under 1 MB.
For each small file, count the words it contains and their frequencies (a trie or hash_map can be used), take out the 100 most frequent words (a min-heap of 100 nodes works), and write those 100 words and their frequencies to a file. This yields another 5000 files. The final step is to merge these 5000 files, in a process similar to merge sort (the data in each file is sorted).
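A sketch of the "split again if still too big" step mentioned above. The function name partitionFile, the 1 MB constant, and the file-naming scheme are assumptions; the hash is salted with the recursion level so that a file whose words all collided at the previous level really fans out again:

    #include <filesystem>
    #include <fstream>
    #include <string>
    #include <vector>

    void partitionFile(const std::string& path, int fanout, int level) {
        if (std::filesystem::file_size(path) <= (1u << 20)) return;  // already fits in 1 MB
        std::ifstream in(path);
        std::vector<std::ofstream> out(fanout);
        for (int i = 0; i < fanout; ++i)
            out[i].open(path + "." + std::to_string(i));
        std::hash<std::string> h;
        std::string word;
        while (in >> word) {
            // Salt the hash with the level; with the unsalted hash every word in
            // this file would map back into a single bucket.
            std::size_t bucket = h(std::to_string(level) + ':' + word) % fanout;
            out[bucket] << word << '\n';
        }
        for (int i = 0; i < fanout; ++i) {
            out[i].close();
            partitionFile(path + "." + std::to_string(i), fanout, level + 1);
        }
    }

    int main() {
        // Hypothetical usage: re-split one oversized partition into 16 pieces.
        partitionFile("x42.txt", 16, 1);
        return 0;
    }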

4. There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat across files. Sort the queries by their frequency.

This is again a typical top-K style problem; the solutions are as follows:
Scenario 1:
Read the 10 files sequentially and write each query to one of another 10 files according to hash(query) % 10, so that every copy of a query lands in the same new file. Each newly generated file is about 1 GB (assuming the hash function distributes uniformly).
Find a machine with about 2 GB of memory and, for each new file, use a hash_map(query, query_count) to count the occurrences of each query; then sort by occurrence count with quick sort, heap sort, or merge sort and write the sorted queries and their counts back out. This gives 10 sorted files. Finally, merge the 10 files (an external merge sort).

Scenario 2:
In general the total number of distinct queries is limited; it is only the repetitions that are numerous, so it may well be possible to fit all of the distinct queries into memory at once. In that case a trie or hash_map can be used to count the occurrences of each query directly, and then a quick/heap/merge sort by occurrence count finishes the job, as sketched below.
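A minimal sketch of this scenario, assuming the queries from all 10 files have been concatenated onto standard input, one per line (the tab-separated output format is an assumption):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
        std::unordered_map<std::string, long long> count;
        std::string query;
        while (std::getline(std::cin, query))
            ++count[query];                      // one pass to count every query

        // Copy into a vector and sort by descending frequency.
        std::vector<std::pair<std::string, long long>> byFreq(count.begin(), count.end());
        std::sort(byFreq.begin(), byFreq.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        for (const auto& [q, c] : byFreq)
            std::cout << c << '\t' << q << '\n'; // most frequent queries first
        return 0;
    }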

Scenario 3:
Similar to Scenario 1, but after the hash has split the data into multiple files, those files can be handed to multiple machines and processed with a distributed framework such as MapReduce, with a final merge at the end.

6. Find the integers that appear only once among 250 million integers; note that there is not enough memory to hold all 250 million integers.

Scenario 1: Use a 2-bit bitmap (allocate 2 bits per possible number: 00 means not seen, 01 means seen once, 10 means seen more than once, 11 is unused). The total memory needed is 2^32 * 2 bits = 1 GB, which is acceptable. Scan the 250 million integers and update the corresponding bits: 00 becomes 01, 01 becomes 10, and 10 stays 10. After the scan, walk the bitmap and output every integer whose bits are 01.
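A sketch of this 2-bit bitmap, assuming the 250 million integers arrive one per line on standard input and that the 1 GB table fits in memory as stated:

    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        // Four 2-bit slots per byte -> 2^32 / 4 bytes = 1 GiB of state.
        std::vector<std::uint8_t> bits(1ull << 30, 0);

        auto get = [&](std::uint32_t x) -> unsigned {
            return (bits[x >> 2] >> ((x & 3) * 2)) & 3u;
        };
        auto set = [&](std::uint32_t x, unsigned v) {
            unsigned shift = (x & 3) * 2;
            bits[x >> 2] = static_cast<std::uint8_t>(
                (bits[x >> 2] & ~(3u << shift)) | (v << shift));
        };

        std::uint32_t n;
        while (std::cin >> n) {
            unsigned state = get(n);
            if (state == 0) set(n, 1);        // 00 -> 01: first occurrence
            else if (state == 1) set(n, 2);   // 01 -> 10: now a repeat
        }

        // Numbers still in state 01 occurred exactly once.
        for (std::uint64_t x = 0; x < (1ull << 32); ++x)
            if (get(static_cast<std::uint32_t>(x)) == 1)
                std::cout << x << '\n';
        return 0;
    }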

Scenario 2: A method similar to question 1 can also be used: split the data by hashing into small files, find the non-repeating integers within each small file and sort them, then merge, taking care to remove the duplicate elements.

7. Given 4 billion unsorted, non-repeating unsigned int integers, and then one more number, how do you quickly determine whether that number is among the 4 billion?

The first thought is quick sort + binary search. Here are some better approaches:
Scenario 1: Request 512 MB of memory and let one bit represent each possible unsigned int value. Read the 4 billion numbers and set the corresponding bits; then, for the number to be queried, check whether its bit is 1: 1 means it is present, 0 means it is not.
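A sketch of this bitmap lookup, assuming the 4 billion numbers are streamed from a file called numbers.txt (the file name is made up) and the queries then come from standard input:

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<std::uint8_t> bitmap(1u << 29, 0);   // 2^32 bits / 8 = 512 MB

        std::ifstream in("numbers.txt");
        std::uint32_t x;
        while (in >> x)
            bitmap[x >> 3] |= 1u << (x & 7);             // mark x as present

        std::uint32_t query;
        while (std::cin >> query) {
            bool present = bitmap[query >> 3] & (1u << (query & 7));
            std::cout << query << (present ? " is present\n" : " is absent\n");
        }
        return 0;
    }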

Scenario 2: This problem is described very well in "Programming Pearls"; we can follow that line of thought. Because 2^32 is more than 4 billion, a given number may or may not be present. Represent each of the 4 billion numbers in 32-bit binary and assume they all start in one file. Then split the 4 billion numbers into two classes:
1. The highest bit is 0
2. The highest bit is 1
Write the two classes to two files, one of which contains at most 2 billion numbers and the other at least 2 billion (this halves the problem, like binary search). Compare the highest bit of the number to be found, go to the corresponding file, and split that file again into two classes:
1. The second highest bit is 0
2. The second highest bit is 1

Write these two classes to two files, one of which contains at most 1 billion numbers and the other at least 1 billion (again halving the problem); compare the next-highest bit of the number to be found, go to the corresponding file, and continue searching.
.......
Continuing in this way, the number can be found, and the time complexity is O(log n).

-----Questions 6 and 7 both rely on the bitmap principle.

8. How do you find the item with the most repetitions in massive data?

Scenario 1: First hash the items and map them by modulus into small files, so that equal items land in the same file; find the most-repeated item in each small file and record its count. Then, among the candidates from the previous step, pick the one with the largest count (refer to the earlier questions for details).

9. Given tens of millions or even billions of data items (with duplicates), find the N items that occur most often.

Scenario 1: For tens of millions or billions of data items, the memory of current machines should be able to hold them. So consider counting the occurrences directly with a hash_map, a binary search tree, a red-black tree, or the like. Then extract the N items with the most occurrences, which can be done with the heap mechanism mentioned in problem 2.

10. A text file has about 10,000 lines, one word per line. Count the 10 most frequent words; give the idea and the time-complexity analysis.

Scenario 1: This question is about time efficiency. Count the occurrences of each word with a trie; the time complexity is O(n*le), where le denotes the average length of a word. Then find the 10 most frequent words, which can be done with a heap as in the earlier questions, with time complexity O(n*lg10). The total time complexity is therefore the larger of O(n*le) and O(n*lg10).
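A sketch of the trie-based counting, under the assumption that the words are lowercase ASCII and arrive one per line on standard input (the struct name TrieNode is made up; the top-10 extraction would reuse the heap from problem 2):

    #include <array>
    #include <iostream>
    #include <memory>
    #include <string>

    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child{};
        long long count = 0;                   // occurrences of the word ending at this node
    };

    void insert(TrieNode& root, const std::string& word) {
        TrieNode* node = &root;
        for (char c : word) {                  // one step per letter: O(length of word)
            int i = c - 'a';
            if (!node->child[i]) node->child[i] = std::make_unique<TrieNode>();
            node = node->child[i].get();
        }
        ++node->count;
    }

    int main() {
        TrieNode root;
        std::string word;
        while (std::cin >> word)               // about 10,000 words, one per line
            insert(root, word);
        // A traversal of the trie feeding a 10-element min-heap (as in problem 2)
        // would then produce the ten most frequent words.
        return 0;
    }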

Find the 100 largest numbers among 1,000,000 numbers.

Scenario 1: As mentioned in the previous questions, this can be done with a min-heap of 100 elements. The complexity is O(1,000,000 * lg100).

Scenario 2: Use the idea of quick sort: after each partition, only keep working on the part that is larger than the pivot; once the part larger than the pivot has shrunk to a little more than 100 elements, sort it with a conventional sorting algorithm and take the first 100. The complexity is O(1,000,000 * 100).

Scenario 3: Use partial elimination. Take the first 100 elements and sort them into a sequence L. Then scan the remaining elements one at a time; compare each element x with the smallest element of the sorted 100, and if x is larger, delete the smallest element and insert x into L using the idea of insertion sort. Repeat until all the elements have been scanned. The complexity is O(1,000,000 * 100).
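A sketch of Scenario 2's partition idea using std::nth_element, which places the 100 largest values at the front in expected linear time (the random test data here is made up purely for illustration):

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        std::vector<int> data(1000000);
        std::mt19937 gen(42);                          // fixed seed: made-up test data
        std::uniform_int_distribution<int> dist(0, 1 << 30);
        for (int& x : data) x = dist(gen);

        // Move the 100 largest values to the front (unordered within that block).
        std::nth_element(data.begin(), data.begin() + 100, data.end(), std::greater<>());
        std::sort(data.begin(), data.begin() + 100, std::greater<>());  // order the top 100

        for (int i = 0; i < 5; ++i)
            std::cout << data[i] << '\n';              // print a few of the largest
        return 0;
    }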

Dealing with massive data problems is nothing more than:

    1. Divide and conquer / hash partitioning + hash-based counting + heap/quick/merge sort;
    2. Double-layer bucket partitioning;
    3. Bloom filter / bitmap;
    4. Trie / database / inverted index;
    5. External sorting;
    6. Distributed processing with Hadoop/MapReduce;
    7. For more on these ideas, see the reference: http://blog.csdn.net/v_july_v/article/details/7382693
