Massive data interview questions----divide and conquer/hash mapping + hash statistics + heap/quick/merge sort

Source: Internet
Author: User
Tags: file, url, repetition

1. From set/map to hashtable, hash_map, and hash_set

The second part of this article refers to hash_map/hash_set several times, so here is a brief introduction to these containers as preparation. In general, STL containers fall into two categories:

Sequence containers (vector/list/deque/stack/queue/heap), and

Associative containers. Associative containers are divided into set and map, plus their two derivatives multiset (multi-key set) and multimap (multi-key map); all of these are implemented on top of an rb-tree (red-black tree). There is also a third group of associative containers whose underlying mechanism is a hashtable: hash_set (hashed set), hash_map (hashed map), hash_multiset (hashed multi-key set), and hash_multimap (hashed multi-key map). In other words, set/map/multiset/multimap each contain an rb-tree, while hash_set/hash_map/hash_multiset/hash_multimap each contain a hashtable.

A so-called associative container, much like a relational database table, stores each element as a key and a value, i.e., a key-value pair. When an element is inserted into an associative container, the container's internal structure (rb-tree or hashtable) places it at the appropriate position according to a specific rule based on its key.

Key-value pairs also appear in non-relational databases. In MongoDB, for example, the document is the basic unit of data, and each document is organized as key-value pairs. A document can contain multiple key-value pairs, and each value can have a different type, such as a string, integer, or list: {"Name": "July", "Sex": "Male", "Age": 23}.

Set/map/multiset/multimap:

Like map, set automatically sorts all elements by their keys, because every operation of set/map simply delegates to the corresponding rb-tree operation; note that neither container allows two elements with the same key.
The difference is that a set element does not have a separate value and key the way a map element does: for set, the key is the value and the value is the key. Every element of a map is a pair holding both a value and a key; the first component of the pair is treated as the key and the second as the value.
As for multiset/multimap, their characteristics and usage are identical to set/map except that they allow duplicate keys; that is, all insertions use rb-tree's insert_equal() rather than insert_unique().
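For concreteness, here is a minimal C++ sketch of the rb-tree based containers described above; the element values are illustrative only.

    // A minimal sketch of the rb-tree based associative containers described above.
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    int main() {
        // set: the element itself is the key; duplicates are rejected (insert_unique).
        std::set<int> s = {3, 1, 2, 2};          // the second 2 is silently dropped
        // map: each element is a pair<const Key, Value>; keys stay sorted.
        std::map<std::string, int> m;
        m["July"] = 23;
        m["Jack"] = 30;

        for (int x : s) std::cout << x << ' ';    // prints "1 2 3" -- automatically sorted
        std::cout << '\n';
        for (const auto& kv : m)                  // iterates in key order: Jack, July
            std::cout << kv.first << '=' << kv.second << ' ';
        std::cout << '\n';

        // multiset: same interface, but duplicate keys are allowed (insert_equal).
        std::multiset<int> ms = {2, 2, 1};
        std::cout << ms.count(2) << '\n';         // prints 2
        return 0;
    }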

Hash_set/hash_map/hash_multiset/hash_multimap:

hash_set and hash_map base all of their operations on hashtable. The difference between them mirrors that of set and map: for hash_set, the key is the value and the value is the key, while each element of hash_map has both a value and a key, so its usage is essentially the same as map's. However, since hash_set/hash_map are built on hashtable, they have no automatic sorting. Why? Because hashtable itself has no automatic sorting.
As for hash_multiset/hash_multimap, their characteristics are exactly the same as multiset/multimap except that their underlying mechanism is hashtable (whereas multiset/multimap, as noted above, are built on rb-tree); therefore their elements are not automatically sorted, and duplicate keys are allowed.
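Here is a matching sketch using std::unordered_set/std::unordered_map, the standardized descendants of hash_set/hash_map; it illustrates the lack of automatic sorting. The values are again illustrative.

    // A minimal sketch using std::unordered_map / std::unordered_set, the standardized
    // equivalents of the SGI hash_map / hash_set mentioned above.
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <unordered_set>

    int main() {
        std::unordered_set<std::string> hs = {"banana", "apple", "cherry"};
        std::unordered_map<std::string, int> hm;
        hm["July"] = 23;
        hm["Jack"] = 30;

        // Iteration order depends on the hash table's bucket layout, NOT on key order:
        // there is no automatic sorting, unlike set/map.
        for (const auto& w : hs) std::cout << w << ' ';
        std::cout << '\n';
        for (const auto& kv : hm) std::cout << kv.first << '=' << kv.second << ' ';
        std::cout << '\n';

        // Lookups are average O(1) instead of the O(log n) of the rb-tree containers.
        std::cout << (hm.count("July") ? "found" : "missing") << '\n';
        return 0;
    }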

So, to sum up: the underlying structure determines the behavior. Because set/map/multiset/multimap are based on an rb-tree, they sort automatically; because hash_set/hash_map/hash_multiset/hash_multimap are based on a hashtable, they do not. The multi_ prefix only means that duplicate keys are allowed.

2. Popular search queries: a search engine records in log files every query string a user submits; each query string is 1-255 bytes long.

Assume there are currently 10 million records (these query strings have a high degree of repetition: although the total is 10 million, there are no more than 3 million distinct ones after removing duplicates). The more often a query string repeats, the more users have searched for it and the more popular it is. Report the 10 hottest query strings, using no more than 1 GB of memory.

Solution: Although there are 10 million queries, the repetition rate is high, so there are in fact only about 3 million distinct queries, each at most 255 bytes (3,000,000 x 255 bytes ≈ 765 MB < 1 GB), so we can read all of them into memory. All we need now is a suitable data structure, and here a hash table is definitely the first choice. So we skip the divide-and-conquer/hash-mapping step, do the hash statistics directly, and then sort:

    1. Hash statistics: preprocess this batch of data by maintaining a hashtable whose key is the query string and whose value is its number of occurrences, i.e., hash_map(query, value). For each query read, if the string is not in the table, add it with a value of 1; if it is already in the table, increment its count by one. In this way the statistics are completed with a hash table in O(n) time.
    2. Heap sort: in the second step, use a heap to find the top k; the time complexity is n'*log k. With a heap we can find and adjust/move elements in logarithmic time. So we maintain a min-heap of size k (= 10), then traverse the 3 million distinct queries, comparing each against the root element. The final time complexity is O(n) + n'*O(log k), where n is 10 million and n' is 3 million.

Heap sorting idea: maintain a min-heap of k elements, i.e., a min-heap with capacity k. First read the first k numbers and assume they are the k largest; building the heap costs O(k), and after adjusting it (costing O(log k)) we have k1 > k2 > ... > kmin (kmin being the smallest element in the min-heap). Continue traversing the sequence: for each element x, compare it with the heap top; if x > kmin, update the heap (costing log k), otherwise leave the heap unchanged. In total this costs O(k*log k + (n-k)*log k) = O(n*log k). The method owes its efficiency to the fact that heap operations such as lookup and adjustment take O(log k).
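A minimal sketch of steps 1 and 2 above, assuming the queries have already been read into a std::vector<std::string> (file reading is omitted); the function name and sample data are illustrative.

    // Hash statistics with unordered_map, then top-k extraction with a size-k min-heap.
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <vector>

    std::vector<std::pair<std::string, long long>>
    top_k_queries(const std::vector<std::string>& queries, std::size_t k) {
        // Step 1: hash statistics, O(n).
        std::unordered_map<std::string, long long> count;
        for (const auto& q : queries) ++count[q];

        // Step 2: min-heap of size k over (count, query); O(n' log k), n' = distinct queries.
        using Entry = std::pair<long long, std::string>;
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (const auto& kv : count) {
            heap.emplace(kv.second, kv.first);
            if (heap.size() > k) heap.pop();      // evict the current minimum
        }

        std::vector<std::pair<std::string, long long>> result;
        while (!heap.empty()) {
            result.push_back({heap.top().second, heap.top().first});
            heap.pop();
        }
        return result;                            // ascending by count
    }

    int main() {
        std::vector<std::string> log = {"a", "b", "a", "c", "a", "b"};
        for (const auto& r : top_k_queries(log, 2))
            std::cout << r.first << ": " << r.second << '\n';   // b: 2, a: 3
    }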

3. There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.

Solution (1 GB = 5,000 x 200 KB: divide the file into 5,000 small files of roughly 200 KB each):

1) Divide-and-conquer/hash mapping: read the file sequentially; for each word x, compute hash(x) % 5000 and write x into the corresponding one of 5,000 small files (call them x0, x1, ..., x4999). Each file is then about 200 KB, and all words that hash to the same bucket end up in the same file. If any file exceeds the 1 MB limit, keep splitting it in the same way until every small file is below 1 MB. (A sketch of this partitioning step follows the list below.)

2) Hash statistics: for each small file, use a trie tree or hash_map to count the words it contains and their frequencies.

3) Heap/merge sort: from each small file take the 100 most frequent words (a min-heap of 100 nodes works) and write these 100 words with their frequencies to a file, giving 5,000 result files. The last step is to merge these 5,000 files, much like a merge sort.
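For illustration, here is a minimal sketch of the partitioning step 1), assuming the input is a file "words.txt" with one word per line; the bucket count and output file names are assumptions for the demo (the problem above uses 5,000 buckets, which in practice would need to respect the OS limit on simultaneously open files, e.g., by writing in batches).

    #include <cstddef>
    #include <fstream>
    #include <functional>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t kBuckets = 16;                       // 5,000 in the problem above
        std::vector<std::ofstream> buckets(kBuckets);
        for (std::size_t i = 0; i < kBuckets; ++i)
            buckets[i].open("bucket_" + std::to_string(i) + ".txt");

        std::hash<std::string> hasher;
        std::ifstream in("words.txt");
        std::string word;
        while (in >> word) {
            // Every occurrence of the same word hashes to the same bucket, so all of its
            // occurrences land in the same small file and can be counted file by file.
            buckets[hasher(word) % kBuckets] << word << '\n';
        }
        return 0;
    }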

4. Massive data is distributed across 100 computers; find a way to efficiently compute the TOP10 of the whole data set.

1) Heap sort: on each computer find its TOP10, which can be done with a heap of 10 elements (a max-heap for the 10 smallest, a min-heap for the 10 largest). For the 10 largest, for example, first put the first 10 elements into a min-heap; then scan the remaining data, comparing each element against the heap top: if it is larger than the top, replace the top with it and re-adjust the heap. The 10 elements left in the heap are that computer's TOP10.

2) Collect the TOP10 from each computer, combine the TOP10 lists of all 100 computers into 1,000 candidate items, and then apply the same method as above to find the overall TOP10.

A reader raised a question about this solution to problem 4: consider, for example, the TOP2 of 2 files. According to the above algorithm, if the first file contains:
a: 49 times
b: 50 times
c: 2 times
d: 1 time
and the second file contains:
a: 9 times
b: 1 time
c: 11 times
d: 10 times
then the first file's TOP2 is b (50 times) and a (49 times), and the second file's TOP2 is c (11 times) and d (10 times). Merging the two TOP2 lists, b (50), a (49) and c (11), d (10), gives b (50 times) and a (49 times) as the overall TOP2, whereas in reality a occurs 58 times and b only 51 times. Is this really a flaw, and if so, what is the fix?

The fix: first traverse all the data once and hash each item (which guarantees that identical items are sent to the same computer), redistributing the data across the 100 computers according to the hash value, and then run the algorithm described above. Because every occurrence of a given item, such as a, now lands on a single computer, each computer only needs to sum the counts of its own items before taking its local TOP10, with no dependence on the other machines, and the scale of the problem shrinks accordingly.
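A minimal sketch of this fix, assuming the per-machine counts are available as (item, count) records; the record type, machine count, and sample data (taken from the a/b/c/d example above) are illustrative.

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Record { std::string item; long long count; };

    // Partition records across `machines` buckets; bucket i is the workload for machine i.
    std::vector<std::vector<Record>>
    route_by_hash(const std::vector<Record>& records, std::size_t machines) {
        std::vector<std::vector<Record>> buckets(machines);
        std::hash<std::string> hasher;
        for (const auto& r : records)
            buckets[hasher(r.item) % machines].push_back(r);
        return buckets;
    }

    // On one machine, sum the counts of identical items; the TOP10 step then runs on
    // these totals exactly as described before.
    std::unordered_map<std::string, long long>
    sum_local(const std::vector<Record>& local) {
        std::unordered_map<std::string, long long> totals;
        for (const auto& r : local) totals[r.item] += r.count;
        return totals;
    }

    int main() {
        // The a/b/c/d counts from the two files in the example above.
        std::vector<Record> all = {
            {"a", 49}, {"b", 50}, {"c", 2},  {"d", 1},
            {"a", 9},  {"b", 1},  {"c", 11}, {"d", 10}
        };
        for (const auto& bucket : route_by_hash(all, 2))
            for (const auto& kv : sum_local(bucket))
                std::cout << kv.first << ": " << kv.second << '\n';  // a:58 b:51 c:13 d:11, in some order
    }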
5. There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat across files. Sort the queries by frequency.

Scenario 1 (similar to question 3):

1) Hash mapping: read the 10 files sequentially and write each query to one of 10 new files according to hash(query) % 10. Each newly generated file is about 1 GB in size (assuming the hash function is roughly uniform).

2) Hash statistics: find a machine with about 2 GB of memory and, for each new file in turn, use hash_map(query, query_count) to count how many times each query appears. Note: the hash_map is used to count occurrences of each query, not to store the queries' contents; each time a query appears, its count is incremented by one.

3) Heap/quick/merge sort: sort by occurrence count using quick/heap/merge sort, and output the sorted queries with their corresponding query_count to a file, giving 10 sorted files. Finally, merge these 10 files (an external merge sort).


Scenario 2: in general the total number of distinct queries is limited, it is just that many of them repeat, so it may be possible to fit all distinct queries in memory at once. In that case we can use a trie tree or hash_map to count how many times each query appears directly, and then do a quick/heap/merge sort by occurrence count.
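A minimal sketch of Scenario 2, assuming all distinct queries fit in memory; the sample queries are placeholders.

    // Count with a hash map, then sort the (query, count) pairs by count, descending.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main() {
        std::vector<std::string> queries = {"q1", "q2", "q1", "q3", "q1", "q2"};

        std::unordered_map<std::string, long long> count;
        for (const auto& q : queries) ++count[q];

        std::vector<std::pair<std::string, long long>> freq(count.begin(), count.end());
        std::sort(freq.begin(), freq.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        for (const auto& kv : freq)
            std::cout << kv.first << ": " << kv.second << '\n';   // q1:3, q2:2, q3:1
    }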

Scenario 3: similar to Scenario 1, but after hashing the data into multiple files, the files can be handed to several machines for processing in a distributed architecture (such as MapReduce), with a final merge at the end.

6. Given two files a and b, each storing 5 billion URLs of 64 bytes each, with a memory limit of 4 GB, find the URLs common to a and b.

The size of each file can be estimated as 5 billion x 64 bytes = 320 GB, far larger than the 4 GB memory limit, so it is impossible to load a file fully into memory. Consider a divide-and-conquer approach.

1) Divide-and-conquer/hash mapping: traverse file a and, for each URL, compute hash(url) % 1000, storing the URL in the corresponding one of 1,000 small files (call them a0, a1, ..., a999); each small file is then about 320 MB. Traverse file b in the same way, storing its URLs in 1,000 small files (b0, b1, ..., b999). After this processing, any URL present in both original files must appear in a corresponding pair of small files (ai and bi for the same i); small files with different indices cannot share a URL. So we only need to find the common URLs within each of the 1,000 pairs of small files.

2) Hash statistics: to find the common URLs in a pair of small files, load the URLs of one of the two files into a hash_set, then traverse every URL of the other file and check whether it is in the hash_set just built; if it is, it is a common URL and can be written to the output file.
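A minimal sketch for a single pair of small files; the file names a_0.txt, b_0.txt, and common_0.txt are assumptions for the demo.

    // Load the URLs of a_i into an unordered_set, then stream b_i through it.
    #include <fstream>
    #include <string>
    #include <unordered_set>

    int main() {
        std::ifstream file_a("a_0.txt"), file_b("b_0.txt");
        std::ofstream common("common_0.txt");

        std::unordered_set<std::string> urls_a;
        std::string url;
        while (std::getline(file_a, url)) urls_a.insert(url);    // ~320 MB fits in 4 GB

        while (std::getline(file_b, url))
            if (urls_a.count(url)) common << url << '\n';        // URL present in both files
        return 0;
    }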

7. How do we find the item with the highest repetition count in a massive data set?

1) Do hash mapping first, splitting the contents of the large file into small files.

2) Then do hash statistics: find the most-repeated item in each small file and record its repetition count.

3) Finally, use quick/heap/merge sort to find, among the per-file results from the previous step, the item with the largest repetition count; that is the answer. (A sketch for a single small file follows.)
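A minimal sketch of steps 2) and 3) for one small file, assuming its contents fit in memory; the sample data is illustrative, and the per-file winners would then be compared across files for the global answer.

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main() {
        std::vector<std::string> small_file = {"x", "y", "x", "z", "x", "y"};

        // Hash statistics: count each item's occurrences.
        std::unordered_map<std::string, long long> count;
        for (const auto& v : small_file) ++count[v];

        // Keep the item with the largest count for this file.
        std::string best;
        long long best_count = 0;
        for (const auto& kv : count)
            if (kv.second > best_count) { best = kv.first; best_count = kv.second; }

        std::cout << best << " repeats " << best_count << " times\n";   // x repeats 3 times
    }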

8. Given tens of millions or billions of data items (with duplicates), find the top n items that occur most often.

1) If the data fits directly in memory, there is no need to hash-map it into multiple small files.

2) Use a hash_map, binary search tree, red-black tree, or similar structure to count occurrences.

3) Then take out the top n most frequent items; this can be done with the heap mechanism described in question 2.
