set/hash_set and massive-data processing problems

Source: Internet
Author: User
Tags: file, url, repetition

1, The difference between set and hash_set: a container's underlying structure determines its properties. set/map/multiset/multimap are based on red-black trees, so they keep their elements automatically sorted, while hash_set/hash_map/hash_multiset/hash_multimap are based on hash tables, so they have no automatic sorting. The multi_ prefix only means that duplicate keys are allowed.
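A minimal sketch of the difference, using std::set / std::unordered_set (the standard-library counterparts of set and hash_set):

```cpp
#include <iostream>
#include <set>
#include <unordered_set>

int main() {
    std::set<int> ordered{5, 1, 3};           // red-black tree: iteration is always sorted
    std::unordered_set<int> hashed{5, 1, 3};  // hash table: iteration order is unspecified

    for (int x : ordered) std::cout << x << ' ';  // always prints: 1 3 5
    std::cout << '\n';
    for (int x : hashed) std::cout << x << ' ';   // some implementation-defined order
    std::cout << '\n';

    std::multiset<int> dup{3, 3, 1};          // the multi_ variants allow duplicate keys
    std::cout << dup.count(3) << '\n';        // prints: 2
    return 0;
}
```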

2, Find the hottest queries: a search engine records in its log files every query string a user submits; each query string is 1-255 bytes long.

Assume there are currently 10 million records. These query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct ones after removing duplicates. The more a query string is repeated, the more users searched for it and therefore the hotter it is. Count the 10 hottest query strings, using no more than 1 GB of memory.

Solution: although there are 10 million queries, because of the high repetition there are really only 3 million distinct queries, each at most 255 bytes (3,000,000 * 255 bytes < 1 GB), so all the data can be read into memory. All we need now is a suitable data structure, and here a hash table is clearly the first choice. So we skip the divide-and-conquer / hash-mapping step, do the hash statistics directly, and then sort:

    1. Hash statistics: preprocess this batch of data by maintaining a hashtable whose key is the query string and whose value is the number of occurrences of that query, i.e. hash_map(query, value). Each time a query is read, if the string is not in the table, add it with a value of 1; if it is already in the table, increment its count by 1. In the end the statistics are completed with the hash table in O(N) time.
    2. Heap sort: in the second step, use the heap data structure to find the top K, with time complexity N' * log K. With a heap we can find and adjust/move elements in logarithmic time. So maintain a min-heap of size K (here K = 10), then traverse the 3 million distinct queries, comparing each with the root element. The final time complexity is O(N) + N' * O(log K), where N is 10 million and N' is 3 million.

Heap sorting idea: maintain a min-heap of K elements, that is, a min-heap of capacity K. First traverse the first K numbers and assume they are the K largest; building the heap costs O(K) and adjusting it costs O(log K), after which k1 > k2 > ... > kmin (kmin denotes the smallest element in the min-heap). Continue traversing the sequence; for each element x, compare it with the heap-top element: if x > kmin, update the heap (which costs log K), otherwise leave the heap unchanged. The total cost is O(K*log K + (N-K)*log K) = O(N*log K). The method works because operations inside the heap, such as lookup, cost log K.
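A sketch of the hash statistics + min-heap approach in C++, assuming the queries arrive one per line on standard input (std::unordered_map and std::priority_queue stand in for hash_map and the min-heap):

```cpp
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Step 1: hash statistics -- O(N) over all 10 million queries.
    std::unordered_map<std::string, long long> freq;
    std::string query;
    while (std::getline(std::cin, query))
        ++freq[query];                          // insert with count 1, or increment

    // Step 2: keep a min-heap of the K most frequent queries -- O(N' log K).
    const std::size_t K = 10;
    using Entry = std::pair<long long, std::string>;   // (count, query)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& [q, count] : freq) {
        heap.emplace(count, q);
        if (heap.size() > K) heap.pop();        // evict the current minimum
    }

    // The heap now holds the top K; pop them out (ascending by count).
    while (!heap.empty()) {
        std::cout << heap.top().second << '\t' << heap.top().first << '\n';
        heap.pop();
    }
    return 0;
}
```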

3, There is a 1 GB file in which every line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

Solution (1 GB = 5,000 * 200 KB, so divide the file into 5,000 small files of about 200 KB each):

1) Divide and conquer / hash mapping: read the file sequentially; for each word x, compute hash(x) % 5000 and write the word to the corresponding one of 5,000 small files (denoted x0, x1, ..., x4999). Each file is then about 200 KB, and each file holds only words with the same hash value. If any file exceeds 1 MB, keep splitting it in the same way until every resulting small file is under 1 MB.

2) Hash statistics: for each small file, use a trie tree / hash_map to count the words that appear in it and their frequencies.

3) Heap / merge sort: from each small file take out the 100 most frequent words (a min-heap of 100 nodes works) and write those 100 words and their frequencies to a new file, which gives 5,000 result files. The last step is to merge these 5,000 files (similar to merge sort). A sketch of the partitioning step (step 1) follows.
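A sketch of the partitioning step, assuming the big file is named words.txt and the small files chunk_0 ... chunk_4999 (these names are illustrative); the counting and heap steps then run on each chunk exactly as in problem 2:

```cpp
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Split a word-per-line file into `parts` small files by hash(word) % parts,
// so that every occurrence of a given word lands in the same small file.
// Note: 5,000 simultaneously open files may require raising the OS open-file
// limit, or writing the chunks in several passes.
void partition_by_hash(const std::string& input_path, std::size_t parts) {
    std::ifstream in(input_path);
    std::vector<std::ofstream> out(parts);
    for (std::size_t i = 0; i < parts; ++i)
        out[i].open("chunk_" + std::to_string(i));

    std::hash<std::string> h;
    std::string word;
    while (std::getline(in, word))
        out[h(word) % parts] << word << '\n';
}

int main() {
    partition_by_hash("words.txt", 5000);   // hypothetical input file name
    return 0;
}
```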

4, Massive data is distributed across 100 computers. Find a way to efficiently compute the TOP10 of this batch of data.

1) Heap sort: find the TOP10 on each computer, which can be done with a heap of 10 elements (use a max-heap for the 10 smallest, a min-heap for the 10 largest). For the 10 largest, for example, first put the first 10 elements into a min-heap; then scan the rest of the data, comparing each element with the heap top. If it is larger than the heap-top element, replace the top with it and re-adjust the heap back into a min-heap. When the scan is done, the 10 elements in the heap are the TOP10.

2) After finding the TOP10 on each computer, combine the TOP10 lists of the 100 computers into 1,000 candidate values, then apply the same method as above to those 1,000 values to find the overall TOP10.
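A single-process sketch of this two-phase idea, where each "computer" is modeled as a vector of numbers and top_k is a helper using the same size-K min-heap as in problem 2 (the names and toy data are illustrative only):

```cpp
#include <algorithm>
#include <iostream>
#include <queue>
#include <vector>

// Return the k largest values of `data` using a size-k min-heap.
std::vector<long long> top_k(const std::vector<long long>& data, std::size_t k) {
    std::priority_queue<long long, std::vector<long long>, std::greater<long long>> heap;
    for (long long x : data) {
        heap.push(x);
        if (heap.size() > k) heap.pop();
    }
    std::vector<long long> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;
}

// Phase 1: each machine computes its local TOP10.
// Phase 2: the 100 * 10 = 1000 candidates are merged and TOP10 is taken again.
std::vector<long long> distributed_top10(const std::vector<std::vector<long long>>& machines) {
    std::vector<long long> candidates;
    for (const auto& machine : machines) {
        auto local = top_k(machine, 10);
        candidates.insert(candidates.end(), local.begin(), local.end());
    }
    return top_k(candidates, 10);
}

int main() {
    std::vector<std::vector<long long>> machines = {
        {5, 9, 1, 7}, {12, 3, 8}, {6, 6, 15}    // three toy "machines" instead of 100
    };
    for (long long x : distributed_top10(machines)) std::cout << x << ' ';
    std::cout << '\n';                           // the global largest values
    return 0;
}
```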

5, There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat within and across files. Sort the queries by their frequency.

Scenario 1: similar to problem 3.

1) Hash mapping: read the 10 files sequentially and write each query into one of another 10 files according to hash(query) % 10. Each of the newly generated files is then about 1 GB in size (assuming the hash function is random).

2) Hash statistics: find a machine with about 2 GB of memory and, file by file, use hash_map(query, query_count) to count how many times each query appears. Note: hash_map(query, query_count) stores counts, not the repeated queries themselves; each time a query appears, its count is incremented by 1.

3) Heap / quick / merge sort: use quick/heap/merge sort to sort by number of occurrences, write the sorted queries and their query_count to a file, and obtain 10 sorted files. Finally, merge these 10 sorted files (an external merge sort); a sketch of this k-way merge follows.
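A sketch of the final k-way merge, assuming the 10 sorted files are named part_0 ... part_9 (illustrative names) and each line has the form count<TAB>query, sorted by count in descending order:

```cpp
#include <fstream>
#include <iostream>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

int main() {
    const int k = 10;
    std::vector<std::ifstream> files;
    for (int i = 0; i < k; ++i)
        files.emplace_back("part_" + std::to_string(i));  // hypothetical file names

    // (count, file index, rest of line); the max-heap keeps the output descending by count.
    using Item = std::tuple<long long, int, std::string>;
    std::priority_queue<Item> heap;

    auto push_next = [&](int i) {
        long long count;
        std::string rest;
        if (files[i] >> count && std::getline(files[i], rest))
            heap.emplace(count, i, rest);                 // rest still carries its leading tab
    };
    for (int i = 0; i < k; ++i) push_next(i);

    while (!heap.empty()) {
        auto [count, i, query] = heap.top();
        heap.pop();
        std::cout << count << query << '\n';              // emit in globally sorted order
        push_next(i);                                     // refill from the file just consumed
    }
    return 0;
}
```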


Scenario 2: in general the total number of distinct queries is limited, it is only the number of repetitions that is large, so perhaps all distinct queries can be held in memory at once. In that case we can use a trie tree / hash_map to count directly how many times each query appears, and then do a quick/heap/merge sort by the number of occurrences.

Scenario 3: similar to Scenario 1, but after the hash mapping splits the data into multiple files, those files can be handed to several machines for processing using a distributed architecture (such as MapReduce), with a final merge at the end.

6, Given two files a and b, each storing 5 billion URLs of 64 bytes each, with a memory limit of 4 GB, find the URLs common to files a and b.

The size of each file can be estimated as 5 billion * 64 bytes = 320 GB, which is far larger than the 4 GB memory limit, so it is impossible to load a whole file into memory. Consider a divide-and-conquer approach.

1) Divide and conquer / hash mapping: traverse file a, compute hash(url) % 1000 for each URL, and write the URL into one of 1,000 small files according to that value; each small file is then about 300 MB. Traverse file b and distribute its URLs into 1,000 small files in exactly the same way. After this processing, all possibly identical URLs end up in the pair of small files with the same index, and small files with different indexes cannot contain the same URL. So we only need to find the common URLs within each of the 1,000 pairs of small files.

2) Hash statistics: for each pair of small files, load the URLs of one file into a hash_set, then traverse every URL of the other file and check whether it is in the hash_set just built. If it is, it is a common URL; write it to an output file. A sketch for one pair follows.
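A sketch of step 2 for one pair of small files, assuming step 1 has already written matching partitions (the file names here are illustrative); the same function would be called for each of the 1,000 pairs:

```cpp
#include <fstream>
#include <string>
#include <unordered_set>

// Write every URL that appears in both small files a_path and b_path to out_path.
// The ~300 MB file a_path is loaded into a hash_set; b_path is streamed line by line.
void intersect_pair(const std::string& a_path,
                    const std::string& b_path,
                    const std::string& out_path) {
    std::unordered_set<std::string> urls_in_a;
    std::ifstream a(a_path);
    std::string url;
    while (std::getline(a, url))
        urls_in_a.insert(url);

    std::ifstream b(b_path);
    std::ofstream out(out_path, std::ios::app);
    while (std::getline(b, url))
        if (urls_in_a.erase(url))   // present in both; erase so duplicates in b print once
            out << url << '\n';
}

int main() {
    intersect_pair("a_0", "b_0", "common_urls.txt");  // a real driver loops over all 1,000 pairs
    return 0;
}
```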

7, How do you find the value with the largest number of repetitions in massive data?

1) First do a hash mapping to split the contents of the large file into small files.

2) Then do hash statistics to find the most repeated value in each small file and record its repetition count.

3) Finally, use quick sort / heap sort / merge sort over the per-file results from the previous step; the value with the largest repetition count is the answer.

8, Tens of millions or hundreds of millions of data items (with duplicates): find the top N items that occur most often.

1) If the data fits directly in memory, there is no need to hash-map it into multiple small files.

2) Use a hash_map / binary search tree / red-black tree, etc., to count the occurrences.

3) Then take out the top N items with the most occurrences, which can be done with the heap mechanism described in problem 2. A sketch of the in-memory variant follows.
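A minimal sketch of this in-memory variant, assuming the data fits in RAM; it counts with std::map (a red-black tree) and takes the top N with a partial sort instead of a heap (the toy data and N are placeholders):

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

int main() {
    std::vector<int> data = {7, 3, 7, 1, 3, 7, 9, 1, 7};   // stand-in for the real data set
    const std::size_t n = 2;                               // top N to report

    // Count occurrences with a red-black tree (std::map); a hash_map works equally well.
    std::map<int, std::size_t> counts;
    for (int x : data) ++counts[x];

    // Move (value, count) pairs into a vector and partially sort by descending count.
    std::vector<std::pair<int, std::size_t>> items(counts.begin(), counts.end());
    const std::size_t top = std::min(n, items.size());
    std::partial_sort(items.begin(), items.begin() + top, items.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });

    for (std::size_t i = 0; i < top; ++i)
        std::cout << items[i].first << " occurs " << items[i].second << " times\n";
    return 0;
}
```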
