Teach you how to quickly kill 99% of massive data processing interview questions


Reposted from: http://blog.csdn.net/v_july_v/article/details/7382693

Author: July
From: the "Method of Structure, Way of Algorithm" blog


Preface

In general, a title containing words like "instant kill", "99%", or "strongest in history" tends to invite accusations of grandstanding. But going further: if a reader finishes this article and gains nothing from it, then I am willing to bear such a charge :-). At the same time, this article can be read as a general, more abstract summary of the earlier post "10 massive data processing interview questions and 10 methods".

Constrained by length and by its focus on ideas rather than details, this article leaves out most of the specifics and only discusses methods and patterns of thinking, trying to explain the relevant problems in the plainest, most straightforward language. Finally, it must be stressed that the whole article is written from the angle of analyzing interview questions; in real engineering everything depends on the concrete situation, and real scenarios are far more complex than anything described here.

OK, if you have any questions, please feel free to point them out. Thank you.

What is massive data processing?

So-called massive data processing is simply storage, processing, and computation over massive amounts of data. "Massive" means the data volume is so large that it either cannot be solved quickly within a short time, or cannot be loaded into memory all at once.

What about the solution? Against the time constraint, we can use clever algorithms paired with suitable data structures, such as a Bloom filter, hashing, a bit-map, a heap, a database or inverted index, or a trie tree. Against the space constraint, there is really only one way: turn the big into the small, i.e. divide and conquer (hash mapping). You cannot complain that the scale is too large; cut the large scale into small pieces, conquer each piece, and you are done.

As for the so-called single machine versus cluster question: put simply, a single machine handles and loads the data on one node (relying only on its own CPU, memory, and hard disk for data interaction), whereas a cluster has many machines and is suited to distributed processing and parallel computing (where the data interaction between nodes must also be considered).

Moreover, from this blog's earlier article on massive data processing, "Big data processing", we already know roughly that handling massive-data problems comes down to nothing more than:

1. divide and conquer/hash mapping + hash statistics + heap/quick/merge sort;
2. double-layer bucket division;
3. Bloom filter/Bitmap;
4. trie tree/database/inverted index;
5. external sorting;
6. distributed processing with Hadoop/MapReduce.

Below, the first part of this article goes from set/map to hashtable/hash_map/hash_set, briefly introducing set/map/multiset/multimap and hash_set/hash_map/hash_multiset/hash_multimap and the differences between them (a lofty tower rises from the ground, and the foundation matters most). The second part then elaborates on the six method patterns above, each combined with the corresponding massive data processing interview questions.

The first part: from set/map to hashtable/hash_map/hash_set

Since the second part of this article will refer to hash_map/hash_set several times, here is a brief introduction to these containers as groundwork. In general, STL containers fall into two categories: sequence containers (vector/list/deque/stack/queue/heap) and associative containers. Associative containers are divided into set and map, together with their derivatives multiset (multi-key set) and multimap (multi-key map), all of which are implemented on top of an rb-tree (red-black tree). In addition, there is a third group of associative containers, namely hashtable (the hash table itself) and hash_set (hash set)/hash_map (hash map)/hash_multiset (hash multi-key set)/hash_multimap (hash multi-key map), which use hashtable as their underlying mechanism. In other words, set/map/multiset/multimap each contain an rb-tree, and hash_set/hash_map/hash_multiset/hash_multimap each contain a hashtable.

A so-called associative container is similar to a relational database: each datum, or each element, has a key and a real value, i.e. a key-value pair. When an element is inserted into an associative container, the container's internal structure (an rb-tree or a hashtable) places it at the proper position according to a particular rule, based on its key.

The same idea appears outside associative containers as well; for example, in the non-relational database MongoDB the document is the most basic unit of data, and each document is organized as key-value pairs. A document can hold multiple key-value pairs, and each value can be of a different type, such as a string, an integer, or a list. For example:
{"Name": "July",
"Sex": "Male",
"Age": 23}

Set/map/multiset/multimap

Like map, set automatically sorts all of its elements by their keys, because every operation on set/map simply delegates to the corresponding operation of the underlying rb-tree. Note, however, that neither of the two allows elements with duplicate keys.
The difference is that a set element does not separate the real value from the key the way a map element does: for set, the key is the value and the value is the key. Every element of a map is a pair holding both a real value and a key; the first member of the pair is treated as the key and the second as the value.
As for multiset/multimap, their characteristics and usage are identical to set/map except that they allow duplicate keys, i.e. all insertions go through the rb-tree's insert_equal() instead of insert_unique().
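
To make the distinction concrete, here is a minimal C++ sketch (an illustration added here, not from the original article) showing that set/map keep keys unique and automatically sorted while multiset keeps duplicates; all of it rides on the underlying rb-tree:

#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
    std::set<int> s = {3, 1, 3, 2};          // the duplicate 3 is silently dropped (insert_unique)
    std::multiset<int> ms = {3, 1, 3, 2};    // the duplicate 3 is kept (insert_equal)

    std::map<std::string, int> m;            // every element is a pair<const key, value>
    m["july"] = 23;
    m["abc"]  = 1;

    for (int x : s)  std::cout << x << ' ';  // prints: 1 2 3  (sorted, unique)
    std::cout << '\n';
    for (int x : ms) std::cout << x << ' ';  // prints: 1 2 3 3  (sorted, duplicates kept)
    std::cout << '\n';
    for (const auto& kv : m)                 // iteration follows sorted key order: abc, july
        std::cout << kv.first << '=' << kv.second << ' ';
    std::cout << '\n';
    return 0;
}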

Hash_set/hash_map/hash_multiset/hash_multimap

All operations on hash_set/hash_map are based on hashtable. The difference: like set, hash_set does not separate the real value from the key (the key is the value and the value is the key), while hash_map, like map, gives every element both a real value and a key, so its usage is essentially the same as map above. However, because hash_set/hash_map are built on hashtable, they have no automatic sorting. Why? Because hashtable itself has no automatic sorting.
As for hash_multiset/hash_multimap, their characteristics are exactly the same as multiset/multimap; the only difference is that the underlying mechanism of hash_multiset/hash_multimap is hashtable (whereas, as said above, multiset/multimap use rb-tree underneath), so their elements are not automatically sorted, but duplicate keys are allowed.

So, to sum it up plainly: the underlying structure determines the behavior. Set/map/multiset/multimap are based on rb-tree, so they sort automatically; hash_set/hash_map/hash_multiset/hash_multimap are based on hashtable, so they do not. As for the multi prefix, it merely means that duplicate keys are allowed.
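
As a small illustration of that summary (using std::unordered_map, the standardized descendant of hash_map, since the old hash_map is a non-standard extension): the rb-tree based std::map iterates in sorted key order, while the hashtable-based container guarantees no order at all.

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main() {
    std::map<std::string, int> ordered;            // rb-tree underneath: sorted iteration
    std::unordered_map<std::string, int> hashed;   // hashtable underneath: no sorting

    for (const char* w : {"pear", "apple", "mango"}) {
        ++ordered[w];
        ++hashed[w];
    }
    for (const auto& kv : ordered) std::cout << kv.first << ' ';  // apple mango pear
    std::cout << '\n';
    for (const auto& kv : hashed)  std::cout << kv.first << ' ';  // some implementation-defined order
    std::cout << '\n';
    return 0;
}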

In addition, on what a hash is, see this blog's article: http://blog.csdn.net/v_JULY_v/article/details/6256463; on red-black trees, see the series of articles: http://blog.csdn.net/v_july_v/article/category/774945; on concrete applications of hash_map: http://blog.csdn.net/sdhongjun/article/details/4517325; and on hash_set: http://blog.csdn.net/morewindows/article/details/7330323.

OK, next let us look at the second part of this article: the six keys to handling massive data problems.


The second part: six keys to handling massive data problems

Key one: divide and conquer/hash mapping + hash statistics + heap/quick/merge sort

1. Massive log data: extract the IP that visited Baidu the most times on a given day.
Since this is massive data processing, the data handed to us is presumably huge. How do we get started on such a volume? Right, it is exactly divide and conquer/hash mapping + hash statistics + heap/quick/merge sort; in plain words, first map, then count, finally sort:
Divide and conquer/hash mapping: the data is too large and memory is limited, so the only option is to turn the large file into smaller ones (by modulo mapping), i.e. the sixteen-character policy: turn the big into the small, divide and conquer, reduce the scale, solve piece by piece.
Hash statistics: once the large file has been turned into small files, we can use an ordinary hash_map(ip, value) to do frequency counting on each of them.
Heap/quick sort: after the counting is done, sort (heap sort, for example) to obtain the IP with the highest count.

Specifically, it is this: "First, for that day, take the IPs in the logs of visits to Baidu and write them out to a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. We can use a mapping approach, for example taking each IP modulo 1000, to map the whole large file into 1000 small files, and then find the most frequent IP in each small file (a hash_map can be used to count the frequency of all the IPs in each of the 1000 files, and then, per file, the IP with the highest frequency and its count are recorded). Finally, among those 1000 most-frequent IPs, find the one with the highest frequency; that is the answer." (from "10 massive data processing interview questions and 10 methods").

There are a few more notes on this problem, as follows:

1. Hashing by modulo is an equivalence mapping, so the same element can never be scattered into different small files. That is, with the mod 1000 scheme used here, the same IP can only fall into the same file after hashing; it cannot be split across files.
2. What exactly is a hash mapping? Simply put, in order to let the computer process big data within limited memory, the data is distributed evenly across the corresponding locations through a mapping, i.e. a hash (for example, big data is mapped into a small tree that fits in memory, or a large file is mapped into several smaller files). This mapping is what we usually call a hash function; a well-designed hash function spreads the data evenly and reduces collisions. Although the data is mapped to different locations, it is still the same data; only the form in which it is represented has changed.
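
As a rough sketch of the whole mapping -> counting -> selection flow described above for question 1 (not the original article's code): the file names ip.log and bucket_*.txt and the bucket count of 1000 are illustrative assumptions.

#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const int kBuckets = 1000;
    std::hash<std::string> hashFn;

    // Divide and conquer/hash mapping: scatter the IPs into 1000 small files by
    // hash(ip) % 1000, so every occurrence of one IP lands in the same file.
    // (A real implementation would mind the OS limit on open file handles.)
    {
        std::ifstream log("ip.log");
        std::vector<std::ofstream> out(kBuckets);
        for (int i = 0; i < kBuckets; ++i)
            out[i].open("bucket_" + std::to_string(i) + ".txt");
        std::string ip;
        while (log >> ip)
            out[hashFn(ip) % kBuckets] << ip << '\n';
    }

    // Hash statistics + selection: count one bucket at a time with a hash map
    // (which now fits in memory) and keep the overall winner.
    std::string bestIp;
    long long bestCount = 0;
    for (int i = 0; i < kBuckets; ++i) {
        std::ifstream in("bucket_" + std::to_string(i) + ".txt");
        std::unordered_map<std::string, long long> freq;
        std::string ip;
        while (in >> ip)
            ++freq[ip];
        for (const auto& kv : freq)
            if (kv.second > bestCount) {
                bestCount = kv.second;
                bestIp = kv.first;
            }
    }
    std::printf("most frequent IP: %s (%lld hits)\n", bestIp.c_str(), bestCount);
    return 0;
}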

In addition, a reader, quicktest, has tested this problem in practice with Python; the write-up is here: http://blog.csdn.net/quicktest/article/details/7453189. Thank you. OK, those who are interested can also look into the consistent hashing algorithm; see this blog's article: http://blog.csdn.net/v_july_v/article/details/6879101.

2. Find hot queries: count the 10 most popular query strings among 3 million queries.

Problem: the search engine records, via log files, every query string the user submits, each string being 1-255 bytes long. Assume there are currently 10 million records (these query strings repeat heavily: although the total is 10 million, there are no more than 3 million after removing duplicates; the more often a query string repeats, the more users issued it, and the hotter it is). Report the 10 hottest query strings, using no more than 1 GB of memory.

Answer: from question 1 above we know how to divide data that is too large; but what if the data is comparatively small and can be loaded into memory all at once? That is the case here: although there are 10 million queries, because of the heavy repetition there are really only 3 million distinct ones, each at most 255 bytes, so we can consider keeping them all in memory (assume the 3 million strings are all distinct and all of maximum length; then they need about 3,000,000 x 255 bytes, roughly 0.75 GB). All the strings can therefore be held in memory for processing; all we need now is a suitable data structure, and here a hashtable is definitely our first choice.

So we skip the divide and conquer/hash mapping step and go straight to hash statistics followed by sorting. For this kind of classic top-k problem, the usual countermeasure is: hash_map + heap. As follows:
Hash statistics: first pre-process this batch of massive data by maintaining a hashtable whose key is the query string and whose value is its number of occurrences, i.e. hash_map(query, value). Read the queries one by one; if a string is not in the table, add it with a value of 1; if it is, increment its count by one. In the end we complete the counting with a hash table in O(N) time.
Heap sort: the second step uses the heap data structure to find the top k, with time complexity N' * O(log k). With the help of the heap we can find and adjust/move elements in logarithmic time, so maintain a small root heap of size k (10 here) and traverse the 3 million queries, comparing each with the root element. Our final time complexity is therefore O(N) + N' * O(log k), where N is 10 million and N' is 3 million.

Do not forget the heap sort idea described in this earlier article: "Maintain a minimum heap of k elements, i.e. use the first k numbers traversed to build a min heap of capacity k, and assume for now that they are the largest k numbers. Building the heap costs O(k), and adjusting it costs O(log k), after which k1 > k2 > ... > kmin (kmin being the smallest element in the heap). Then continue traversing the sequence; for each element x, compare it with the heap root: if x > kmin, update the heap (pushing x in costs log k), otherwise leave the heap alone. Altogether this costs O(k*log k + (n-k)*log k) = O(n*log k). The method works because operations inside the heap, such as lookup, cost log k." (from "Chapter three continued: implementations of the top-k problem").
Of course, you can also use a trie tree, with a key field recording how many times each query string occurs (0 if it never occurs). Finally, use a min heap of 10 elements to sort the query strings by frequency of occurrence.
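
To make the hash_map + size-k min-heap recipe tangible, here is a minimal C++ sketch (an illustration under the assumption that the queries fit in a std::vector; the helper name topK is made up here, and std::priority_queue stands in for a hand-rolled small root heap):

#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns the k most frequent queries, least frequent of the k first.
std::vector<std::pair<std::string, long long>>
topK(const std::vector<std::string>& queries, std::size_t k) {
    // Hash statistics: O(N) counting.
    std::unordered_map<std::string, long long> freq;
    for (const auto& q : queries)
        ++freq[q];

    // Min-heap of size k keyed by count: the root is always the weakest of the
    // current top-k candidates, so each distinct query costs at most O(log k).
    using Entry = std::pair<long long, std::string>;            // (count, query)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& kv : freq) {
        heap.emplace(kv.second, kv.first);
        if (heap.size() > k)
            heap.pop();                                         // evict the current minimum
    }

    std::vector<std::pair<std::string, long long>> result;
    while (!heap.empty()) {
        result.push_back({heap.top().second, heap.top().first});
        heap.pop();
    }
    return result;
}

int main() {
    std::vector<std::string> queries = {"a", "b", "a", "c", "a", "b"};
    for (const auto& r : topK(queries, 2))
        std::cout << r.first << ": " << r.second << '\n';       // prints "b: 2" then "a: 3"
    return 0;
}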

3. There is a 1 GB file in which every line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.
From the two examples above, the routine of divide and conquer + hash statistics + heap/quick sort already feels tried and true. Next, let us verify it on a few more problems. Look at question 3: the file is very large and memory is limited; what can we do? Nothing more than:
Divide and conquer/hash mapping: read the file sequentially; for each word x, take hash(x) % 5000 and store the word into one of 5000 small files (call them x0, x1, ..., x4999) according to that value. Each file is then roughly 200 KB. If any file exceeds the 1 MB limit, keep splitting it in the same way until no small file is larger than 1 MB.
Hash statistics: for each small file, use a trie tree or hash_map to count the words appearing in that file and their frequencies.
Heap/merge sort: take the 100 most frequent words from each file (a min heap with 100 nodes will do) and save these 100 words with their frequencies into a new file, which yields 5000 files. The last step is to merge these 5000 files (similar to a merge sort).

4. Massive data is distributed across 100 computers; find a way to efficiently compute the TOP10 of the whole data set.

This question is similar to question 3 above. Heap sort: on each computer, find its TOP10 using a heap of 10 elements (for the smallest TOP10 use a max heap, for the largest TOP10 use a min heap). For the largest TOP10, for example, first put the first 10 elements into a min heap, then scan the remaining data and compare each item with the heap root; if it is larger than the root, replace the root with it and re-adjust the heap. In the end, the elements in the heap are that computer's TOP10. Then gather the TOP10 of the 100 computers, 1000 data items in total, and apply the same method to find the overall TOP10.

Readers have raised a problem with this answer to question 4. Take the TOP2 of 2 files as an example. Suppose the first file contains: a 49 times, b 50 times, c 2 times, d 1 time; and the second file contains: a 9 times, b 1 time, c 11 times, d 10 times. The TOP2 of the first file is b (50 times), a (49 times); the TOP2 of the second file is c (11 times), d (10 times). Merging the two TOP2s, b (50 times) a (49 times) with c (11 times) d (10 times), gives b (50 times), a (49 times) as the overall TOP2, yet in reality a occurs 58 times and b 51 times. Is this really a flaw? If so, what is the fix?

As reader "the old dream" put it: first traverse the data once, hashing every item (to guarantee that identical items are routed to the same computer), then redistribute the data to the 100 computers according to the hash result, and run the algorithm above as before. Finally, because an item such as a may originally have appeared on several computers with partial counts, after the redistribution its partial counts are summed on one machine; thanks to the hashing step, each computer only needs to aggregate its own items, without consulting the other computers, which keeps the scale down.

5. There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat across files. Sort the queries by their frequency.

Scheme 1: apply the routine directly:
Hash mapping: read the 10 files in sequence and write each query into one of another 10 files (call them a0, a1, ..., a9) according to hash(query) % 10. Each of the newly generated files is then about 1 GB in size (assuming the hash function is random).
Hash statistics: find a machine with about 2 GB of memory and, file by file, use hash_map(query, query_count) to count how many times each query appears. Note: hash_map(query, query_count) is used to count the occurrences of each query, not to store the duplicate queries themselves; whenever a query appears, its count is incremented by 1.
Heap/quick/merge sort: use quick/heap/merge sort to sort each file's queries by occurrence count, writing the sorted queries and their query_count to a file, which gives 10 sorted files. Finally, merge these 10 files (a merge combined with external sorting).

For this scheme 1, here is an implementation: https://github.com/ooooola/sortquery/blob/master/querysort.py. See also the merge sketch after scheme 3 below. In addition, here are two other approaches to this problem:
Scheme 2: in general, the total number of distinct queries is limited; it is the repetition count that is high, so perhaps all of the queries can fit into memory at once. In that case we can use a trie tree or hash_map to count the occurrences of each query directly, and then do a quick/heap/merge sort by occurrence count.

Scheme 3: similar to scheme 1, but after the hash split the files can be handed to several machines for processing, using a distributed architecture (such as MapReduce), with a final merge at the end.
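
To illustrate the last step of scheme 1 (merging the 10 already-sorted files, in the spirit of external sorting), here is a rough C++ sketch of a k-way merge; the file names sorted_0.txt .. sorted_9.txt, the merged.txt output, and the "query count" line format are assumptions for illustration, not part of the original problem.

#include <fstream>
#include <queue>
#include <string>
#include <vector>

struct Head {
    long long   count;
    std::string query;
    int         file;     // which input file this entry came from
    bool operator<(const Head& o) const { return count < o.count; }   // max-heap by count
};

int main() {
    const int kFiles = 10;
    std::vector<std::ifstream> in(kFiles);
    std::priority_queue<Head> heap;

    // Pull the next (query, count) line from file i into the heap, if any.
    auto pull = [&](int i) {
        std::string q; long long c;
        if (in[i] >> q >> c) heap.push({c, q, i});
    };

    for (int i = 0; i < kFiles; ++i) {
        in[i].open("sorted_" + std::to_string(i) + ".txt");
        pull(i);
    }

    // Each file is sorted by descending count, so the heap root is always the
    // globally largest remaining entry; emit it and refill from its file.
    std::ofstream out("merged.txt");
    while (!heap.empty()) {
        Head h = heap.top();
        heap.pop();
        out << h.query << ' ' << h.count << '\n';
        pull(h.file);
    }
    return 0;
}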

6. Given two files a and b, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, find the URLs that a and b have in common.

The size of each file can be estimated at 5 billion x 64 bytes = 320 GB, far larger than the 4 GB memory limit, so it is impossible to load a file fully into memory for processing. Consider a divide and conquer approach.
Divide and conquer/hash mapping: traverse file a, compute hash(url) % 1000 for each URL, and store the URL into one of 1000 small files (call them a0, a1, ..., a999) according to that value; each small file is then about 300 MB. Traverse file b and store its URLs into 1000 small files (b0, b1, ..., b999) in the same way. After this processing, all the URLs that could possibly be the same are in a pair of corresponding small files (ai and bi); non-corresponding small files cannot share a URL. So we only need to find the common URLs within each of the 1000 pairs of small files.
Hash statistics: to find the common URLs of a pair of small files, store the URLs of one of them in a hash_set, then traverse each URL of the other small file and check whether it is in the hash_set just built; if it is, it is a common URL, so write it out to a result file.
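
Here is a minimal sketch of one round of the per-bucket intersection just described: load ai's URLs into a hash set (std::unordered_set here), then stream bi and keep the hits. The splitting of the two original files into a_0..a_999 and b_0..b_999 by hash(url) % 1000 follows the same pattern as question 1 and is omitted; the file names are assumptions.

#include <fstream>
#include <string>
#include <unordered_set>

int main() {
    std::ofstream out("common_urls.txt");
    for (int i = 0; i < 1000; ++i) {
        // a_i is roughly 300 MB of URLs -- small enough to hold in a hash set.
        std::unordered_set<std::string> seen;
        std::ifstream a("a_" + std::to_string(i) + ".txt");
        for (std::string url; std::getline(a, url); )
            seen.insert(url);

        // Any URL common to the two original files must land in the same bucket
        // index on both sides, so checking b_i against a_i alone is enough.
        std::ifstream b("b_" + std::to_string(i) + ".txt");
        for (std::string url; std::getline(b, url); )
            if (seen.erase(url))             // erase() doubles as a membership test
                out << url << '\n';          // and prevents duplicate output lines
    }
    return 0;
}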

OK, that was the first method: divide and conquer/hash mapping + hash statistics + heap/quick/merge sort. Now let us look at the last 4 questions under it, as follows:

7. How to find the item with the most repetitions in massive data?

Scheme 1: first hash the data and map it by modulo into small files; find the item with the most repetitions in each small file and record its repetition count; then, among the candidates from the previous step, find the one whose repetition count is largest (refer to the previous questions for details).

8. Tens of millions or billions of data items (with duplicates): find the top N items that occur most often.

Scheme 1: for tens of millions or billions of items, the memory of today's machines should be able to hold the counts. So consider using a hash_map, a binary search tree, a red-black tree, or the like to count the occurrences. Then extract the top N most frequent items, which can be done with the heap mechanism mentioned in question 2.

9. A text file of about 10,000 lines, one word per line: count the 10 most frequent words, give the idea, and give a time complexity analysis.

Scheme 1: this question is about time efficiency. Count the occurrences of each word with a trie tree; the time complexity is O(n*le) (where le denotes the average word length). Then find the 10 most frequent words, which can be done with a heap as described in the earlier questions, with time complexity O(n*lg10). So the total time complexity is the larger of O(n*le) and O(n*lg10).
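
Here is a rough C++ sketch of the trie-based counting (an illustration, not the original author's code): each node carries a counter, inserting a word of length L costs O(L), and the final top-10 selection reuses the size-k min-heap idea from question 2. Restricting words to lower-case a-z is an assumed simplification.

#include <iostream>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct TrieNode {
    int count = 0;                                   // times a word ends exactly here
    std::unique_ptr<TrieNode> child[26];
};

void insert(TrieNode* root, const std::string& word) {
    TrieNode* cur = root;
    for (char ch : word) {
        int c = ch - 'a';
        if (!cur->child[c]) cur->child[c] = std::make_unique<TrieNode>();
        cur = cur->child[c].get();
    }
    ++cur->count;                                    // O(le) per word, le = word length
}

// Depth-first walk collecting (count, word) pairs for every stored word.
void collect(const TrieNode* node, std::string& prefix,
             std::vector<std::pair<int, std::string>>& out) {
    if (node->count > 0) out.push_back({node->count, prefix});
    for (int c = 0; c < 26; ++c)
        if (node->child[c]) {
            prefix.push_back(static_cast<char>('a' + c));
            collect(node->child[c].get(), prefix, out);
            prefix.pop_back();
        }
}

int main() {
    TrieNode root;
    for (const char* w : {"the", "quick", "the", "fox", "the", "quick"})
        insert(&root, w);

    std::vector<std::pair<int, std::string>> words;
    std::string prefix;
    collect(&root, prefix, words);

    // Size-10 min-heap over the counts, exactly as in question 2.
    std::priority_queue<std::pair<int, std::string>,
                        std::vector<std::pair<int, std::string>>,
                        std::greater<std::pair<int, std::string>>> heap;
    for (const auto& w : words) {
        heap.push(w);
        if (heap.size() > 10) heap.pop();
    }
    while (!heap.empty()) {                          // least frequent of the top 10 first
        std::cout << heap.top().second << ": " << heap.top().first << '\n';
        heap.pop();
    }
    return 0;
}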

10. 10 million strings, some of which are duplicates: remove all the duplicates and keep the strings without duplicates. How would you design and implement this?

Scheme 1: a trie tree is well suited to this problem; a hash_map also works.

Scheme 2 (from xjbzju): at a scale of 10 million, element-by-element insert operations are completely unrealistic. I previously tried inserting 1 million elements into an STL set and the speed was unbearable, and I suspect a hash-based implementation is not much better than the red-black tree. Using vector + sort + unique is much more feasible; it is recommended to hash the data into small files first, process them separately, and then combine the results. (A rough sketch of this scheme follows the questions below.)

Reader xjbzju's method in scheme 2 reminds me of some questions, namely: how do set/map compare with hash_set/hash_map in performance? Three questions in total, as follows:

1. For tens of millions of items, is hash_set's insert really better than set's? Are the practical figures given in this blog post, http://t.cn/zoibp7t, reliable?
2. How do map and hash_map compare in performance? Has anyone run the experiment?

3. For query (lookup) operations, is the claim in the following passage accurate?


For small amounts of data use map, since it is quick to construct; for large amounts of data use hash_map?
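
As a concrete point of reference for xjbzju's scheme 2 in question 10 above, here is a rough C++ sketch of deduplicating strings with vector + sort + unique next to the set-insertion alternative. No timing numbers are claimed here; this only shows the shape of the two approaches so they can be benchmarked on real data.

#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Deduplicate by sorting, then erasing adjacent duplicates.
std::vector<std::string> dedupBySort(std::vector<std::string> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Deduplicate by inserting every element into an rb-tree based set.
std::vector<std::string> dedupBySet(const std::vector<std::string>& v) {
    std::set<std::string> s(v.begin(), v.end());
    return std::vector<std::string>(s.begin(), s.end());
}

int main() {
    std::vector<std::string> data = {"b", "a", "b", "c", "a"};
    for (const auto& s : dedupBySort(data)) std::cout << s << ' ';   // a b c
    std::cout << '\n';
    for (const auto& s : dedupBySet(data))  std::cout << s << ' ';   // a b c
    std::cout << '\n';
    return 0;
}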

Rbtree PK Hashtable

According to the red-black tree vs. hash table performance tests run by my friend № Bang Cat №: when the keys are basically of int type, the hash table is 3-4 times faster than the rb-tree, but the hash table generally wastes about half of its memory.

This is because what the hash table does per operation is essentially one modulo, while the rb-tree has to do many comparisons: the rb-tree must examine the data's value, and every node carries more than 3 extra pointers (or offsets); and if extra features are needed, for example counting the number of keys within a range, a count member has to be added as well. In one second an rb-tree can perform roughly 500,000+ insertions, a hash table roughly 2,000,000. But in many cases that speed is perfectly tolerable; for example, building an inverted index runs at about that speed, single-threaded, and the posting lists of an inverted table are not overly long. Because tree-based implementations are not that much slower than a hashtable, database indexes generally use B/B+ trees, and the B+ tree is also disk-friendly (a B-tree effectively reduces its own height and hence the number of disk accesses). The very popular NoSQL database MongoDB, for example, also uses a B-tree index. For the B-tree family, see this blog's article "From B-tree, B+ tree, B* tree to R-tree".

OK, further experimental evidence will have to wait. Next, let us look at the second method: double-layer bucket division.


Key two: double-layer bucket division

Double-layer bucket division is, in essence, still the idea of divide and conquer, with the emphasis on the technique of "dividing".
Scope of application: finding the k-th largest element, the median, or non-repeating/repeating numbers.
Basic principle and key points: because the range of the elements is too large for a direct addressing table, the range is narrowed down step by step through multiple rounds of division until it finally falls within an acceptable size. The narrowing can be repeated several times; two layers is just an example.
Extensions:
Problem examples:

11. Find the non-repeating integers among 250 million integers; memory is not large enough to hold all 250 million of them.
This is a bit like the pigeonhole principle. There are 2^32 possible integers in total, so we can divide these 2^32 numbers into 2^8 regions (for example, one file per region), scatter the data into the different regions, and then solve each region directly with a bitmap. In other words, as long as there is enough disk space, the problem can be solved quite easily.
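
Here is a rough sketch of the bitmap step for one region. One region here covers 2^24 consecutive values (2^32 split into 2^8 regions), and 2 bits are spent per value (00 = unseen, 01 = seen once, 1x = seen more than once) so that the numbers appearing exactly once can be read off afterwards; the 2-bit refinement and the in-memory sample standing in for the region file are assumptions added for illustration, not spelled out in the original text.

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const uint32_t regionBase = 0;               // first value covered by this region (assumed)
    const uint32_t regionSize = 1u << 24;        // 2^24 values per region
    std::vector<uint8_t> bits(regionSize / 4);   // 2 bits per value -> 4 values per byte (4 MB)

    // In place of streaming the region's file, count a tiny in-memory sample.
    for (uint32_t v : {5u, 9u, 5u, 123u}) {
        uint32_t off   = v - regionBase;
        uint32_t byte  = off / 4;
        uint32_t shift = (off % 4) * 2;
        uint8_t  state = (bits[byte] >> shift) & 3u;
        if (state < 2)                           // saturate at 2 ("seen more than once")
            bits[byte] = static_cast<uint8_t>((bits[byte] & ~(3u << shift)) | ((state + 1u) << shift));
    }

    // Emit every value whose 2-bit state is exactly 01, i.e. it appeared once.
    for (uint32_t off = 0; off < regionSize; ++off) {
        uint8_t state = (bits[off / 4] >> ((off % 4) * 2)) & 3u;
        if (state == 1)
            std::cout << regionBase + off << '\n';   // prints 9 and 123 for the sample
    }
    return 0;
}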

12. Find the median of 500 million ints.
Idea one: this example is even more obvious than the one above. First divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region the median falls into and which ranked number inside that region is exactly the median. Then, in a second scan, we only count the numbers that fall into that region.
In fact, even if they were int64 rather than int, three such rounds of division would bring the problem down to an acceptable size: divide the int64 range into 2^24 regions and determine which region holds the target number, split that region into 2^20 sub-regions and determine which sub-region holds it; the sub-region then contains only about 2^20 numbers, so a direct addressing table can be used for the final count.
Idea two (from @Green Jacket): this also needs two rounds of counting; if the data lives on disk, it has to be read twice.
The method resembles radix sort: open an int array of size 65536 and, in the first pass, tally the high 16 bits of each int32, i.e. numbers 0-65535 are counted in slot 0, numbers 65536-131071 in slot 1, and so on. This amounts to dividing each number by 65536, and an int32 divided by 65536 yields no more than 65536 distinct results, so an array of length 65536 is enough. For every number read, the corresponding array slot is incremented by 1; to handle negative numbers, add 32768 to the result before recording it in the array.
After the first pass, walk the array and accumulate the counts one by one to determine which interval the median lies in. Say it lies in interval k: then the sum of the counts over intervals 0 to k-1 is < N/2 (250 million), and the sum over intervals k+1 to 65535 is also < N/2. The second pass is similar to the first, but only considers the numbers in interval k, i.e. numbers with (x/65536) + 32768 = k, and this time tallies their low 16 bits. Using the accumulated sum from the first pass, say sum = 249 million, we now need to find the 1 millionth number (250 million - 249 million) among these low-16-bit counts. After this pass, accumulate the counts again to see which slot the sought number falls into; combining the high and low halves gives the final result.
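
Here is a rough C++ sketch of idea two, with the data in an in-memory vector standing in for the two sequential reads of the on-disk file; for simplicity it returns the lower median (the element of rank ceil(N/2)), and the helper name lowerMedian is made up for this illustration.

#include <cstdint>
#include <iostream>
#include <vector>

int32_t lowerMedian(const std::vector<int32_t>& data) {
    const uint64_t target = (data.size() + 1) / 2;        // 1-indexed rank of the lower median
    const int64_t  bias   = 2147483648LL;                 // shifts the int32 range onto 0..2^32-1

    // Pass 1: count how many values fall into each of the 65536 high-16-bit buckets.
    std::vector<uint64_t> high(65536, 0);
    for (int32_t x : data)
        ++high[(x + bias) >> 16];

    uint64_t before = 0;                                  // values in buckets 0..k-1
    int64_t  k = 0;
    while (before + high[k] < target)                     // find the bucket holding the median
        before += high[k++];

    // Pass 2: re-read the data, tallying low 16 bits only for values in bucket k.
    std::vector<uint64_t> low(65536, 0);
    for (int32_t x : data)
        if ((x + bias) >> 16 == k)
            ++low[(x + bias) & 0xFFFF];

    uint64_t need = target - before;                      // rank of the median inside bucket k
    int64_t  m = 0;
    while (low[m] < need)
        need -= low[m++];

    return static_cast<int32_t>(k * 65536 + m - bias);    // reassemble the high and low halves
}

int main() {
    std::vector<int32_t> data = {7, -3, 5, 12, -3, 9};    // sorted: -3 -3 5 7 9 12
    std::cout << lowerMedian(data) << '\n';               // prints 5, the lower median
    return 0;
}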
