Summary of common massive data processing interview questions and methods

Source: Internet
Author: User
Tags: comparison, hash, deduplication, sort

1. Massive log data: extract the IP that visited Baidu the most times on a given day.

This problem was mentioned in one of my earlier articles on algorithms. The scheme there was: the number of IPs is still limited, at most 2^32, so you can consider using a hash to put the IPs directly into memory and then count them.

In more detail: first, take this day's log, pull out the IPs that visited Baidu, and write them to a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. We can again use a mapping method, for example taking each IP modulo 1000, to split the large file into 1000 small files; then for each small file find its most frequent IP and that IP's frequency (a hash_map can be used for the frequency counting). Finally, among these 1000 candidate IPs, pick the one with the highest frequency; that is the answer.
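
As a rough illustration of this partition-then-count idea, here is a minimal Python sketch. It assumes the day's log has already been reduced to one IP per line; the file name, the temporary directory, and the 1000-way split are arbitrary choices for the sketch, not part of the original problem.

```python
from collections import Counter
import os

def top_ip(log_path, parts=1000, tmp_dir="ip_parts"):
    """Split a huge one-IP-per-line log into `parts` buckets by hash,
    then find the most frequent IP per bucket and overall."""
    os.makedirs(tmp_dir, exist_ok=True)
    files = [open(os.path.join(tmp_dir, f"part_{i}"), "w") for i in range(parts)]
    with open(log_path) as log:
        for line in log:
            ip = line.strip()
            if ip:
                files[hash(ip) % parts].write(ip + "\n")
    for f in files:
        f.close()

    best_ip, best_count = None, 0
    for i in range(parts):
        with open(os.path.join(tmp_dir, f"part_{i}")) as f:
            counts = Counter(line.strip() for line in f)  # one bucket fits in memory
        if counts:
            ip, cnt = counts.most_common(1)[0]
            if cnt > best_count:
                best_ip, best_count = ip, cnt
    return best_ip, best_count
```

Because each bucket holds only the IPs sharing one hash remainder, every distinct IP is counted entirely inside a single small file, so the per-bucket winners are enough to determine the global winner.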

2. A search engine records, via log files, every query string a user submits; each query string is 1-255 bytes long.

Assume there are currently 10 million records (these query strings are highly repetitive: although the total is 10 million, there are no more than 3 million distinct strings after removing duplicates. The more repeated a query string is, the more users queried it and the more popular it is). Count the 10 most popular query strings, using no more than 1 GB of memory.

This is the typical top-K algorithm, which is also described in that article. The final algorithm there is: first, preprocess the batch of data and use a hash table to count frequencies in O(N) time; second, use a heap to find the top K. With a heap, finding and adjusting/moving takes logarithmic time, so maintain a min-heap of size K (here K = 10) and traverse the 3 million distinct queries, comparing each against the root element. The final time complexity is O(N) + N'*O(log K), where N is 10 million and N' is 3 million. For more details, please refer to the original article.

Alternatively: use a trie, storing in each key's field the number of times that query string appears (0 if it never appears). Finally, use a min-heap of 10 elements to rank the occurrence frequencies.
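
A small sketch of the first scheme above (hash counting followed by a size-10 min-heap), using Python's heapq as the min-heap; the sample input in the comment is hypothetical.

```python
import heapq
from collections import Counter

def top_k_queries(queries, k=10):
    """O(N) counting with a hash map, then an O(N' log k) scan
    with a size-k min-heap over the distinct queries."""
    freq = Counter(queries)            # N = total records (e.g. 10 million)
    heap = []                          # min-heap of (count, query), size <= k
    for q, c in freq.items():          # N' = distinct queries (e.g. 3 million)
        if len(heap) < k:
            heapq.heappush(heap, (c, q))
        elif c > heap[0][0]:           # beats the smallest of the current top k
            heapq.heapreplace(heap, (c, q))
    return sorted(heap, reverse=True)  # most popular first

# Hypothetical usage:
# print(top_k_queries(["a", "b", "a", "c", "a", "b"], k=2))  # [(3, 'a'), (2, 'b')]
```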

3. There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.

Scheme: read the file sequentially; for each word x, take hash(x) % 5000 and, according to that value, write x into one of 5000 small files (denoted x0, x1, ..., x4999). Each file will then be about 200 KB.

If one of the files exceeds 1 MB, keep splitting it in the same way until the resulting small files are each under 1 MB. For each small file, count the words it contains and their frequencies (a trie or hash_map can be used), take out the 100 most frequent words (a min-heap of 100 nodes can be used), and write those 100 words and their frequencies to a file; this yields 5000 such files. The last step is to merge these 5000 files, in a process similar to merge sort.

4. There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat across files. You are asked to sort the queries by frequency.

This is again a typical top-K/frequency problem. Scheme 1: read the 10 files sequentially and write each query to one of another 10 files according to hash(query) % 10. Each newly generated file is about 1 GB in size (assuming the hash function is roughly uniform). Find a machine with about 2 GB of memory and use a hash_map(query, query_count) to count the occurrences of each query in each file, then sort by occurrence count using quicksort / heapsort / merge sort and output the sorted queries and their counts to a file. This gives 10 sorted files.

Finally, merge these 10 sorted files (an external merge sort).
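
As a sketch of that external merge step, assume each of the 10 sorted files stores one "count<TAB>query" pair per line, already ordered by count in descending order (this file format is an assumption made for the example). heapq.merge keeps only one line per file in memory at a time:

```python
import heapq

def merge_sorted_runs(run_paths, out_path):
    """External merge of sorted run files. Each run file holds lines of the
    form "<count>\t<query>", already sorted by count in descending order."""
    files = [open(p) for p in run_paths]
    try:
        merged = heapq.merge(
            *files,
            key=lambda line: int(line.split("\t", 1)[0]),
            reverse=True,                      # highest count first
        )
        with open(out_path, "w") as out:
            for line in merged:
                out.write(line)
    finally:
        for f in files:
            f.close()
```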

Scheme 2: in general the total number of distinct queries is limited; it is only the repetition count that is high, so it may be possible to fit all distinct queries into memory at once. In that case we can use a trie or hash_map to count the occurrences of each query directly, and then sort by occurrence count with quicksort / heapsort / merge sort.

Scheme 3: similar to Scheme 1, but after hashing into multiple files, hand the files to multiple machines to process with a distributed framework (such as MapReduce), and merge the results at the end.

5. Given two files a and b, each storing 5 billion URLs, with each URL occupying 64 bytes and a memory limit of 4 GB, find the URLs common to files a and b.

Scheme 1: the size of each file can be estimated as 5 billion x 64 bytes = 320 GB, which is far larger than the 4 GB memory limit, so it is impossible to load a file fully into memory for processing. Consider a divide-and-conquer approach.

Traverse file a, compute hash(url) % 1000 for each URL, and store the URL into one of 1000 small files (denoted a0, a1, ..., a999) according to the result. Each small file is then about 300 MB.

Traverse file b and store its URLs into 1000 small files (b0, b1, ..., b999) in the same way. After this processing, all URLs that could possibly be identical end up in corresponding pairs of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding small files cannot share a URL. So we only need to find the common URLs within each of the 1000 pairs of small files.

To find the common URLs of each pair of small files, store the URLs of one small file in a hash_set, then traverse each URL of the other small file and check whether it is in the hash_set just built; if it is, it is a common URL and can be written to the output file.
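
A compact sketch of Scheme 1, with hypothetical file and directory names; the point is that each of the 1000 bucket pairs is small enough for its a-side to be held in an in-memory set:

```python
import os

def partition(path, prefix, parts=1000, tmp_dir="url_parts"):
    """Scatter one URL per line into `parts` small files by hash(url) % parts."""
    os.makedirs(tmp_dir, exist_ok=True)
    outs = [open(os.path.join(tmp_dir, f"{prefix}{i}"), "w") for i in range(parts)]
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url:
                outs[hash(url) % parts].write(url + "\n")
    for o in outs:
        o.close()

def common_urls(file_a, file_b, parts=1000, tmp_dir="url_parts", out="common.txt"):
    partition(file_a, "a", parts, tmp_dir)
    partition(file_b, "b", parts, tmp_dir)
    with open(out, "w") as result:
        for i in range(parts):                          # compare a_i with b_i only
            with open(os.path.join(tmp_dir, f"a{i}")) as fa:
                seen = set(line.strip() for line in fa)  # one small file fits in memory
            with open(os.path.join(tmp_dir, f"b{i}")) as fb:
                for line in fb:
                    if line.strip() in seen:
                        result.write(line)
```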

Scheme 2: if a certain error rate is allowed, a Bloom filter can be used; 4 GB of memory can represent roughly 34 billion bits. Map the URLs of one file onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one by one and test each against the Bloom filter; if it matches, it should be a common URL (note that there will be a certain false-positive rate).

The Bloom filter will be described in more detail later in this article.

6. Find the integers that do not repeat among 250 million integers; note that memory is not sufficient to hold all 250 million integers.

Scheme 1: use a 2-bitmap (allocate 2 bits per number: 00 means not present, 01 means present once, 10 means present multiple times, 11 is unused). This takes 2^32 x 2 bits = 1 GB of memory, which is acceptable. Then scan the 250 million integers and update the corresponding bits in the bitmap: 00 changes to 01, 01 changes to 10, and 10 stays unchanged. After the scan, output the integers whose corresponding bits are 01.
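
A sketch of this 2-bitmap, packing four 2-bit counters into each byte. Sized for the full 2^32 universe it is exactly the 1 GB table described above; the commented demo shrinks the universe so it runs instantly:

```python
def distinct_once(numbers, universe=2**32):
    """2-bitmap: 2 bits per value, 00 = unseen, 01 = seen once, 10 = seen more.
    For the full 32-bit range this table is 2^32 * 2 bits = 1 GB."""
    table = bytearray(universe // 4)          # 4 values per byte (2 bits each)
    for x in numbers:
        byte, shift = divmod(x, 4)
        state = (table[byte] >> (shift * 2)) & 0b11
        if state < 2:                         # 00 -> 01, 01 -> 10, 10 stays 10
            table[byte] = (table[byte] & ~(0b11 << (shift * 2))) | ((state + 1) << (shift * 2))
    return [x for x in range(universe)
            if (table[x // 4] >> ((x % 4) * 2)) & 0b11 == 1]

# Hypothetical small-range demo (universe shrunk so it runs instantly):
# print(distinct_once([1, 2, 2, 5, 7, 7, 7], universe=16))   # [1, 5]
```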

Scheme 2: alternatively, use a method similar to question 1 to split the data into small files, then find the non-repeating integers within each small file and sort them, and finally merge the files, taking care to remove duplicate elements.

7. Tencent interview question: given 4 billion distinct (non-repeating) unsigned int integers, not sorted, and then another number, how do you quickly determine whether that number is among the 4 billion?

Similar to question 6, my first reaction was quicksort plus binary search. Here are some better approaches. Scheme 1: request 512 MB of memory and let each bit represent one unsigned int value. Read the 4 billion numbers and set the corresponding bits; then read the number to be queried and check whether its bit is 1: 1 means it is present, 0 means it is not.
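
A minimal bitset along the lines of Scheme 1: 2^32 bits is the 512 MB buffer mentioned above, and the source of the 4 billion numbers (numbers() below) is left as a hypothetical stream from disk:

```python
class BitSet:
    """Scheme 1: one bit per unsigned int value; 2^32 bits = 512 MB."""
    def __init__(self, nbits=2**32):
        self.bits = bytearray(nbits // 8)

    def add(self, x):
        self.bits[x >> 3] |= 1 << (x & 7)

    def __contains__(self, x):
        return bool(self.bits[x >> 3] & (1 << (x & 7)))

# Sketch of use (numbers() would stream the 4 billion values from disk):
# present = BitSet()
# for n in numbers():
#     present.add(n)
# print(1234567890 in present)
```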

Dizengrong's Scheme 2: this problem is described very well in "Programming Pearls"; we can follow its idea. Since 2^32 is more than 4 billion, a given number may or may not be present. We view each of the 4 billion numbers as a 32-bit binary value and assume the 4 billion numbers start out in a single file.

Split the 4 billion numbers into two categories: (1) highest bit is 0, and (2) highest bit is 1, and write the two categories to two files, one containing at most 2 billion numbers and the other at least 2 billion (this is effectively a binary split). Compare the highest bit of the number you are looking for with this split, then continue searching in the corresponding file.

Then split that file into two categories again: (1) the second-highest bit is 0, and (2) the second-highest bit is 1.

Write the two categories to two files, one containing at most 1 billion numbers and the other at least 1 billion (again effectively a binary split), compare the second-highest bit of the number you are looking for, and continue in the corresponding file. Repeating this process, the number can be found (or shown to be absent), with time complexity O(log n). This completes Scheme 2.
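
To make the bit-by-bit narrowing concrete, here is a toy version that uses in-memory lists where the scheme above uses intermediate files; the splitting logic is the same, so this illustrates the idea rather than being an external-memory implementation:

```python
def present(numbers, target, bit=31):
    """Keep only the half whose current bit matches the target's bit,
    then move to the next lower bit; whatever survives equals the target."""
    candidates = numbers
    while bit >= 0 and candidates:
        want = (target >> bit) & 1
        candidates = [x for x in candidates if (x >> bit) & 1 == want]
        bit -= 1
    return bool(candidates)

# print(present([5, 9, 17, 42], 17))   # True
# print(present([5, 9, 17, 42], 18))   # False
```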

Appendix: a brief note on the bitmap method. Using the bitmap method to determine whether an array contains duplicates is a common programming task. When the data set is large, we usually want to scan it as few times as possible, so the double-loop method is not advisable.

The bitmap method suits this case. It works by creating a new array of length max+1, where max is the largest element in the collection, and then scanning the original array once; whenever a value is encountered, the corresponding position in the new array is set to 1. For example, encountering a 5 sets the sixth element of the new array to 1; the next time a 5 is encountered, the sixth element is found to be 1 already, which shows that this value duplicates an earlier one. Because the new array is initialized to 0 in a way reminiscent of a bitmap, this is called the bitmap method. Its worst case is 2N operations. If the maximum value is known in advance and the new array can be sized up front, the efficiency can be doubled.

8. How do you find the item that repeats the most times in massive data?

Scheme 1: first hash the data, then map it by modulus into small files; find the most repeated item in each small file and record its count. Then, among the candidates obtained in the previous step, find the one with the largest count (refer to the previous questions for details).

9. Tens of millions or hundreds of millions of data items (with duplicates): find the N items that occur the most times.

Scheme 1: for tens of millions or hundreds of millions of items, the memory of today's machines should be able to hold them. So consider using a hash_map / binary search tree / red-black tree, etc. to count the occurrences. Then extract the N items with the most occurrences, which can be done with the heap mechanism mentioned in question 2.

10. A text file of about 10,000 lines, one word per line: count the 10 most frequent words, give your approach, and analyze the time complexity.

Scheme 1: this question is about time efficiency. Count the number of occurrences of each word with a trie; the time complexity is O(n*le) (where le denotes the average word length). Then find the 10 most frequent words, which can be done with a heap as mentioned in the previous questions, with time complexity O(n*lg10). So the total time complexity is the larger of O(n*le) and O(n*lg10).
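
A sketch of this trie-plus-heap idea. The trie is represented as nested dicts with a '#count' sentinel key, which is just one convenient encoding; heapq.nlargest plays the role of the 10-element heap:

```python
import heapq

def count_with_trie(words):
    """Count word frequencies in a trie; inserting one word costs O(le),
    le being its length, so counting is O(n * le) overall."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["#count"] = node.get("#count", 0) + 1   # '#count' marks end of word
    # walk the trie and collect (word, count) pairs
    counts, stack = [], [("", root)]
    while stack:
        prefix, node = stack.pop()
        for key, child in node.items():
            if key == "#count":
                counts.append((prefix, child))
            else:
                stack.append((prefix + key, child))
    return counts

def top10(words):
    counts = count_with_trie(words)
    return heapq.nlargest(10, counts, key=lambda wc: wc[1])   # heap step, O(n * lg 10)
```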

Appendix: find the largest 100 numbers among 1,000,000 numbers.

Scheme 1: as mentioned in the previous questions, use a min-heap of 100 elements. The complexity is O(1,000,000 * lg 100).

Scheme 2: use the idea of quicksort. After each partition, only consider the part larger than the pivot; once the part larger than the pivot has just over 100 elements, sort it with a conventional sorting algorithm and take the first 100. The complexity is O(1,000,000 * 100).

Scheme 3: use local elimination. Take the first 100 elements and sort them, calling the result sequence L. Then scan the remaining elements one by one; compare each element x with the smallest element of the sorted 100, and if x is larger, delete the smallest element and insert x into L using the idea of insertion sort. Repeat until all the elements have been scanned. The complexity is O(1,000,000 * 100).
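
A sketch of Scheme 1 for this appendix problem, using heapq as the 100-element min-heap; the commented check is only meant for small hypothetical test data:

```python
import heapq

def top100(stream):
    """Scheme 1: keep a min-heap of the 100 largest values seen so far;
    each of the 1,000,000 numbers costs at most O(lg 100) to process."""
    heap = []
    for x in stream:
        if len(heap) < 100:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)   # pop the current minimum, push x
    return sorted(heap, reverse=True)

# Hypothetical check on small data:
# data = list(range(1_000_000))
# assert top100(data) == list(range(999_999, 999_899, -1))
```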

Part Two: A Big Summary of Ten Methods for Processing Massive Data

OK, after seeing so many interview questions above, are you feeling a little dizzy? Yes, a summary is needed. Next, this article briefly summarizes some common methods for dealing with massive data problems.

The following methods all come from the blog http://hi.baidu.com/yanxionglu/blog/blog, which gives a general summary of methods for processing large amounts of data. Of course, these methods may not completely cover every problem, but they can handle most of the problems you will encounter. The following questions are basically taken directly from companies' written interview tests, and the methods are not necessarily optimal; if you have a better approach, you are welcome to discuss it.

One, Bloom filter

Scope of application: can be used to implement a data dictionary, to deduplicate data, or to compute set intersections.

Basic principles and key points:

The principle is simple: a bit array plus k independent hash functions. To insert a keyword, set the bits at the positions given by the hash functions to 1; to look up a keyword, if all the bits at the positions given by the hash functions are 1, the keyword is (probably) present. Clearly this process does not guarantee that the lookup result is 100% correct. It also does not support deleting an inserted keyword, because the bits of that keyword may be shared by other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and can then support deletion.

There is a more important question: how to determine the size m of the bit array and the number k of hash functions from the number n of input elements. The error rate is minimized when the number of hash functions is k = (ln 2) * (m/n). For the error rate to be no greater than E, m must be at least n*log2(1/E) to represent any set of n elements. But m should be even larger, because at least half of the bit array should remain 0; in that case m should be >= n*log2(1/E)*log2(e), which is about 1.44 times n*log2(1/E) (log2 denotes the base-2 logarithm).

For example, if we assume an error rate of 0.01, then m should be about 13 times n, and k about 8.

Note that the units of m and n differ: m is in bits, while n is a number of elements (more precisely, the number of distinct elements). Since a single element is usually many bits long, using a Bloom filter usually saves memory.
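
A minimal Bloom filter sketch sized with the formulas above (k = (ln 2)*(m/n), m ≈ 1.44 * n * log2(1/E); note that the worked example in the text uses a somewhat more generous m). The k hash functions are simulated by double hashing over a single SHA-256 digest, which is a common trick chosen for the sketch rather than anything prescribed here:

```python
import math
import hashlib

class BloomFilter:
    """Minimal sketch: m-bit array plus k hash positions per item."""
    def __init__(self, n, error_rate=0.01):
        # m >= n * log2(1/E) * log2(e), i.e. about 1.44 * n * log2(1/E)
        self.m = int(math.ceil(1.44 * n * math.log2(1 / error_rate)))
        self.k = max(1, int(round(math.log(2) * self.m / n)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1      # force h2 to be odd
        for i in range(self.k):                           # double hashing: h1 + i*h2
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))

# Hypothetical usage:
# bf = BloomFilter(n=1000, error_rate=0.01)
# bf.add("http://example.com")
# print("http://example.com" in bf)   # True (no false negatives)
```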

Extended:

The Bloom filter maps the elements of a set into a bit array and uses whether all k mapped bits are 1 (k being the number of hash functions) to decide whether an element is in the set. The Counting Bloom Filter (CBF) expands each bit of the bit array into a counter, which makes element deletion possible. The Spectral Bloom Filter (SBF) associates the counters with the occurrence counts of set elements; SBF uses the minimum value among an element's counters to approximate how frequently the element appears.

Problem example: given two files a and b, each storing 5 billion URLs, with each URL occupying 64 bytes and a memory limit of 4 GB, find the URLs common to files a and b. What if there are three, or even n, files?

Based on this problem, let us compute the memory usage: 4 GB = 2^32 bytes, which is about 4 billion bytes, or about 34 billion bits; n = 5 billion, and if an error rate of 0.01 is required, about 65 billion bits are needed. With 34 billion bits available, the shortfall is not huge, though it may raise the error rate somewhat. In addition, if these URLs correspond one-to-one with IPs, they can be converted to IPs, which makes things much simpler.

Two, Hashing

Scope of application: a basic data structure for fast lookup and deletion; usually the total amount of data should fit in memory.

Basic principles and key points:

Choice of hash function: for strings, integers, permutations, and so on, there are specific corresponding hash methods.

Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.

Extended:

The d in d-left hashing stands for multiple. Let us first simplify the problem and look at 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, called T1 and T2, and equips T1 and T2 with separate hash functions, h1 and h2. When a new key is stored, both hash functions are computed, yielding two addresses h1[key] and h2[key]. Check position h1[key] in T1 and position h2[key] in T2 to see which location already stores more keys (has had more collisions), and store the new key in the less loaded location. If both sides are equally loaded, for example both positions are empty or both already store one key, the new key is stored in the left table T1; this is where the "2-left" comes from. When looking up a key, you must hash twice and check both positions.
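
A toy 2-left hash table illustrating these insert and lookup rules; the two hash functions are simulated with Python's built-in hash over tagged tuples, an arbitrary choice for the sketch:

```python
class TwoLeftHash:
    """Sketch of 2-left hashing: two equal-size sub-tables T1 and T2 with their
    own hash functions; a new key goes to whichever bucket is less loaded,
    with ties broken toward the left table T1."""
    def __init__(self, buckets_per_side=1024):
        self.size = buckets_per_side
        self.t1 = [[] for _ in range(buckets_per_side)]
        self.t2 = [[] for _ in range(buckets_per_side)]

    def _h1(self, key):
        return hash(("left", key)) % self.size

    def _h2(self, key):
        return hash(("right", key)) % self.size

    def insert(self, key):
        b1, b2 = self.t1[self._h1(key)], self.t2[self._h2(key)]
        # less-loaded bucket wins; ties (including both empty) go left
        (b1 if len(b1) <= len(b2) else b2).append(key)

    def lookup(self, key):
        # both candidate buckets must be checked
        return key in self.t1[self._h1(key)] or key in self.t2[self._h2(key)]
```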

Problem Example:

1) Massive log data: extract the IP that visited Baidu the most times on a given day.

The number of IPs is still limited, at most 2^32, so you can consider using a hash to put the IPs directly into memory and then count them.

Three, Bit-map

Scope of application: data can be quickly looked up, deduplicated, or deleted; generally applicable when the data range is within about 10 times the range of int.

Basic principles and key points: use a bit array to indicate whether certain elements exist, for example the set of 8-digit phone numbers.

Extension: the Bloom filter can be seen as an extension of the bit-map.

Problem Example:

1) A file is known to contain some phone numbers, each 8 digits long; count the number of distinct numbers.

8 digits means at most 99,999,999 values, which requires about 99M bits, or roughly a dozen MB of memory.

2) Find the number of distinct integers among 250 million integers, when memory is not sufficient to hold all 250 million integers.

Extend the bit-map and use 2 bits per number: 0 means not present, 1 means seen once, 2 means seen two or more times. Alternatively, instead of 2 bits, we can simulate this 2-bit-map with two ordinary bit-maps.
