Contents

 Preface
 Part 1: 17 interview questions on massive data processing
 Part 2: Bitmap for Massive Data Processing
Author: Xiaoqiao Journal, redfox66, and July.
Preface
This blog previously collected 10 questions on massive data processing (see "Ten questions about massive data processing and a summary of ten methods"). This article repeats those 10 questions and adds 7 more. It is offered for reference only.
At the same time, work on the Programmer Programming Art series will resume. Some of the questions in the chapters after Chapter 1 will be drawn from the 17 interview questions on massive data processing below, because we feel that each of them is worth rethinking and studying again. The first ten chapters of the Programming Art series came about in the same way. If you have any questions or suggestions, please leave a comment. Thank you.
Part 1: 17 interview questions on massive data processing
1. Given two files a and b, each storing 5 billion URLs, where each URL occupies 64 bytes and the memory limit is 4 GB, find the URLs common to both files.
Solution 1: The size of each file can be estimated as 5 billion × 64 bytes = 320 GB, far greater than the 4 GB memory limit, so the files cannot be loaded into memory in full. Consider a divide-and-conquer approach.
 Traverse file a, compute hash(url) % 1000 for each URL, and store the URL in one of 1000 small files (denoted a0, a1, ..., a999) according to that value. Each small file is then about 320 MB.
 Traverse file b and store its URLs in 1000 small files (denoted b0, b1, ..., b999) in the same way as for a. After this step, any URLs that might be equal are in corresponding small files (a0 and b0, a1 and b1, ..., a999 and b999), and non-corresponding small files cannot contain the same URL. We then only need to find the common URLs within each of the 1000 pairs of small files.
 To find the common URLs in a pair of small files, store the URLs of one small file in a hash_set. Then traverse each URL of the other small file and check whether it is in the hash_set just built. If it is, it is a common URL and is written to the output file.
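The steps above can be sketched in memory as follows. This is a minimal illustration, not the authors' implementation: each bucket stands in for one small file on disk, and the names `partitionByHash` and `commonUrls` are hypothetical. `std::unordered_set` plays the role of hash_set.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Split urls into `parts` buckets by hash(url) % parts.
// Each bucket stands in for one small file.
std::vector<std::vector<std::string>> partitionByHash(
    const std::vector<std::string>& urls, std::size_t parts) {
    std::vector<std::vector<std::string>> buckets(parts);
    std::hash<std::string> hasher;
    for (const std::string& url : urls) {
        buckets[hasher(url) % parts].push_back(url);
    }
    return buckets;
}

// Find URLs present in both inputs by comparing only buckets with the same index.
std::vector<std::string> commonUrls(const std::vector<std::string>& a,
                                    const std::vector<std::string>& b,
                                    std::size_t parts) {
    std::vector<std::vector<std::string>> bucketsA = partitionByHash(a, parts);
    std::vector<std::vector<std::string>> bucketsB = partitionByHash(b, parts);
    std::vector<std::string> result;
    for (std::size_t i = 0; i < parts; ++i) {
        std::unordered_set<std::string> seen(bucketsA[i].begin(), bucketsA[i].end());
        for (const std::string& url : bucketsB[i]) {
            // erase returns 1 only the first time, so each common URL is reported once
            if (seen.erase(url) == 1) result.push_back(url);
        }
    }
    return result;
}
```

Because equal URLs always hash to the same bucket index, only bucket i of a ever needs to be compared with bucket i of b, which is what keeps the working set within the memory limit.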
Solution 2: If a certain error rate is acceptable, a Bloom filter can be used. 4 GB of memory holds approximately 34 billion bits. Map the URLs of one file onto these 34 billion bits with a Bloom filter, then read the URLs of the other file one by one and check each against the Bloom filter. If the filter reports a match, the URL is taken to be a common URL (note that there is a certain error rate).
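A Bloom filter of this kind might look like the sketch below. The class name `BloomFilter`, the double-hashing scheme, and the "#salt" suffix used to derive a second hash are illustrative assumptions; a real deployment would choose the bit count and number of hash functions from the target error rate.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
public:
    // numBits must be greater than 0.
    BloomFilter(std::size_t numBits, std::size_t numHashes)
        : bits_(numBits, false), numHashes_(numHashes) {}

    void add(const std::string& key) {
        for (std::size_t i = 0; i < numHashes_; ++i) bits_[position(key, i)] = true;
    }

    // false: the key was definitely never added.
    // true: the key was probably added (false positives are possible).
    bool mightContain(const std::string& key) const {
        for (std::size_t i = 0; i < numHashes_; ++i) {
            if (!bits_[position(key, i)]) return false;
        }
        return true;
    }

private:
    // Double hashing: position_i = (h1 + i * h2) mod numBits.
    // Deriving h2 from the key plus a fixed suffix is an illustrative choice.
    std::size_t position(const std::string& key, std::size_t i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#salt") | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t numHashes_;
};
```

For the URL problem, every URL of one file would be passed to `add`, and every URL of the other file to `mightContain`.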
Reader feedback from @Crowgns:
 After hashing, check the size of each file. If the hash is unbalanced and some files are too large, hash those files again with a different hash algorithm, and keep splitting until no large files remain. A file label can then look like A1-2 (the first hash number is 1; the file was too large, so it went through a second hash and received number 2).
 Because of the point above, if a large file appears after the first hash, the hash_set method cannot be applied directly. It is better to sort each file in natural string order first. Files with the same hash number (for example both are 1-3, as opposed to a being 1 while b is 1-1 and 1-2) can then be compared directly from start to end. Where the levels differ, for example a has 1 and b has 1-1, 1-2-1, and 1-2-2, the file at the shallowest level must be compared with every file at the deeper levels in order to confirm each common URI.
2. There are 10 files of 1 GB each. Every line of every file stores a user query, and queries may repeat within and across files. Sort the queries by frequency.
Solution 1:
 Read the 10 files in sequence and write each query to one of 10 other files according to hash(query) % 10. Each newly generated file is then about 1 GB (assuming the hash function is random).
 Find a machine with around 2 GB of memory and, for each of these files in turn, use hash_map(query, query_count) to count how many times each query appears. Sort by the number of occurrences using quicksort, heapsort, or merge sort, and write the sorted queries with their query_count to a file. This yields 10 sorted files.
 Merge-sort the 10 sorted files (a combination of internal and external sorting).
Solution 2:
Generally, the total number of distinct queries is limited and only the number of repetitions is large, so all distinct queries may fit in memory at once. In that case we can use a trie, a hash_map, or a similar structure to count the occurrences of each query directly, and then sort by the number of occurrences using quicksort, heapsort, or merge sort.
(Reader feedback: regarding the second step of Solution 1, "Find a machine with around 2 GB of memory and use hash_map(query, query_count) to count how many times each query appears", one reader argued that because queries repeat, a hash_multimap should be used, since hash_map does not allow duplicate keys. @Hywangw: that claim must be wrong. hash_map(query, query_count) is used to count the occurrences of each query, not to store the duplicate values. With a multimap, how would you add 1 to the count each time? Thanks to Hywangw.)
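The point made in the feedback, that a plain map from query to count is the right structure, can be illustrated with a short sketch. The function name `sortByFrequency` is hypothetical, `std::unordered_map` stands in for hash_map, and breaking ties alphabetically is an added assumption so that the output is deterministic.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Count each query, then sort by descending count (ties broken alphabetically).
std::vector<std::pair<std::string, int>> sortByFrequency(
    const std::vector<std::string>& queries) {
    std::unordered_map<std::string, int> counts;
    for (const std::string& q : queries) {
        ++counts[q];  // a repeated key increments its existing counter
    }
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const std::pair<std::string, int>& x,
                 const std::pair<std::string, int>& y) {
                  if (x.second != y.second) return x.second > y.second;
                  return x.first < y.first;
              });
    return sorted;
}
```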
Solution 3:
Similar to Solution 1, but once the data has been hashed into multiple files, the files can be handed to multiple machines and processed with a distributed architecture (such as MapReduce), and the results merged at the end.
3. There is a 1 GB file with one word per line. No word exceeds 16 bytes, and memory is limited to 1 MB. Return the 100 most frequent words.
Solution 1: Read the file sequentially. For each word x, compute hash(x) % 5000 and store the word in one of 5000 small files according to that value. Each file is then about 200 KB. If some files exceed 1 MB, keep splitting them in the same way until no resulting file exceeds 1 MB. For each small file, count the words and their frequencies (a trie or hash_map can be used), take the 100 most frequent words (a min-heap of 100 nodes can be used), and save those 100 words and their frequencies to a file. This yields another 5000 files. The final step is to merge these 5000 files (similar to merge sort).
4. From massive log data, extract the IP address that visited Baidu the most times on a given day.
Solution 1: First, take the logs for that day, extract the IP addresses that accessed Baidu, and write them one by one to a large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct addresses. Use a mapping method, such as modulo 1000, to split the large file into 1000 small files, then find the most frequent IP address in each small file together with its frequency (a hash_map can be used for the frequency count, followed by a scan for the maximum). Finally, among these 1000 most frequent IP addresses, one per small file, pick the one with the highest frequency. That is the answer.
5. Find the non-repeated integers among 250 million integers. Memory is insufficient to hold all 250 million integers.
Solution 1: Use a 2-bit bitmap (2 bits per number: 00 means not seen, 01 means seen once, 10 means seen multiple times, and 11 is unused). The total memory needed is 2^32 × 2 bits = 1 GB, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, walk through the bitmap and output every integer whose bits are 01.
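A minimal sketch of the 2-bit bitmap follows. To keep the example small, the value range is a parameter `universe` rather than the full 2^32, and every input is assumed to be smaller than `universe`; the function name `findUnique` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the values in [0, universe) that occur exactly once in nums.
// Assumes every element of nums is smaller than universe.
std::vector<std::uint32_t> findUnique(const std::vector<std::uint32_t>& nums,
                                      std::uint32_t universe) {
    // 2 bits per value, so 4 values share one byte.
    std::vector<std::uint8_t> bitmap((static_cast<std::size_t>(universe) + 3) / 4, 0);
    for (std::uint32_t v : nums) {
        std::size_t byteIndex = v / 4;
        unsigned shift = (v % 4) * 2;
        unsigned state = (bitmap[byteIndex] >> shift) & 0x3u;
        if (state < 2) {  // 00 -> 01, 01 -> 10, 10 stays unchanged
            ++state;
            bitmap[byteIndex] = static_cast<std::uint8_t>(
                (bitmap[byteIndex] & ~(0x3u << shift)) | (state << shift));
        }
    }
    std::vector<std::uint32_t> unique;
    for (std::uint32_t v = 0; v < universe; ++v) {
        if (((bitmap[v / 4] >> ((v % 4) * 2)) & 0x3u) == 1) unique.push_back(v);
    }
    return unique;
}
```

For the full 32-bit range the bitmap would occupy the 1 GB computed above.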
Solution 2: Split the data into small files with a method similar to that of question 1. Find the non-repeated integers in each small file and sort them. Then merge the results, taking care to remove duplicate elements.
6. Massive data is distributed across 100 computers. Find an efficient way to compute the top 10 of this data.
Solution 1:
 Find the top 10 on each computer using a heap of 10 elements (use a max-heap to find the 10 smallest, and a min-heap to find the 10 largest). For example, to find the 10 largest, first build a min-heap from the first 10 elements, then scan the remaining data and compare each element with the top of the heap. If the element is larger than the heap top, replace the heap top with it and re-adjust the min-heap. The elements left in the heap at the end are the top 10.
 After finding the top 10 on each computer, combine the results from the 100 computers, 1000 items in total, and apply the same method to find the overall top 10.
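The two steps can be sketched as follows, with each inner vector standing in for the data on one machine. The function names `topK` and `topKAcrossMachines` are hypothetical, and `std::priority_queue` with `std::greater` serves as the min-heap.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Returns the k largest values of data in descending order, using a min-heap of size k.
std::vector<int> topK(const std::vector<int>& data, std::size_t k) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> minHeap;
    for (int x : data) {
        if (minHeap.size() < k) {
            minHeap.push(x);
        } else if (k > 0 && x > minHeap.top()) {
            minHeap.pop();  // drop the smallest of the current top k
            minHeap.push(x);
        }
    }
    std::vector<int> result(minHeap.size());
    for (std::size_t i = result.size(); i > 0; --i) {
        result[i - 1] = minHeap.top();  // smallest goes last
        minHeap.pop();
    }
    return result;
}

// Step 1: top k on each machine. Step 2: top k of the combined candidates.
std::vector<int> topKAcrossMachines(const std::vector<std::vector<int>>& machines,
                                    std::size_t k) {
    std::vector<int> candidates;
    for (const std::vector<int>& local : machines) {
        std::vector<int> best = topK(local, k);
        candidates.insert(candidates.end(), best.begin(), best.end());
    }
    return topK(candidates, k);
}
```

This works for the k largest values. It does not carry over to the k most frequent items, where counts for the same item are spread over several machines.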
(For more information, see "Chapter 3: Finding the k smallest numbers" and "Chapter 3 continued: Top K algorithm problems".)
Reader feedback from @Qinleopard:
With the method given for question 6, are the top 10 items on each computer guaranteed to include the 10 items with the highest overall frequency?
For example, in the first file: A (4), B (5), C (6), D (3)
In the second file: A (4), B (5), C (3), D (6)
In the third file: A (6), B (5), C (4), D (3)
If we want top(1), the method selects A, but the correct result is B.
@July: I think this reader may have misread the question. In this question, top 10 refers to the 10 largest numbers, not the 10 most frequent items. If the task were instead to find the 10 most frequent items, it would be equivalent to finding the most frequently accessed IP addresses, which is question 4 above. This is hereby clarified.
7. How do you find the item that is repeated the most times in massive data?
Solution 1: Hash first, then use the hash value modulo some number to map the data into small files. Find the most repeated item in each small file and record its number of repetitions. Then, among the items found in the previous step, pick the one with the most repetitions (for details, refer to the previous questions).
8. Given tens of millions or hundreds of millions of data records (with duplicates), find the N records that occur most frequently.
Solution 1: Tens of millions or hundreds of millions of records should fit in the memory of a current machine. So consider using a hash_map, a binary search tree, or a red-black tree to count the occurrences. Then extract the N most frequent records, which can be done with the heap mechanism mentioned in question 6.
9. There are 10 million strings, some of which are duplicates. Remove all the duplicates and keep the strings that are not repeated. How would you design and implement this?
Solution 1: A trie is the more suitable choice, and a hash_map should also work.
10. A text file contains about 10 thousand lines with one word per line. Find the 10 most frequent words. Give your approach and analyze the time complexity.
Solution 1: This question is about time efficiency. Use a trie to count how many times each word appears. The time complexity is O(n * le), where le is the average length of a word. Then find the 10 most frequent words with a heap, as in the previous questions, which takes O(n * lg 10). The total time complexity is therefore the larger of O(n * le) and O(n * lg 10).
11. Find the 10 most frequent words in a text file. This time the file is long, possibly hundreds of millions or billions of lines. In short, it cannot be read into memory all at once. Give the optimal solution.
Solution 1: First split the file into multiple small files by hashing and taking the modulus. For each small file, use the method of the previous question to find its 10 most frequent words. Then merge the results to find the 10 most frequent words overall.
12. Find the 100 largest numbers among 100 W (1 million) numbers.
 Solution 1: Use local elimination. Take the first 100 elements and sort them as sequence L. Then scan the remaining elements one at a time, comparing each element x with the smallest of the 100 sorted elements. If x is larger than that smallest element, delete the smallest element and insert x into L using the idea of insertion sort. Repeat until all elements have been scanned. The complexity is O(100 W * 100).
 Solution 2: Use the idea of quicksort. After each partition, consider only the part larger than the pivot. Once that part has shrunk to only slightly more than 100 elements, sort it with a conventional sorting algorithm and take the first 100. The complexity is O(100 W * 100).
 Solution 3: As mentioned in the earlier questions, use a min-heap of 100 elements. The complexity is O(100 W * lg 100).
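The partition idea of Solution 2 is available in the C++ standard library as `std::nth_element`, which rearranges a range around a chosen position without fully sorting it. The sketch below swaps in that library routine in place of a hand-written partition loop; the function name `largestK` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Returns the k largest values in descending order.
// The input is taken by value because nth_element reorders it.
std::vector<int> largestK(std::vector<int> data, std::size_t k) {
    if (k > data.size()) k = data.size();
    // After this call, the first k positions hold the k largest values (unordered).
    std::nth_element(data.begin(),
                     data.begin() + static_cast<std::ptrdiff_t>(k),
                     data.end(), std::greater<int>());
    data.resize(k);
    // Only the k survivors are sorted.
    std::sort(data.begin(), data.end(), std::greater<int>());
    return data;
}
```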
13. Finding popular queries:
The search engine records every search string that users submit in a log file. Each query string is 1 to 255 bytes long. Suppose there are currently 10 million records, and the query strings repeat heavily: although the total is 10 million, no more than 3 million remain after duplicates are removed. The more often a query string is repeated, the more users searched for it and the more popular it is. Find the 10 most popular query strings, using no more than 1 GB of memory.
(1) Describe your approach to this problem;
(2) give the main processing steps, the algorithm, and the complexity of the algorithm.
Solution 1: Use a trie whose key field stores the number of times the query string appears (0 if it never appears). Finally, use a min-heap of 10 elements to rank the strings by frequency.
For a detailed answer to this question, see section 3.1 of "Chapter 3 continued: Implementation of top K algorithm problems".
14. There are N machines, each holding N numbers. Each machine can store and operate on at most O(N) numbers. How do you find the median of the N^2 numbers?
Solution 1: First estimate the range of the numbers. For example, assume they are all 32-bit unsigned integers (2^32 possible values). Divide the integers from 0 to 2^32 - 1 into N range segments, each containing (2^32)/N integers: the first segment is 0 to (2^32)/N - 1, the second segment is (2^32)/N to 2(2^32)/N - 1, ..., and the N-th segment is (2^32)(N - 1)/N to 2^32 - 1. Then scan the N numbers on each machine, placing the numbers that fall in the first segment on the first machine, those in the second segment on the second machine, ..., and those in the N-th segment on the N-th machine. Note that the count of numbers stored on each machine during this process should remain O(N). Next, count the numbers on each machine in order and accumulate the counts until we reach the k-th machine, where the accumulated count through machine k is greater than or equal to (N^2)/2 while the accumulated count through machine k - 1 is less than (N^2)/2. Call the latter count x. The median we are looking for is then at position (N^2)/2 - x on machine k. Sort the numbers on machine k and take the ((N^2)/2 - x)-th number, which is the median. The complexity is O(N^2).
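The range-partitioning procedure can be simulated in a single process, as in the sketch below. Each inner vector stands in for one machine, the function name `distributedMedian` is hypothetical, and the target rank is taken as (total + 1) / 2 (the lower median), which equals (N^2)/2 when N is even. The O(N) per-machine storage bound from the text is an assumption about the data distribution that the code does not enforce.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// machines[i] holds the 32-bit unsigned numbers initially stored on machine i.
std::uint32_t distributedMedian(const std::vector<std::vector<std::uint32_t>>& machines) {
    const std::size_t n = machines.size();
    if (n == 0) return 0;
    const std::uint64_t range = 1ULL << 32;
    const std::uint64_t width = (range + n - 1) / n;  // size of each value segment

    // Redistribution: numbers in segment s are sent to machine s.
    std::vector<std::vector<std::uint32_t>> segments(n);
    std::uint64_t total = 0;
    for (const std::vector<std::uint32_t>& local : machines) {
        for (std::uint32_t v : local) {
            segments[static_cast<std::size_t>(v / width)].push_back(v);
            ++total;
        }
    }

    const std::uint64_t target = (total + 1) / 2;  // 1-based rank of the lower median
    std::uint64_t before = 0;                      // "x" in the text
    for (std::vector<std::uint32_t>& seg : segments) {
        if (before + seg.size() >= target) {
            std::sort(seg.begin(), seg.end());
            return seg[static_cast<std::size_t>(target - before - 1)];
        }
        before += seg.size();
    }
    return 0;  // reached only when there is no data at all
}
```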
Solution 2: First sort the numbers on each machine. Then, using the idea of merge sort, merge the numbers on the N machines into the final sorted order. The (N^2)/2-th number is the answer. The complexity is O(N^2 * lg N^2).
15. Maximum gap problem
Given n real numbers, find the maximum difference between two adjacent numbers on the real axis. A linear-time algorithm is required.
Solution 1: The first method that comes to mind is to sort the n numbers and then scan them once to find the maximum adjacent gap. However, this does not meet the linear-time requirement. Use the following method instead:
 Find the largest and smallest values, max and min, among the n numbers.
 Use n - 2 points to divide the interval [min, max] evenly into n - 1 intervals (closed on the left, open on the right). Treat these intervals as buckets, numbered 1 to n - 1, where the upper bound of bucket i is the lower bound of bucket i + 1, so every bucket has the same size: (max - min)/(n - 1). In effect, the bucket boundaries form an arithmetic sequence (the first term is min and the common difference is (max - min)/(n - 1)). By convention, min is placed in the first bucket and max in bucket n - 1.
 Place the n numbers in the n - 1 buckets: assign each element x[i] to the bucket whose interval contains it, and record the maximum and minimum of the values assigned to each bucket.
 Maximum gap: apart from max and min, the remaining n - 2 numbers go into n - 1 buckets, so by the pigeonhole principle at least one bucket is empty. Because all buckets have the same size, the maximum gap cannot occur inside a single bucket. It must be the gap between the largest value of some bucket i and the smallest value of some later bucket j, and every bucket between them (that is, every bucket whose number lies between i and j) must be empty. In other words, the maximum gap arises between the upper value of bucket i and the lower value of bucket j, with j >= i + 1. A single scan finds it.
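The bucket method can be sketched as follows. Buckets are numbered from 0 here rather than from 1, the function name `maxGap` is hypothetical, and the special cases for fewer than three inputs or for all-equal inputs are added assumptions, since the pigeonhole argument needs at least one interior point.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

double maxGap(const std::vector<double>& x) {
    const std::size_t n = x.size();
    if (n < 2) return 0.0;
    const auto mm = std::minmax_element(x.begin(), x.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    if (lo == hi) return 0.0;
    if (n == 2) return hi - lo;

    const std::size_t buckets = n - 1;
    const double width = (hi - lo) / static_cast<double>(buckets);
    std::vector<bool> used(buckets, false);
    std::vector<double> bMin(buckets, 0.0), bMax(buckets, 0.0);

    for (double v : x) {
        std::size_t idx = static_cast<std::size_t>((v - lo) / width);
        if (idx >= buckets) idx = buckets - 1;  // max is placed in the last bucket
        if (!used[idx]) {
            used[idx] = true;
            bMin[idx] = bMax[idx] = v;
        } else {
            bMin[idx] = std::min(bMin[idx], v);
            bMax[idx] = std::max(bMax[idx], v);
        }
    }

    // Scan once: the gap is measured from the max of the previous non-empty
    // bucket to the min of the next non-empty bucket.
    double best = 0.0;
    double prevMax = lo;  // bucket 0 always contains lo
    for (std::size_t i = 0; i < buckets; ++i) {
        if (!used[i]) continue;
        best = std::max(best, bMin[i] - prevMax);
        prevMax = bMax[i];
    }
    return best;
}
```

Each element is touched a constant number of times, so the running time is linear in n.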
16. Merging multiple sets into sets with no intersection
Given a collection of sets of strings, merge the sets whose intersection is not empty, so that no two of the merged sets intersect.
(1) Describe your approach to this problem;
(2) give the main processing steps, the algorithm, and the complexity of the algorithm;
(3) describe possible improvements.
Solution 1: Use a union-find (disjoint-set) structure. Initially, every string is in its own disjoint set. Then scan each input set and merge adjacent elements in order. For example, for a set containing aaa, bbb, and ccc, first check whether aaa and bbb are in the same disjoint set. If not, union the sets they belong to. Then check whether bbb and ccc are in the same disjoint set, and if not, union their sets as well. Continue with the other input sets. When all input sets have been scanned, the disjoint sets that remain are the answer. The complexity should be O(N lg N). As improvements, the root of each node can be recorded to speed up find queries, and when merging, the smaller set can be attached to the larger one, which also reduces the complexity.
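A union-find structure with both improvements mentioned above might look like the sketch below. The names `DisjointSet` and `mergeSets` are hypothetical, path halving is used as one way of shortening the path to the root, and the output is sorted only to make the result deterministic.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <numeric>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class DisjointSet {
public:
    explicit DisjointSet(std::size_t n) : parent_(n), size_(n, 1) {
        std::iota(parent_.begin(), parent_.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];  // path halving
            x = parent_[x];
        }
        return x;
    }
    void unite(std::size_t a, std::size_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (size_[a] < size_[b]) std::swap(a, b);  // attach the smaller set to the larger
        parent_[b] = a;
        size_[a] += size_[b];
    }
private:
    std::vector<std::size_t> parent_, size_;
};

std::vector<std::vector<std::string>> mergeSets(
    const std::vector<std::vector<std::string>>& sets) {
    // Give every distinct string a numeric id.
    std::unordered_map<std::string, std::size_t> id;
    std::vector<std::string> names;
    for (const auto& s : sets)
        for (const auto& str : s)
            if (id.emplace(str, names.size()).second) names.push_back(str);

    // Union adjacent elements within each input set.
    DisjointSet ds(names.size());
    for (const auto& s : sets)
        for (std::size_t i = 1; i < s.size(); ++i) ds.unite(id[s[i - 1]], id[s[i]]);

    // Group strings by the root of their disjoint set.
    std::map<std::size_t, std::vector<std::string>> groups;
    for (std::size_t i = 0; i < names.size(); ++i) groups[ds.find(i)].push_back(names[i]);

    std::vector<std::vector<std::string>> result;
    for (auto& g : groups) {
        std::sort(g.second.begin(), g.second.end());
        result.push_back(std::move(g.second));
    }
    std::sort(result.begin(), result.end());
    return result;
}
```

As an illustrative input, the sets {aaa, bbb, ccc}, {bbb, ddd}, {eee, fff}, {ggg}, {ddd, hhh} merge into three groups, because bbb links the first two sets and ddd links the second and the last.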
17. Maximum subsequence and maximum submatrix problems
Maximum subsequence of an array: given an array whose elements are both positive and negative, find a contiguous subsequence whose sum is the largest.
Solution 1: This problem can be solved with dynamic programming. Let b[i] be the largest sum of a subsequence ending at element a[i]. Then clearly b[i+1] = b[i] + a[i+1] if b[i] > 0, and b[i+1] = a[i+1] otherwise. This is quick to implement in code.
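The recurrence needs only the previous value of b, so a single running variable is enough. A minimal sketch, assuming a non-empty input (the function name `maxSubarraySum` is hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumes a is non-empty.
long long maxSubarraySum(const std::vector<int>& a) {
    long long endingHere = a[0];  // b[i]: best sum of a subsequence ending at a[i]
    long long best = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) {
        // Extend the previous subsequence only if its sum is positive.
        endingHere = std::max<long long>(endingHere, 0) + a[i];
        best = std::max(best, endingHere);
    }
    return best;
}
```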
Maximum submatrix problem: given a matrix (a two-dimensional array) whose values vary, some large and some small, find the submatrix whose sum is the largest and output that sum.
Solution 2: This can be solved with an idea similar to the maximum subsequence. If we fix column i and column j, the elements between those columns reduce to a maximum subsequence problem over the rows in that range. Which columns i and j to use can be determined by brute-force search.
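A sketch of this reduction follows. For every pair of columns, the per-row sums between the two columns form a one-dimensional array, and the maximum subsequence scan is run over it. The function name `maxSubmatrixSum` is hypothetical, and the matrix is assumed to be non-empty and rectangular.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumes m is non-empty and every row has the same non-zero length.
long long maxSubmatrixSum(const std::vector<std::vector<int>>& m) {
    const std::size_t rows = m.size();
    const std::size_t cols = m[0].size();
    long long best = m[0][0];
    for (std::size_t left = 0; left < cols; ++left) {
        // rowSum[r] = sum of m[r][left..right]
        std::vector<long long> rowSum(rows, 0);
        for (std::size_t right = left; right < cols; ++right) {
            for (std::size_t r = 0; r < rows; ++r) rowSum[r] += m[r][right];
            // Maximum subsequence scan over the row sums.
            long long endingHere = rowSum[0];
            best = std::max(best, endingHere);
            for (std::size_t r = 1; r < rows; ++r) {
                endingHere = std::max<long long>(endingHere, 0) + rowSum[r];
                best = std::max(best, endingHere);
            }
        }
    }
    return best;
}
```

With R rows and C columns, the brute-force search over column pairs gives a running time of O(C^2 * R).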
Part 2: Bitmap for Massive Data Processing
Bloom filter has already been covered in the previous article, "Bloom filter for Massive Data Processing". This article focuses on Bitmap. If you have any questions, please let me know.
What is bitmap?
Bitmap uses a single bit to mark the value corresponding to an element, and the key is the element itself. Because data is stored at the granularity of bits, the storage space can be greatly reduced.
If that is still unclear, let's look at a concrete example. Suppose we want to sort 5 elements in the range 0 to 7, beginning with 4 and 7 and ending with 3 (we assume the elements are not repeated). We can use the bitmap method to sort them. To represent 8 numbers we need only 8 bits (1 byte). First, allocate 1 byte of space and set all of its bits to 0.
Then traverse the five elements. The first element is 4, so set the bit corresponding to 4 to 1 (this can be done with *(p + (i/8)) |= (0x01 << (i % 8)); the operation involves big-endian versus little-endian ordering, and big-endian is assumed here by default). Because counting starts from zero, this means setting the fifth bit to 1.
Next, process the second element, 7, and set the eighth bit to 1. Then process the third element, and so on until all the elements are processed and their corresponding bits are set to 1. At this point the bits in memory mark exactly the elements that were present.
Finally, traverse the bit area and output the index of every bit that is set to 1. The indices come out in ascending order, ending with 7, so the elements are sorted. The following code shows this use of a bitmap: sorting.

#include <cstdio>
#include <cstring>

// Each byte has eight bits
#define BYTESIZE 8

// Set bit number posi (counting from 0) in the buffer that starts at p
void SetBit(char* p, int posi)
{
    p += posi / BYTESIZE;                   // advance to the byte that holds the bit
    *p = *p | (0x01 << (posi % BYTESIZE));  // set this bit to 1
}

void BitmapSortDemo()
{
    // For simplicity, negative numbers are not considered.
    // Nine distinct sample values with a maximum of 14 (illustrative values).
    int num[] = {3, 5, 2, 10, 6, 12, 8, 14, 9};

    // BufferLen is determined by the maximum value to be sorted.
    // The maximum here is 14, so two bytes (16 bits) are enough.
    const int BufferLen = 2;
    char* pBuffer = new char[BufferLen];

    // All bits must be set to 0 first; otherwise the result is unpredictable.
    memset(pBuffer, 0, BufferLen);

    for (int i = 0; i < 9; i++)
    {
        // Set the bit corresponding to each value to 1
        SetBit(pBuffer, num[i]);
    }

    // Output the sorted result
    for (int i = 0; i < BufferLen; i++)      // process one byte at a time
    {
        for (int j = 0; j < BYTESIZE; j++)   // process each bit in this byte
        {
            // Check whether bit j is 1. The check is a little clumsy:
            // build the mask (0x01 << j), AND it with the byte, and test
            // whether the result equals the mask.
            if ((pBuffer[i] & (0x01 << j)) == (0x01 << j))
            {
                printf("%d ", i * BYTESIZE + j);
            }
        }
    }

    delete[] pBuffer;
}

int main()
{
    BitmapSortDemo();
    return 0;
}
Bitmap supports fast lookup, duplicate detection, and deletion of data. Generally, the data range should be less than 10 times that of int.
Basic principles and key points
A bit array is used to indicate whether certain elements exist, for example 8-digit phone numbers.
Extension
Bloom filter can be seen as an extension of bitmap (for more about Bloom filter, see "Massive Data Processing: Bloom Filter Details").
Example problems
1) A file is known to contain some phone numbers, each of which is an 8-digit number. Count the number of distinct numbers.
An 8-digit number is at most 99,999,999, so about 100 million bits are needed, which is about 12.5 MB of memory. (Think of the numbers from 0 to 99,999,999, each corresponding to one bit: 100,000,000 bits / 8 = 12,500,000 bytes. In this way, roughly 12.5 MB of memory is enough to represent all 8-digit phone numbers.)
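The counting step can be sketched as follows, with one bit per possible 8-digit number. The function name `countDistinctPhones` is hypothetical, numbers are assumed to be already parsed into integers, and skipping out-of-range values is an added assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t countDistinctPhones(const std::vector<std::uint32_t>& numbers) {
    const std::uint32_t kLimit = 100000000;         // 8-digit numbers: 0 to 99,999,999
    std::vector<std::uint8_t> bits(kLimit / 8, 0);  // 12,500,000 bytes
    std::size_t distinct = 0;
    for (std::uint32_t n : numbers) {
        if (n >= kLimit) continue;  // ignore anything longer than 8 digits
        std::uint8_t mask = static_cast<std::uint8_t>(1u << (n % 8));
        if ((bits[n / 8] & mask) == 0) {  // first time this number is seen
            bits[n / 8] |= mask;
            ++distinct;
        }
    }
    return distinct;
}
```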
2) Find the non-repeated integers among 250 million integers. Memory is insufficient to hold all 250 million integers.
Extend the bitmap and use 2 bits to represent each number: 0 means the number has not appeared, 1 means it has appeared once, and 2 means it has appeared twice or more. While traversing the numbers, if the value at the corresponding position is 0, set it to 1; if it is 1, set it to 2; if it is 2, leave it unchanged. Alternatively, two ordinary bitmaps can be used together to simulate the 2-bit bitmap, with the same effect.