Summary of methods for processing massive data volumes

Source: Internet
Author: User

The following is a general summary of methods for processing massive data. These methods may not cover every problem, but they can handle the vast majority of problems you will encounter. The example questions are mostly taken from company interview tests, and the methods given are not necessarily the best; if you have a better solution, please discuss it with me.

1. Bloom filter

Applicability: it can be used to implement a data dictionary, check data for duplicates (set membership), or compute the intersection of sets.

Basic principles and key points:
The principle is very simple: a bit array plus k independent hash functions. To insert an element, set the bits at the positions given by the k hash functions to 1. To look an element up, check whether all of the corresponding bits are 1; if they are, the element is probably in the set. Obviously this process does not guarantee that the result is 100% correct (false positives are possible). Also, an inserted keyword cannot be deleted, because the bits it sets may be shared by other keywords. A simple improvement is the counting Bloom filter, which supports deletion by replacing the bit array with an array of counters.

Another important question is how to determine the size m of the bit array and the number of hash functions from the number of input elements n. The error rate is minimized when the number of hash functions is k = (ln 2) × (m/n). If the error rate must not exceed E, then m must be at least n × log2(1/E) to represent a set of any n elements. But m should be larger than that, because at least half of the bit array should still be 0; this gives m ≥ n × log2(1/E) × log2(e), which is roughly 1.44 × n × log2(1/E) (log2 denotes the base-2 logarithm).

For example, if the error rate is 0.01, then m should be about 13 times n, and k is then about 8.

Note that m and n have different units: m is measured in bits, while n is a count of elements (strictly speaking, of distinct elements). Since a single element usually occupies many bits, using a Bloom filter usually saves memory.
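For illustration, here is a minimal Bloom filter sketch in Python. It is not a library implementation: the class name is made up, the k hash functions are simulated by salting a single md5 hash, and the sizing follows the formula above.

import hashlib
import math

class BloomFilter:
    # Minimal sketch: a bit array of m bits plus k (simulated) hash functions.
    def __init__(self, n, error_rate=0.01):
        # m >= 1.44 * n * log2(1/E), as derived above.
        self.m = int(math.ceil(1.44 * n * math.log2(1.0 / error_rate)))
        # Optimal number of hash functions: k = (m/n) * ln 2.
        self.k = max(1, round((self.m / n) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k hash positions by salting one hash function.
        for i in range(self.k):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter(n=1000, error_rate=0.01)
bf.add("http://example.com/a")
print(bf.might_contain("http://example.com/a"))   # True
print(bf.might_contain("http://example.com/b"))   # almost certainly False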

Extension:
The Bloom filter maps each element of the set to k positions in the bit array (k is the number of hash functions). If any of the k mapped bits is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set. The counting Bloom filter (CBF) extends each bit of the bit array to a counter, which makes deletion of elements possible. The Spectral Bloom Filter (SBF) goes further and associates the counters with the number of occurrences of each element; SBF uses the minimum value among an element's counters to approximate its occurrence frequency.

Problem example: you are given two files, A and B, each containing 5 billion URLs, with each URL occupying 64 bytes, and a memory limit of 4 GB. Find the URLs common to A and B. What if there are three or even n files?

For this problem, let us estimate the memory. 4 GB = 2^32 bytes, which is about 4 billion bytes, or about 34 billion bits. With n = 5 billion and an error rate of 0.01, about 65 billion bits are required. The 34 billion bits we have are not far off; using them will somewhat increase the error rate. In addition, if these URLs correspond one-to-one with IP addresses, you can convert them into IP addresses, which makes things much simpler.

2. Hashing

Applicability: a basic data structure for fast lookup and deletion; it usually requires that the total data volume fit in memory.

Basic principles and key points:
Choice of hash function: strings, integers, and permutations each have suitable, specific hash methods.
For collision handling, one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.

Extension:
The d in d-left hashing means "multiple". Let us simplify the problem and look at 2-left hashing first. 2-left hashing divides a hash table into two halves of equal length, T1 and T2, and assigns a hash function to each: h1 for T1 and h2 for T2. When a new key is stored, both hash functions are applied to obtain two addresses, h1[key] and h2[key]. Check position h1[key] in T1 and position h2[key] in T2, see which position already holds more keys (i.e., has had more collisions), and store the new key in the less loaded position. If the loads are equal, for example both positions are empty or both hold one key, the new key is stored in the left subtable T1; this is where the "2-left" comes from. When looking up a key, you must hash twice and check both positions.
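The following Python sketch illustrates the 2-left scheme just described; the table size, bucket lists, and salted md5 hash functions are illustrative choices, not a standard implementation.

import hashlib

class TwoLeftHashTable:
    # Two sub-tables T1 and T2 of equal length, each with its own hash function.
    def __init__(self, buckets_per_table=1024):
        self.size = buckets_per_table
        self.t1 = [[] for _ in range(self.size)]   # left sub-table
        self.t2 = [[] for _ in range(self.size)]   # right sub-table

    def _h(self, salt, key):
        return int(hashlib.md5(("%d:%s" % (salt, key)).encode()).hexdigest(), 16) % self.size

    def insert(self, key):
        b1 = self.t1[self._h(1, key)]
        b2 = self.t2[self._h(2, key)]
        # Store the key in the less loaded bucket; ties go to the left table, hence "2-left".
        (b1 if len(b1) <= len(b2) else b2).append(key)

    def contains(self, key):
        # Lookup hashes twice and checks both candidate buckets.
        return key in self.t1[self._h(1, key)] or key in self.t2[self._h(2, key)]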

Problem example:
1) From massive log data, extract the IP address that visited Baidu most often on a certain day.

The number of distinct IP addresses is limited, at most 2^32, so you can use a hash map to store the IP addresses directly in memory and count their occurrences.
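A minimal sketch of this in Python, assuming the log has one IP address per line (the file name is hypothetical):

from collections import Counter

counts = Counter()
with open("access.log") as log:        # hypothetical log file, one IP per line
    for line in log:
        counts[line.strip()] += 1      # hash map keyed by IP address

top_ip, top_hits = counts.most_common(1)[0]
print(top_ip, top_hits)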

3. bit-map

Applicability: fast lookup, duplicate detection, and deletion of data; generally the data range is less than about 10 times the range of an int.

Basic principle and key points: use a bit array to indicate whether each element is present, for example the set of 8-digit phone numbers.

Extension: the bloom filter can be seen as an extension of bit-map.

Problem example:

1) A file is known to contain some phone numbers, each 8 digits long; count the number of distinct numbers.

The largest 8-digit number is 99,999,999, so about 99 million bits, a little over 10 MB of memory, are required.

2) Find the number of non-repeated integers among 250 million integers; the memory space is insufficient to hold all 250 million integers.

Extend the bit-map and use 2 bits per number: 00 means the number has not appeared, 01 means it appeared once, and 10 means it appeared twice or more. Alternatively, we can use two ordinary bit-maps to simulate this 2-bit-map.
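A sketch of the 2-bit bit-map in Python, assuming the integers are 32-bit unsigned values; the helper names are made up for the example. The full table needs 2^32 * 2 bits = 1 GB.

bitmap = bytearray(2 ** 32 // 4)        # 2 bits per value, 4 values per byte => 1 GB

def get_state(x):
    byte, slot = divmod(x, 4)
    return (bitmap[byte] >> (slot * 2)) & 0b11

def set_state(x, state):
    byte, slot = divmod(x, 4)
    bitmap[byte] = (bitmap[byte] & ~(0b11 << (slot * 2)) & 0xFF) | (state << (slot * 2))

def add(x):
    s = get_state(x)                    # 0 = unseen, 1 = seen once, 2 = seen two or more times
    if s < 2:
        set_state(x, s + 1)

# Feed all 250 million integers through add(); the non-repeated integers
# are exactly those x with get_state(x) == 1.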

4. Heap

Applicability: finding the top n of massive data, where n is relatively small so that the heap fits in memory.

Basic principle and key points: use a max-heap to find the n smallest values and a min-heap to find the n largest. For example, to find the n smallest, compare each incoming element with the largest element in the max-heap; if it is smaller, it replaces that largest element. In the end, the n elements left in the heap are the smallest n. This is suitable for large data volumes with a relatively small n, because a single scan over the data yields all of the top n elements, which is very efficient.

Extension: dual heaps, a max-heap combined with a min-heap, can be used to maintain the median.
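A sketch of the dual-heap idea using Python's heapq module (heapq only provides a min-heap, so the max-heap stores negated values):

import heapq

class RunningMedian:
    # Max-heap (lo) holds the smaller half, min-heap (hi) holds the larger half.
    def __init__(self):
        self.lo = []   # max-heap simulated with negated values
        self.hi = []   # min-heap

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # Move the largest of the lower half up, so every lo element <= every hi element.
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # Rebalance so lo holds at most one extra element.
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2.0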

Problem example:
1) Find the 100 largest numbers in a large collection of numbers.

Use a min-heap of 100 elements.
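A minimal sketch in Python; the data source is just an illustrative random generator.

import heapq
import random

def stream():                            # illustrative data source
    for _ in range(1000000):
        yield random.randint(0, 10 ** 9)

heap = []                                # min-heap of the 100 largest values seen so far
for x in stream():
    if len(heap) < 100:
        heapq.heappush(heap, x)
    elif x > heap[0]:                    # larger than the smallest of the current top 100
        heapq.heapreplace(heap, x)

top100 = sorted(heap, reverse=True)

For a one-off computation, heapq.nlargest(100, stream()) produces the same result in a single call.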

5. Double-layer bucket division

Applicability: finding the k-th largest element, the median, and non-repeated or repeated numbers.

Basic principle and key points: because the element range is too large to use a direct addressing table, the range is narrowed step by step through repeated partitioning until it is small enough to handle directly. The range can be reduced more than once; the two-layer version is just one example.

Extension:

Problem example:
1) Find the number of non-repeated integers among 250 million integers; the memory space is insufficient to hold all 250 million integers.

This is a bit like the pigeonhole principle. There are 2^32 possible integer values, so we can divide the range of 2^32 values into 2^8 regions (for example, one file per region) and separate the data into the different regions. Then a bitmap can be used within each region to solve the problem directly. In other words, as long as there is enough disk space, the problem is easily solved.

2) Find the median of 500 million ints.

This example is clearer than the one above. First, divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region the median falls into, and also which rank within that region corresponds to the median. Then, in a second scan, we only need to count the numbers that fall into that region.

In fact, if the type is int64 rather than int, we can reduce the range to an acceptable size with three rounds of such partitioning. That is, first divide the int64 range into 2^24 regions and determine which rank within a region the target value has; then divide that region into 2^20 sub-regions and again determine the rank within the sub-region. Since a sub-region covers only 2^20 values, a direct addressing table can be used to finish the count.
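A sketch of the two-scan version for 32-bit signed ints, bucketing by the high 16 bits; read_ints is a hypothetical function that re-reads the data from disk on each call.

def median_by_regions(read_ints, total_count):
    # Rank of the (lower) median, 0-based.
    target = (total_count - 1) // 2

    # First scan: count how many values fall into each of the 2^16 regions.
    region_counts = [0] * (1 << 16)
    for x in read_ints():
        region_counts[(x + 2 ** 31) >> 16] += 1   # shift signed ints into [0, 2^32)

    # Find the region containing the target rank, and the rank inside that region.
    seen = 0
    for region, c in enumerate(region_counts):
        if seen + c > target:
            rank_in_region = target - seen
            break
        seen += c

    # Second scan: count exact values, but only inside the chosen region.
    value_counts = [0] * (1 << 16)
    for x in read_ints():
        u = x + 2 ** 31
        if (u >> 16) == region:
            value_counts[u & 0xFFFF] += 1

    seen = 0
    for low, c in enumerate(value_counts):
        if seen + c > rank_in_region:
            return ((region << 16) | low) - 2 ** 31
        seen += c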

6. Database Indexes

Applicability: insertion, deletion, update, and query of large data volumes.

Basic principles and key points: use database design and indexing techniques to handle insertion, deletion, update, and query of massive data.
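As a small illustration, the sketch below uses Python's built-in sqlite3 module; the table, column, and index names are made up for the example.

import sqlite3

conn = sqlite3.connect("massive.db")     # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS visits (ip TEXT, url TEXT, ts INTEGER)")
# A B-tree index turns full-table scans into logarithmic lookups on ip.
conn.execute("CREATE INDEX IF NOT EXISTS idx_visits_ip ON visits (ip)")

conn.execute("INSERT INTO visits VALUES (?, ?, ?)", ("1.2.3.4", "/index.html", 0))
count = conn.execute("SELECT count(*) FROM visits WHERE ip = ?", ("1.2.3.4",)).fetchone()[0]
conn.commit()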
Extension:
Problem example:


7. Inverted index

Applicability: search engines and keyword queries.

Basic principle and key points: why is it called an inverted index? It is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents.

Take English as an example. The text to be indexed is as follows:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We get the following inverted file index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
A search for "what", "is", and "it" corresponds to the intersection of these sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.

A forward index is developed to store the list of words in each document. Queries on a forward index typically involve frequent, in-order full-text scans of each document and verification of each word within a given document. In a forward index, the document occupies the central position: each document points to the sequence of index terms it contains. In other words, the document points to the words it contains, whereas the inverted index has words pointing to the documents that contain them; it is easy to see the inverted relationship.
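A minimal Python sketch that builds the inverted index from the example above and answers the conjunctive query by set intersection:

docs = {
    0: "it is what it is",
    1: "what is it",
    2: "it is a banana",
}

# Build the inverted index: word -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def search(*words):
    # A conjunctive query is the intersection of the posting sets.
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()

print(search("what", "is", "it"))   # {0, 1}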

Extension:

Problem example: a document retrieval system that finds the files containing certain words, for example keyword search over academic papers.

8. External sorting

Applicability: Sorting big data and removing duplicates

Basic principles and key points: the merge method of external sorting, the principle of replacement selection with a loser tree, and the optimal merge tree.

Extension:

Problem example:
1) There is a 1 GB file in which each line contains one word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

This data has obvious characteristics: each word is at most 16 bytes, but 1 MB of memory is not enough to build a hash table, so external sorting can be used instead, with the memory serving as the input buffer.
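A sketch of this approach in Python: split the file into sorted runs that fit in memory, merge the runs with heapq.merge, and count equal adjacent words while keeping a top-100 min-heap. The input file name and the run size are illustrative.

import heapq
import itertools
import os
import tempfile

RUN_SIZE = 50000                         # words per in-memory run

def sorted_runs(path):
    # Split the big file into sorted temporary run files.
    runs = []
    with open(path) as f:
        while True:
            chunk = list(itertools.islice(f, RUN_SIZE))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile("w+", delete=False)
            tmp.writelines(chunk)
            tmp.close()
            runs.append(tmp.name)
    return runs

def top_words(path, n=100):
    runs = sorted_runs(path)
    files = [open(r) for r in runs]
    heap = []                            # min-heap of (count, word), size <= n
    # heapq.merge streams the runs in globally sorted order, so equal words
    # arrive adjacently and can be counted with groupby.
    for word, group in itertools.groupby(heapq.merge(*files)):
        count = sum(1 for _ in group)
        if len(heap) < n:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, word))
    for f in files:
        f.close()
    for r in runs:
        os.remove(r)
    return [w.strip() for c, w in sorted(heap, reverse=True)]

# top_words("words.txt")                 # hypothetical 1 GB input file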

9. trie tree

Applicability: large data volume with many duplicates, but the number of distinct items is small enough to fit in memory.

Basic principle and key points: the implementation approach and how each node represents its children.

Extension: compressed implementations.
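A minimal trie sketch in Python, keeping a count at each terminal node so that duplicate strings (such as repeated queries) can be tallied:

class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.count = 0       # how many times the word ending here was inserted

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
        return node.count    # occurrences of this word seen so far

    def count(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count

t = Trie()
for q in ["foo", "bar", "foo"]:
    t.insert(q)
print(t.count("foo"))   # 2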

Problem example:
1) There are 10 files of 1 GB each; every line of every file stores a user query, and the queries in each file may be repeated. Sort the queries by frequency.

2) There are 10 million strings, some of which are duplicates. You need to remove all the duplicates and keep the strings that are not repeated. How would you design and implement this?

3) Find hot queries: the query strings are highly repetitive. Although the total is 10 million, there are no more than 3 million distinct queries after removing duplicates, and each query is no longer than 255 bytes.

10. Distributed processing (MapReduce)

Applicability: large data volume, but the number of distinct items is small enough to fit in memory.

Basic principles and key points: hand the data to different machines for processing, partitioning the data and reducing the results.

Extension:

Problem example:

1) The canonical example application of MapReduce is a process that counts the appearances of each different word in a set of documents:
void map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        EmitIntermediate(w, 1);

void reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    int result = 0;
    for each v in partialCounts:
        result += ParseInt(v);
    Emit(result);
Here, each document is split into words, and each word is counted by the map function with an initial value of "1", using the word as the result key. The framework groups together all the pairs with the same key and feeds them to the same call of reduce, so this function just needs to sum all of its input values to find the total number of appearances of that word.

2) Massive data is distributed across 100 machines; find a way to efficiently count the top 10 items of all the data combined.

3) There are N machines in total, each holding N numbers. Each machine can store at most O(N) numbers and operate on them. How do you find the median of the N^2 numbers?


Classic Problem Analysis

Given tens of millions or hundreds of millions of data records (with duplicates), count the N records that appear most frequently. There are two cases: the data can be read into memory at once, or it cannot.

Available ideas: trie tree + heap, database index, grouping subset statistics, hash, distributed computing, approximate statistics, external sorting

Whether the data "can be read into memory at once" actually refers to the data volume after removing duplicates. If the de-duplicated data fits in memory, we can build a dictionary over the data, for example with a map, hashmap, or trie, and then do the counting directly. When updating the occurrence count of each record we could also use a heap to maintain the N most frequent records seen so far, but this increases the maintenance cost; it is less efficient than computing the top N once the counting is complete.

If the data cannot fit in memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation; the change would be to store the dictionary on disk rather than in memory, in the way a database storage engine does.

Of course, there is also a better way: distributed computing, which is essentially the map-reduce process. First, partition the data across different machines by value range, based either on the data values themselves or on their hash values (e.g., MD5); ideally each partition can then be read into memory at once, so that each machine is responsible for a particular range of values. This is effectively the map step. Once the results are obtained, each machine only needs to report its N most frequent records; these are merged and the overall N most frequent records are selected from them, which is effectively the reduce step.

You might be tempted to simply split the data evenly across machines, but that does not yield the correct answer, because one record's occurrences may be spread evenly over many machines while another's may be concentrated on a single machine. For example, suppose we want the 100 most frequent items and we spread 10 million records evenly over 10 machines, find the top 100 on each machine, and then merge. This does not guarantee finding the true 100th most frequent item: an item with 10,000 total occurrences might be split across the 10 machines so that each machine sees only 1,000 of them, while each machine also holds, say, 1,001 items that occur more than 1,000 times locally and are stored entirely on that machine; the item with 10,000 occurrences would then be eliminated everywhere. Even if we let each machine report its 1,000 most frequent items and merge those, an error can still occur, because there may be a large number of items with around 1,001 occurrences spread across machines. Therefore, the data should not simply be divided evenly across machines; instead, records should be hashed and mapped to machines, so that each machine processes a particular range of values.
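A single-process sketch of this hash-partitioned top-N scheme in Python (the number of partitions stands in for the number of machines, and all names are illustrative). Because every occurrence of a record hashes to the same partition, each local count is complete, which is exactly why hashing works where even splitting does not.

import heapq
from collections import Counter

N = 10
NUM_PARTITIONS = 4                       # stands in for the number of machines

def partition(records):
    # Map step: route each record to a partition by its hash,
    # so all occurrences of the same record land on the same "machine".
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for r in records:
        parts[hash(r) % NUM_PARTITIONS].append(r)
    return parts

def local_top_n(part):
    # Each "machine" counts its own partition and returns its N most frequent records.
    return Counter(part).most_common(N)

def global_top_n(records):
    # Reduce step: merge the per-partition candidates and take the overall top N.
    merged = []
    for part in partition(records):
        merged.extend(local_top_n(part))
    return heapq.nlargest(N, merged, key=lambda kv: kv[1])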

The external sorting method consumes a lot of I/O and is not very efficient. The distributed method above can also be used on a single machine: divide the total data into multiple sub-files by value range, process them one by one, and finally merge the words and their occurrence frequencies; in effect, an external-sorting-style merge process can be used.

In addition, we can also consider approximate computation: by exploiting the properties of natural language, we keep only the words that actually occur most frequently as the dictionary, so that it fits in memory.
