An introduction to methods for processing massive data

1. Bloom Filter

Scope of application: implementing a data dictionary, checking data for duplicates (membership testing), or finding the intersection of sets

Basic principles and key points:
The principle is very simple: a bit array plus k independent hash functions. To insert a keyword, set the bits at the positions given by its k hash values to 1; to look up a keyword, check whether all k corresponding bits are 1. Obviously this process does not guarantee that the result of the lookup is 100% correct (false positives are possible). It also does not support deleting a keyword that has already been inserted, because the bits belonging to that keyword may be shared with other keywords. A simple improvement is the counting Bloom filter, which supports deletion by replacing the bit array with an array of counters.
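
As a rough illustration of the bit-array-plus-k-hash-functions idea, here is a minimal Python sketch. The way the k hash functions are derived (double hashing over an MD5 digest) and the class and parameter names are assumptions made for this example, not something prescribed by the text.

import hashlib

class BloomFilter:
    # Minimal Bloom filter: a bit array plus k derived hash functions.

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from two base hashes (double hashing); an assumption
        # for the sketch -- any k independent hash functions would do.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # All k bits set means "probably present": false positives are possible, false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

For example, BloomFilter(13 * 1000, 8) would follow the rule of thumb given below (about 13 bits and 8 hash functions per element) for a set of roughly 1,000 elements.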

There is a more important question: how to determine the size m of the bit array and the number of hash functions k from the number of input elements n. The error rate is minimized when the number of hash functions is k = (ln 2) × (m/n). For an error rate no greater than E, m must be at least n × log2(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit array should remain 0; in that case m should be at least n × log2(1/E) × log2(e), which is roughly 1.44 × n × log2(1/E).

For example, if we assume an error rate of 0.01, then m should be roughly 13 times n, and k about 8.

Note that m and n have different units: m is measured in bits, while n is a number of elements (more precisely, a number of distinct elements). Since a single element is usually many bits long, using a Bloom filter usually saves memory.

Extended:
A Bloom filter maps each element of the set to k positions in the bit array (k being the number of hash functions); whether all k positions are 1 indicates whether the element is in the set. A Counting Bloom Filter (CBF) expands each bit of the array into a counter, which makes deletion of elements possible. A Spectral Bloom Filter (SBF) further associates the counters with the number of occurrences of each element; SBF uses the minimum value among an element's counters to approximate how often that element appears.

Problem example: you are given two files, a and b, each storing 5 billion URLs, each URL taking 64 bytes, and a memory limit of 4 GB. Find the URLs common to files a and b. What if there are three or even n files?

Based on this problem, let's calculate the memory usage: 4 GB = 2^32 bytes, which is about 4 billion bytes, or roughly 34 billion bits. With n = 5 billion and a required error rate of 0.01, we need about 13 × 5 billion ≈ 65 billion bits. Only about 34 billion are available; the gap is not huge, but it will push the error rate up somewhat. In addition, if these URLs correspond one-to-one with IP addresses, they can be converted to IPs, which makes the problem much simpler.

2. Hashing

Scope of application: a basic data structure for fast lookup and deletion; usually the total amount of data must fit into memory

Basic principles and key points:
Choice of hash function: strings, integers and permutations each have their own suitable hash methods.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.

Extended:
The "d" in d-left hashing stands for multiple. Let's simplify the problem first and look at 2-left hashing. 2-left hashing means dividing a hash table into two halves of equal length, called T1 and T2, and giving T1 and T2 each their own hash function, h1 and h2. When a new key is stored, it is hashed with both functions, yielding two addresses h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which location already holds more (colliding) keys, and store the new key in the less loaded location. If the two sides are equally loaded, for example both positions are empty or both hold one key, the new key is stored in the left sub-table T1; that is where the "2-left" comes from. When looking up a key, both hashes must be computed and both positions checked.
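
The following is a minimal Python sketch of the insertion and lookup rules just described; the bucket representation (lists), the table size and the hash functions are assumptions made for the example.

class TwoLeftHashTable:
    # 2-left hashing: two half-tables T1/T2, insert into the less-loaded bucket.

    def __init__(self, half_size=1024):
        self.half_size = half_size
        self.t1 = [[] for _ in range(half_size)]
        self.t2 = [[] for _ in range(half_size)]

    def _h1(self, key):
        return hash(("left", key)) % self.half_size

    def _h2(self, key):
        return hash(("right", key)) % self.half_size

    def insert(self, key):
        b1 = self.t1[self._h1(key)]
        b2 = self.t2[self._h2(key)]
        # Ties (including both buckets empty) go to the left table T1 -- hence "2-left".
        if len(b1) <= len(b2):
            b1.append(key)
        else:
            b2.append(key)

    def lookup(self, key):
        # Both candidate positions must be checked.
        return key in self.t1[self._h1(key)] or key in self.t2[self._h2(key)]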

Problem Example:
1). From massive log data, extract the IP address that visited Baidu the most times on a given day.

The number of possible IP addresses is still limited, at most 2^32, so you can consider hashing the IPs directly into memory and then computing the statistics.
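
A minimal sketch of this direct approach in Python; the log path and the assumption that the IP is the first field of each line are illustrative, not part of the original problem statement.

from collections import Counter

def most_frequent_ip(log_path):
    # Since there are at most 2^32 distinct IPs, a hash map of counts fits in memory.
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            if line.strip():
                ip = line.split()[0]   # assumed log format: the IP is the first field
                counts[ip] += 1
    return counts.most_common(1)[0]    # (ip, count) of the most frequent IP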

3. Bit-map

Scope of application: fast lookup, duplicate checking and deletion of data; generally the data range is within about 10 times the range of an int

Fundamentals and key points: use a bit array to indicate whether certain elements exist, for example the set of 8-digit phone numbers

Extension: the Bloom filter can be seen as an extension of the bit-map

Problem Example:

1) A file is known to contain some phone numbers, each 8 digits long; count the number of distinct numbers.

8 digits go up to 99,999,999, so roughly 100 million bits are needed, which is a little over 10 MB of memory.
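
A minimal Python sketch of this bit-map, assuming the numbers arrive one per line in a text file; the function name and input format are made up for the example.

def count_distinct_phone_numbers(path):
    # One bit per possible 8-digit number: 10^8 bits is about 12.5 MB.
    bitmap = bytearray(100_000_000 // 8 + 1)
    distinct = 0
    with open(path) as f:
        for line in f:
            n = int(line.strip())
            byte, bit = divmod(n, 8)
            if not bitmap[byte] & (1 << bit):
                bitmap[byte] |= 1 << bit
                distinct += 1
    return distinct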

2) Among 250 million integers, find the integers that appear only once (are not repeated); there is not enough memory to hold all 250 million integers.

Extend the bit-map by using 2 bits per number: 00 means not present, 01 means it appears once, and 10 means it appears twice or more. Alternatively, instead of a 2-bit representation, we can simulate this 2-bit map with two separate bit-maps.
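
A minimal Python sketch of the 2-bit variant, assuming the inputs are unsigned 32-bit integers supplied as an iterable; the roughly 1 GB table size follows from 2 bits per value over the 2^32 range.

def find_non_repeated(stream):
    # 2-bit states: 00 = not seen, 01 = seen once, 10 = seen twice or more.
    table = bytearray(2 ** 32 // 4)            # 2 bits per 32-bit value, about 1 GB
    for n in stream:
        byte, slot = divmod(n, 4)
        shift = 2 * slot
        state = (table[byte] >> shift) & 0b11
        if state < 2:                          # saturate at "twice or more"
            table[byte] = (table[byte] & ~(0b11 << shift)) | ((state + 1) << shift)
    # Values still in state 01 appeared exactly once.
    for value in range(2 ** 32):
        byte, slot = divmod(value, 4)
        if (table[byte] >> (2 * slot)) & 0b11 == 0b01:
            yield value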

4. Heap

Scope of application: finding the top n of massive data, where n is small enough that the heap fits into memory

Basic principle and main points: use a max-heap to find the smallest n elements, and a min-heap to find the largest n. For example, to find the smallest n, compare the current element with the largest element in the max-heap; if it is smaller than that maximum, it replaces the maximum. The n elements remaining at the end are the smallest n. This is suitable for large data volumes where n is relatively small, because a single scan over the data yields all of the top n elements, which is efficient.

Extension: A double heap, a maximum heap combined with a minimum heap, can be used to maintain the median.

Problem Example:
1) Find the largest 100 numbers among 1,000,000 numbers.

Use a min-heap of size 100.
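
A minimal sketch with Python's heapq module; heapq.nlargest(100, numbers) would do the same thing, but the explicit loop shows the comparison with the heap root described above.

import heapq

def top_n_largest(numbers, n=100):
    # Min-heap of size n: the root is the smallest of the current top n.
    heap = []
    for x in numbers:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:                # larger than the current n-th largest
            heapq.heapreplace(heap, x)   # pop the root, push x
    return sorted(heap, reverse=True)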

5. Double-layer bucket division

Scope of application: the k-th largest element, the median, non-repeated or repeated numbers

Basic principle and key points: because the range of the elements is very large, a direct addressing table cannot be used, so the range is narrowed step by step through multiple rounds of division until it finally falls within an acceptable size. The range can be reduced several times; the double layer is just one example.

Extended:

Problem Example:
1). Among 250 million integers, find the integers that appear only once (are not repeated); memory space is not enough to hold all 250 million integers.

This is somewhat like the pigeonhole principle. There are 2^32 possible integers, so we can divide this range of 2^32 numbers into 2^8 regions (for example, using a single file to represent each region), distribute the data into the different regions, and then solve each region directly with a bitmap. In other words, as long as there is enough disk space, it can be solved easily.

2). Find the median of 500 million ints.

This example is more obvious than the one above. First we divide the int range into 2^16 regions, then read through the data and count how many numbers fall into each region. From these counts we can determine which region contains the median and what rank within that region the median holds. In a second scan, we only count the numbers that fall into that region.
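
A sketch of this two-pass idea, assuming unsigned 32-bit integers that can be streamed twice (read_numbers is a callable returning a fresh iterator over the data); signed ints would need an offset first.

def streaming_median(read_numbers):
    # Pass 1: count how many numbers fall into each of the 2^16 high-bit regions.
    region_counts = [0] * (1 << 16)
    total = 0
    for n in read_numbers():
        region_counts[n >> 16] += 1
        total += 1
    # Locate the region containing the median (the element at 0-based rank total // 2).
    target = total // 2
    seen = 0
    for region, count in enumerate(region_counts):
        if seen + count > target:
            break
        seen += count
    # Pass 2: count low 16 bits, but only for numbers inside that region.
    low_counts = [0] * (1 << 16)
    for n in read_numbers():
        if n >> 16 == region:
            low_counts[n & 0xFFFF] += 1
    remaining = target - seen
    for low, count in enumerate(low_counts):
        if remaining < count:
            return (region << 16) | low
        remaining -= count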

In fact, if the numbers are int64 rather than int, we can reduce the range to an acceptable level after three such rounds of division: divide the int64 range into 2^24 regions and determine which region the target lies in, divide that region into 2^20 sub-regions and determine the sub-region, and since the sub-region then contains only about 2^20 values, we can use a direct addressing table for the final statistics.

6. Database indexing

Scope of application: insertion, deletion, update and query of large data volumes

The basic principle and key points: use database design and implementation methods to handle insertion, deletion, update and query of massive data.
Extended:
Problem Example:

7. Inverted index

Scope of application: Search engine, keyword query

Rationale and key points: why is it called an inverted index? It is an indexing method used, under full-text search, to store the mapping from a word to its locations in a document or a group of documents.

Taking English as an example, here is the text to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We can get the following inverted file index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
A retrieval condition of "what", "is" and "it" corresponds to the intersection of the sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.
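
A minimal Python sketch that builds the index above and answers the AND query by set intersection; the tokenization (lower-casing and whitespace splitting) is a simplifying assumption.

from collections import defaultdict

def build_inverted_index(documents):
    # Map each word to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
result = index["what"] & index["is"] & index["it"]    # {0, 1}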

A forward index is developed to store the list of words of each document. Queries against a forward index tend to be of the kind that scans each document in order for frequent full-text queries, or verifies each word within a candidate document. In a forward index, the document occupies the central position: each document points to the sequence of index terms it contains. That is, the document points to the words it contains, whereas the inverted index has each word pointing to the documents that contain it; the inverse relationship is easy to see.

Extended:

Problem example: a document retrieval system that finds the files containing a given word, for example keyword search over a collection of academic papers.

8. External sorting

Scope of application: sorting and deduplicating big data

Basic principle and key points: the merge method of external sorting, the replacement-selection loser-tree principle, and the optimal merge tree

Extended:

Problem Example:
1). There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.

This data has an obvious characteristic: words are at most 16 bytes, but 1 MB of memory is not enough to hold a hash table, so sorting can be used instead, with the available memory serving as an input buffer.
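
One way to sketch the merge phase in Python, assuming the 1 GB file has already been split into runs that were each sorted within the 1 MB budget and written one word per line; heapq.merge streams the sorted runs, equal words become adjacent, and a small heap keeps the current top 100. The run-splitting step itself is omitted here.

import heapq
import itertools

def top_words_from_sorted_runs(run_paths, n=100):
    runs = [open(p) for p in run_paths]
    try:
        # Merge the pre-sorted runs lazily; duplicates are now adjacent.
        merged = heapq.merge(*((line.strip() for line in r) for r in runs))
        heap = []                                   # min-heap of (count, word)
        for word, group in itertools.groupby(merged):
            item = (sum(1 for _ in group), word)
            if len(heap) < n:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        return sorted(heap, reverse=True)
    finally:
        for r in runs:
            r.close()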

9. Trie tree

Scope of application: a large amount of data with many repetitions, but where the number of distinct items is small enough to fit into memory

Basic principles and key points: the implementation approach, and how the children of each node are represented
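
A minimal Python sketch of a trie with dictionary children and a per-node counter, which is enough for the frequency-style problems below; the class names are illustrative.

class TrieNode:
    def __init__(self):
        self.children = {}   # child nodes keyed by character
        self.count = 0       # how many times the string ending here was inserted

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
        return node.count    # occurrences of this word so far

    def frequency(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count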

Extension: Compression implementation.

Problem Example:
1). There are 10 files of 1 GB each; each line of each file stores a user query, and queries may repeat across files. Sort the queries by frequency.

2). 10 million strings, some of which are identical (duplicates); remove all the duplicates and keep the strings that are not repeated. How would you design and implement this?

3). Finding popular queries: the query strings have a high degree of repetition; although the total is 10 million, there are no more than 3 million after removing duplicates, and each is shorter than 255 bytes.

10. Distributed Processing MapReduce

Scope of application: a large amount of data, but where the number of distinct items is small enough to fit into memory

Basic principles and key points: hand the data to different machines for processing, partition the data, and then reduce (merge) the results.

Extended:

Problem Example:

1). The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, 1);

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each v in partialCounts:
    result += ParseInt(v);
  Emit(result);

Here, each document is split into words, and each word is counted initially with a "1" value by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce, so this function just needs to sum all of its input values to find the total appearances of that word.

2). Massive data is distributed across 100 computers; think of a way to efficiently compute the top 10 of this data.

3). There are a total of N machines, each holding N numbers. Each machine can store and operate on at most O(N) numbers. How do you find the median of the N^2 numbers?

Classic problem Analysis

Tens of millions or hundreds of millions of data items (with duplicates): count the top N items that occur most often, in two cases: the data can be read into memory at once, or it cannot.

Available ideas: trie tree + heap, database index, partitioning into subsets for statistics, hashing, distributed computing, approximate statistics, external sorting.

Whether the data "can be read into memory at once" should actually refer to the amount of data after removing duplicates. If the deduplicated data fits into memory, we can build a dictionary for the data, for example via a map, hashmap or trie, and then do the statistics directly. Of course, while updating the occurrence count of each item, we can use a heap to maintain the top N items with the most occurrences; this increases the maintenance cost, though, and is less efficient than computing the full statistics first and then selecting the top N.
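
A minimal sketch of the dictionary-plus-heap idea for the in-memory case, with Counter playing the role of the dictionary and heapq selecting the top N afterwards; Counter.most_common(n) would be an equivalent one-liner.

import heapq
from collections import Counter

def top_n_frequent(items, n):
    counts = Counter(items)   # the in-memory "dictionary" of occurrence counts
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])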

If the data cannot fit into memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation; the change would be to store the dictionary on disk rather than in memory, which can borrow from the way databases store data.

Of course, there is a better way: distributed computing, which is basically the map-reduce process. First partition the data, by value or by hashed value (e.g. MD5), into ranges assigned to different machines, ideally so that each partition can be processed in memory in one pass; each machine is then responsible for a range of values, which is effectively the map step. Once the results are obtained, each machine only needs to take out its top N items with the most occurrences; the per-machine top N lists are then merged to select the overall top N, which is effectively the reduce step.

You might be tempted to simply distribute the data across machines at random, but then you cannot obtain the correct answer, because one item may be split across different machines while another may be concentrated entirely on a single machine, and their counts may be close. For example, suppose we want the 100 most frequent items and we spread 10 million items over 10 machines, find the top 100 on each machine and then merge. This does not guarantee finding the true 100th item: that item might occur 10,000 times overall but be split across the 10 machines so that each sees only 1,000 of it, while an item that occurs only 1,001 times but sits entirely on one machine may beat it locally, so the item with 10,000 occurrences gets eliminated. Even if we let each machine report its top 1,000 and merged those, errors would remain, because many items with around 1,000 local occurrences could cluster in this way. Therefore the data should not be partitioned across machines at random; instead, items should be mapped to machines by their hash values, so that each machine handles a definite range of values.

The external sorting approach consumes a lot of IO and is not very efficient. The distributed method above can also be used in a single-machine version: divide the whole data set into several sub-files by value range, process each sub-file individually, and finally merge the words and their frequencies. This merging step can in fact reuse the merge phase of an external sort.

It is also possible to consider approximate computation: by exploiting the properties of natural language, use only the words that actually occur most often in practice as the dictionary, so that it can fit into memory.
