Summary of Methods for Processing Massive Data (C Language)

Source: Internet
Author: User

Questions about processing very large amounts of data come up frequently in written interview tests; companies that routinely deal with massive data, such as Baidu, Google, and Tencent, often ask them.

The following is a general summary of how massive data can be handled. These methods may not cover every problem, but they can deal with most of the problems you will encounter. Some of the questions below come directly from companies' written interview tests, and the methods given are not necessarily optimal; if you have a better way to handle a problem, you are welcome to discuss it with me.

1. Bloom Filter

Scope of application: can be used to implement a data dictionary, to deduplicate data, or to find the intersection of sets

Basic principles and key points:
The principle is simple: a bit array plus k independent hash functions. To insert an element, set to 1 the bits at the positions given by the hash functions; to look up an element, check whether all of those bits are 1. If they are, the element is reported as present, but this process obviously does not guarantee that the result is 100% correct. It also does not support deleting a keyword that has already been inserted, because the bits of that keyword may be shared by other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and thereby supports deletion.

A more important question is how to determine the size m of the bit array and the number of hash functions k from the number n of input elements. The error rate is minimized when k = (ln 2) * (m/n). To keep the error rate no greater than E, m must be at least n*lg(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit array should remain 0; this gives m >= n*lg(1/E)*lg(e), which is roughly 1.44 times n*lg(1/E) (lg denotes the base-2 logarithm).

For example, assuming an error rate of 0.01, m should be about 13 times n, and k about 8.

Note that m and n have different units: m is measured in bits, while n is measured in number of elements (more precisely, the number of distinct elements). Since a single element is usually many bits long, a Bloom filter usually saves memory.
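As an illustration, here is a minimal Bloom filter sketch in C, assuming string keys and using one seeded FNV-style hash to stand in for the k independent hash functions; the constants and function names are illustrative, not part of the original text.

#include <stdint.h>
#include <stdlib.h>

#define NUM_HASHES 8            /* k, roughly ln2 * (m/n) for a 0.01 error rate */

typedef struct {
    uint8_t *bits;              /* bit array of m bits */
    size_t   m;                 /* number of bits */
} bloom_t;

/* FNV-1a style hash with a per-hash seed; any k independent hashes would do. */
static uint64_t bloom_hash(const char *key, uint64_t seed) {
    uint64_t h = 1469598103934665603ULL ^ seed;
    while (*key) {
        h ^= (uint8_t)*key++;
        h *= 1099511628211ULL;
    }
    return h;
}

bloom_t *bloom_create(size_t m) {
    bloom_t *b = malloc(sizeof *b);
    b->m = m;
    b->bits = calloc((m + 7) / 8, 1);
    return b;
}

void bloom_add(bloom_t *b, const char *key) {
    for (uint64_t i = 0; i < NUM_HASHES; i++) {
        uint64_t pos = bloom_hash(key, i) % b->m;
        b->bits[pos / 8] |= (uint8_t)(1u << (pos % 8));    /* set the bit */
    }
}

/* Returns 1 if the key is *probably* present, 0 if it is definitely absent. */
int bloom_query(const bloom_t *b, const char *key) {
    for (uint64_t i = 0; i < NUM_HASHES; i++) {
        uint64_t pos = bloom_hash(key, i) % b->m;
        if (!(b->bits[pos / 8] & (1u << (pos % 8))))
            return 0;
    }
    return 1;
}

For a problem like the one below, one would insert every URL of file a with bloom_add and then test every URL of file b with bloom_query; because of false positives, the matches found are only probably common to both files.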

Extended:
A Bloom filter maps the elements of a set into a bit array and uses whether all k mapped bits are 1 (k being the number of hash functions) to indicate whether an element is in the set. The Counting Bloom Filter (CBF) expands each bit of the bit array into a counter, thereby supporting deletion of elements. The Spectral Bloom Filter (SBF) associates the counters with the number of occurrences of each element; SBF uses the minimum value among an element's counters to approximate its occurrence frequency.

Problem Example: you are given two files, a and b, each storing 5 billion URLs, where each URL occupies 64 bytes and the memory limit is 4 GB. Find the URLs common to files a and b. What if there are three, or even n, files?

Based on this problem, let us calculate the memory usage. 4 GB = 2^32 bytes, roughly 4 billion bytes, or about 34 billion bits. With n = 5 billion and an error rate of 0.01, about 65 billion bits are needed. The 34 billion bits available are not far from that, so the error rate may rise somewhat. In addition, if these URLs correspond one to one with IP addresses, they can be converted to IPs, which makes the problem much simpler.

2. Hashing

Scope of application: a basic data structure for quick lookup and deletion; usually requires that the total amount of data fits into memory

Basic principles and key points:
Hash function selection: for strings, integers, permutations, and so on, there are specific corresponding hash methods.
Collision handling: one approach is open hashing, also known as chaining (the zipper method); the other is closed hashing, also known as open addressing.

Extended:
The d in d-left hashing stands for multiple. To simplify the problem, consider 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, called T1 and T2, and gives each half its own hash function, h1 and h2. When a new key is stored, it is hashed with both functions, yielding two addresses h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which one already holds more keys (has had more collisions), and store the new key in the less loaded position. If both sides are equally loaded, for example both positions are empty or both already hold one key, the new key is stored in the left subtable T1, which is where the name 2-left comes from. When looking up a key, you must hash twice and check both positions.
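The following is a minimal sketch of the 2-left insertion rule just described, assuming fixed-capacity buckets of 32-bit keys; the bucket sizes and the two multiplicative hash functions are illustrative choices, not from the original text.

#include <stdint.h>

#define NUM_BUCKETS 1024        /* buckets per subtable (illustrative) */
#define BUCKET_CAP  4           /* keys each bucket can hold (illustrative) */

typedef struct {
    uint32_t keys[BUCKET_CAP];
    int      count;
} bucket_t;

static bucket_t T1[NUM_BUCKETS], T2[NUM_BUCKETS];

/* Two simple multiplicative hashes standing in for h1 and h2. */
static uint32_t h1(uint32_t key) { return (key * 2654435761u) % NUM_BUCKETS; }
static uint32_t h2(uint32_t key) { return (key * 40503u + 12345u) % NUM_BUCKETS; }

/* Insert per the 2-left rule: put the key in the less loaded bucket,
 * breaking ties toward the left subtable T1. Returns 0 if both are full. */
int twoleft_insert(uint32_t key) {
    bucket_t *b1 = &T1[h1(key)];
    bucket_t *b2 = &T2[h2(key)];
    bucket_t *target = (b2->count < b1->count) ? b2 : b1;  /* tie -> T1 */
    if (target->count == BUCKET_CAP)
        return 0;
    target->keys[target->count++] = key;
    return 1;
}

/* Lookup must hash twice and check both candidate buckets. */
int twoleft_lookup(uint32_t key) {
    bucket_t *b1 = &T1[h1(key)], *b2 = &T2[h2(key)];
    for (int i = 0; i < b1->count; i++) if (b1->keys[i] == key) return 1;
    for (int i = 0; i < b2->count; i++) if (b2->keys[i] == key) return 1;
    return 0;
}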

Problem Example:
1) From massive log data, extract the IP address that visited Baidu the most times on a given day.

The number of possible IP addresses is limited, at most 2^32, so you can consider hashing the IPs directly into memory and then counting them.
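As a sketch of this approach, the following C program counts IPs in a chained hash table and tracks the most frequent one; the table size and hash are illustrative, and the parsing of the log into 32-bit integer IPs is assumed to have been done already.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 20)           /* number of chains (illustrative) */

typedef struct node {
    uint32_t ip;
    uint64_t count;
    struct node *next;
} node_t;

static node_t *table[TABLE_SIZE];

/* Count one occurrence of an IP (stored as a 32-bit integer). */
static node_t *count_ip(uint32_t ip) {
    uint32_t slot = (ip * 2654435761u) % TABLE_SIZE;   /* multiplicative hash */
    for (node_t *n = table[slot]; n; n = n->next)
        if (n->ip == ip) { n->count++; return n; }
    node_t *n = malloc(sizeof *n);
    n->ip = ip; n->count = 1; n->next = table[slot];
    table[slot] = n;
    return n;
}

int main(void) {
    unsigned int ip;
    node_t *best = NULL;
    /* Assume one IP per line, already converted to an integer on stdin. */
    while (scanf("%u", &ip) == 1) {
        node_t *n = count_ip((uint32_t)ip);
        if (!best || n->count > best->count) best = n;
    }
    if (best)
        printf("most frequent ip=%u count=%llu\n", best->ip,
               (unsigned long long)best->count);
    return 0;
}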

3. Bit-map

Scope of application: fast lookup, duplicate checking, and deletion of data; generally the data range is within about 10 times the range of an int

Basic principles and key points: use a bit array to indicate whether certain elements exist, for example 8-digit phone numbers

Extension: Bloom filter can be seen as an extension of the Bit-map

Problem Example:

1) A certain file contains some phone numbers, each 8 digits long; count the number of distinct numbers.

8 digits allow at most 99,999,999 numbers, so about 99 million bits are needed, which is a little over 10 MB of memory.
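A minimal sketch of this bitmap in C, assuming the numbers arrive one per line on standard input; the input handling is illustrative.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_NUMBER 100000000u           /* 8-digit numbers: 0..99,999,999 */

int main(void) {
    /* One bit per possible number: 100,000,000 bits is about 12.5 MB. */
    uint8_t *bitmap = calloc((MAX_NUMBER + 7) / 8, 1);
    unsigned int num;
    unsigned long distinct = 0;
    if (!bitmap) return 1;

    while (scanf("%u", &num) == 1) {
        if (num >= MAX_NUMBER) continue;        /* skip malformed input */
        if (!(bitmap[num / 8] & (1u << (num % 8)))) {
            bitmap[num / 8] |= (uint8_t)(1u << (num % 8));
            distinct++;
        }
    }
    printf("distinct numbers: %lu\n", distinct);
    free(bitmap);
    return 0;
}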

2) Among 250 million integers, find the integers that do not repeat (appear only once); there is not enough memory to hold all 250 million integers.

Extend the bit-map: use 2 bits to represent each number, where 0 means not seen, 1 means seen once, and 2 means seen twice or more. Alternatively, instead of a 2-bit map, we can simulate it with two ordinary bit-maps.
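Below is a minimal sketch of such a 2-bit map in C over the full 32-bit range, which needs 1 GB of memory; under a tighter limit one would combine this with the bucket division of section 5. The helper names are illustrative.

#include <stdint.h>
#include <stdlib.h>

/* A 2-bit-per-value map over the full unsigned 32-bit range:
 * state 0 = not seen, 1 = seen once, 2 = seen twice or more.
 * 2 bits * 2^32 values = 1 GB, which must fit in memory for this sketch. */
static uint8_t *map2;                   /* 4 values packed per byte */

static unsigned get_state(uint32_t x) {
    return (map2[x >> 2] >> ((x & 3) * 2)) & 3u;
}

static void set_state(uint32_t x, unsigned s) {
    unsigned shift = (x & 3) * 2;
    map2[x >> 2] = (uint8_t)((map2[x >> 2] & ~(3u << shift)) | (s << shift));
}

/* Feed every input value through this; afterwards the non-repeating
 * integers are exactly those whose state equals 1. */
void observe(uint32_t x) {
    unsigned s = get_state(x);
    if (s < 2) set_state(x, s + 1);
}

int init_map(void) {
    map2 = calloc(1u << 30, 1);         /* 2^32 values / 4 per byte = 1 GB */
    return map2 != NULL;
}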

4. Heap

Scope of application: the top n of massive data, where n is relatively small, so that the heap fits into memory

Basic principles and key points: use a max-heap for the smallest n and a min-heap for the largest n. For example, to find the smallest n, compare the current element with the largest element in the max-heap; if it is smaller than that maximum, it replaces it. The n elements left at the end are the smallest n. This is suitable for large amounts of data with a relatively small n, because all of the top n can be obtained with a single scan, which is efficient.

Extension: a double heap (a max-heap combined with a min-heap) can be used to maintain the median.

Problem Example:
1) Among 1 million numbers, find the largest 100.

You can use a min-heap of 100 elements.
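A minimal C sketch of this min-heap approach, assuming the numbers arrive on standard input; the helper names are illustrative.

#include <stdio.h>

#define TOP_N 100

static int heap[TOP_N];     /* min-heap: heap[0] is the smallest of the top N */
static int heap_size = 0;

static void sift_down(int i) {
    for (;;) {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < heap_size && heap[l] < heap[smallest]) smallest = l;
        if (r < heap_size && heap[r] < heap[smallest]) smallest = r;
        if (smallest == i) return;
        int t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t;
        i = smallest;
    }
}

static void sift_up(int i) {
    while (i > 0 && heap[i] < heap[(i - 1) / 2]) {
        int p = (i - 1) / 2;
        int t = heap[i]; heap[i] = heap[p]; heap[p] = t;
        i = p;
    }
}

/* Keep the largest TOP_N values seen so far. */
void offer(int x) {
    if (heap_size < TOP_N) {
        heap[heap_size++] = x;
        sift_up(heap_size - 1);
    } else if (x > heap[0]) {           /* larger than the current 100th largest */
        heap[0] = x;
        sift_down(0);
    }
}

int main(void) {
    int x;
    while (scanf("%d", &x) == 1)        /* assume one integer per line on stdin */
        offer(x);
    for (int i = 0; i < heap_size; i++)
        printf("%d\n", heap[i]);        /* the top 100, in heap (unsorted) order */
    return 0;
}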

5. Double-layer bucket division

Scope of application: the k-th largest element, the median, non-repeating or repeating numbers

Basic principles and key points: because the range of the elements is very large, a direct addressing table cannot be used, so the range is narrowed down gradually through multiple rounds of division until it finally falls within an acceptable range. The range can be reduced several times; the double layer is just one example.

Extension

Problem Example:
1) Among 250 million integers, find the integers that do not repeat; there is not enough memory to hold all 250 million integers.

This is somewhat like the pigeonhole principle. The number of possible integers is 2^32, so we can divide this range of 2^32 numbers into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then resolve each region directly with a bitmap. In other words, as long as there is enough disk space, the problem can be solved quite conveniently.

2) Find the median of 500 million ints.

This example is more obvious than the one above. First, we divide the int range into 2^16 regions and, in a first pass, count how many numbers fall into each region. From these counts we can determine which region contains the median, and also which position within that region the median occupies. In a second pass we then count only the numbers that fall into that region.
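Here is a sketch in C of that two-pass idea, assuming the 500 million values are unsigned 32-bit ints stored in a binary file that can be scanned twice; the file name and the choice of the lower median are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

/* Two-pass median of a large set of 32-bit unsigned ints stored in a
 * binary file. Pass 1 buckets values by their high 16 bits (2^16 regions);
 * pass 2 counts exactly within the region that contains the median. */
int main(void) {
    static uint64_t buckets[1 << 16];       /* counts per high-16-bit region */
    static uint64_t low[1 << 16];           /* counts per low-16-bit value  */
    const char *path = "ints.bin";          /* illustrative input file */
    uint64_t total = 0;
    uint32_t x;
    FILE *f = fopen(path, "rb");
    if (!f) return 1;

    /* Pass 1: count how many values fall into each of the 2^16 regions. */
    while (fread(&x, sizeof x, 1, f) == 1) {
        buckets[x >> 16]++;
        total++;
    }
    if (total == 0) return 1;

    /* Locate the region containing the median (lower median if total is even). */
    uint64_t target = (total + 1) / 2, seen = 0;
    uint32_t region = 0;
    while (seen + buckets[region] < target)
        seen += buckets[region++];
    uint64_t rank_in_region = target - seen;    /* 1-based rank inside region */

    /* Pass 2: within the chosen region, count values by their low 16 bits. */
    rewind(f);
    while (fread(&x, sizeof x, 1, f) == 1)
        if ((x >> 16) == region)
            low[x & 0xFFFF]++;
    fclose(f);

    uint64_t acc = 0;
    uint32_t lo = 0;
    while (acc + low[lo] < rank_in_region)
        acc += low[lo++];

    printf("median = %u\n", (region << 16) | lo);
    return 0;
}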

In fact, if the values are not int but int64, we can reduce the range to an acceptable level through three such rounds of division: first divide the int64 range into 2^24 regions and determine the rank of the target number within its region, then divide that region into 2^20 sub-regions and again determine the rank within the sub-region; the number of values in a sub-region is then only about 2^20, so they can be counted directly with a direct addressing table.

6. Database index

Scope of application: large data volumes with insert, delete, update, and query operations

Basic principles and key points: use the design and implementation methods of databases to handle the insertion, deletion, update, and query of massive data.
Extended:
Problem Example:

7. Inverted index

Scope of application: search engines, keyword queries

Basic principles and key points: what is an inverted index? It is an indexing method used, under full-text search, to store the mapping from a word to its locations in a document or a group of documents.

In English, for example, here is the text to be indexed:
T0 = "It is what it is"
T1 = "What Is It"
T2 = "It is a banana"
We obtain the following inverted file index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
A retrieval with the conditions "what", "is", and "it" will correspond to the intersection of these sets.

The forward index was developed to store the list of words of each document. Queries on a forward index tend to be frequent, ordered full-text queries over each document and verification of each word within a document being checked. In a forward index, the document occupies the central position: each document points to a sequence of the index entries it contains, that is, the document points to the words it contains. In the inverted index, by contrast, each word points to the documents that contain it, and the reversed relationship is easy to see.

Extended:

Problem Example: a document retrieval system; query which files contain a given word, for example keyword search over a common collection of academic papers.

8. External sorting

Scope of application: sorting large amounts of data, deduplication

Basic principles and key points: the merge method of external sorting, the principle of replacement selection with a loser tree, and the optimal merge tree.
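Below is a simplified C sketch of the external merge idea: sort fixed-size chunks in memory, write each as a sorted run file, then merge the runs by repeatedly taking the smallest head value. It sorts integers for brevity and picks the minimum by linear scan; a real implementation would use replacement selection and a loser tree as mentioned above, and would handle word data rather than ints. The file names, chunk size, and run limit are illustrative.

#include <stdio.h>
#include <stdlib.h>

#define CHUNK    1000000        /* ints per in-memory run (illustrative) */
#define MAX_RUNS 64             /* at most this many runs in this sketch */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    /* Phase 1: read CHUNK ints at a time from stdin, sort in memory,
     * and write each sorted run to its own temporary file. */
    int *buf = malloc(CHUNK * sizeof *buf);
    FILE *runs[MAX_RUNS];
    int num_runs = 0;
    size_t n;
    if (!buf) return 1;
    do {
        for (n = 0; n < CHUNK && scanf("%d", &buf[n]) == 1; n++) ;
        if (n == 0) break;
        qsort(buf, n, sizeof *buf, cmp_int);
        char name[32];
        snprintf(name, sizeof name, "run%d.tmp", num_runs);
        FILE *f = fopen(name, "w+");
        if (!f) return 1;
        for (size_t i = 0; i < n; i++) fprintf(f, "%d\n", buf[i]);
        rewind(f);
        runs[num_runs++] = f;
    } while (n == CHUNK && num_runs < MAX_RUNS);
    free(buf);

    /* Phase 2: k-way merge by repeatedly taking the smallest head value.
     * (A loser tree would make this selection O(log k) instead of O(k).) */
    int head[MAX_RUNS], alive[MAX_RUNS];
    for (int i = 0; i < num_runs; i++)
        alive[i] = (fscanf(runs[i], "%d", &head[i]) == 1);
    for (;;) {
        int best = -1;
        for (int i = 0; i < num_runs; i++)
            if (alive[i] && (best < 0 || head[i] < head[best])) best = i;
        if (best < 0) break;
        printf("%d\n", head[best]);         /* emit next value in sorted order */
        alive[best] = (fscanf(runs[best], "%d", &head[best]) == 1);
    }
    return 0;
}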

Extended:

Problem Example:
1) There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.

This data has an obvious characteristic: each word is at most 16 bytes, but with only 1 MB of memory a hash table is somewhat insufficient, so external sorting can be used. The memory can serve as the input buffer.

9. Trie tree

Scope of application: large amounts of data with many repetitions, but the number of distinct items is small enough to fit into memory

Basic principles and key points: the implementation approach, and how each node represents its children.
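A minimal trie sketch in C, assuming lower-case ASCII words and representing each node's children as a fixed 26-entry array (one possible representation choice); a counter at each word's end node records its frequency. The names are illustrative.

#include <stdlib.h>

#define ALPHABET 26             /* lower-case ASCII words only (assumption) */

/* Each node represents its children with a fixed array indexed by letter;
 * more compact choices (linked children, ternary search trie) also work. */
typedef struct trie_node {
    struct trie_node *child[ALPHABET];
    long count;                 /* occurrences of the word ending here */
} trie_node;

static trie_node *new_node(void) {
    return calloc(1, sizeof(trie_node));
}

/* Insert a word and bump its occurrence count; returns the new count. */
long trie_insert(trie_node *root, const char *word) {
    trie_node *n = root;
    for (; *word; word++) {
        int i = *word - 'a';
        if (i < 0 || i >= ALPHABET) continue;   /* skip characters outside a-z */
        if (!n->child[i]) n->child[i] = new_node();
        n = n->child[i];
    }
    return ++n->count;
}

/* Look up how many times a word has been inserted (0 if never). */
long trie_count(const trie_node *root, const char *word) {
    const trie_node *n = root;
    for (; *word; word++) {
        int i = *word - 'a';
        if (i < 0 || i >= ALPHABET) continue;
        if (!n->child[i]) return 0;
        n = n->child[i];
    }
    return n->count;
}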

Extensions: Compact implementations.

Problem Example:
1) There are 10 files of 1 GB each; every line of every file stores a user query, and queries may be repeated across the files. Sort the queries by their frequency.

2) 10 million strings, some of which are identical (duplicates); remove all duplicates so that no duplicate strings remain. How would you design and implement this?

3) Find hot queries: the query strings have a high degree of repetition; although the total is 10 million, after removing duplicates there are no more than 3 million, each no longer than 255 bytes.

10. Distributed processing: MapReduce

Scope of application: large amounts of data, but the number of distinct kinds of data is small enough to fit into memory

Basic principles and key points: hand the data to different machines to process, partition the data, and reduce the results.

Extension

Problem Example:

1) The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:
void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each v in partialCounts:
    result += ParseInt(v);
  Emit(result);
Here, each document is split into words, and each word is counted initially with a "1" value by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce, so this function just needs to sum all of its input values to find the total appearances of that word.

2) Massive data is distributed across 100 computers; find a way to efficiently compute the TOP 10 of this whole batch of data.

3) There are n machines in total, and each machine holds n numbers. Each machine can store and operate on at most O(n) numbers. How do you find the median of all n^2 numbers?

Classical problem analysis

Given tens of millions or hundreds of millions of data items (with duplicates), find the n items that occur most frequently. There are two cases: the data can be read into memory at once, or it cannot.

Available approaches: trie tree plus heap, database index, partitioning into subsets and counting, hashing, distributed computing, approximate statistics, external sorting.

Whether the data "can be read into memory at once" should in fact refer to the amount of data after deduplication. If the deduplicated data fits into memory, we can build a dictionary for the data, for example a map, hash map, or trie, and then do the counting directly. Of course, while updating the occurrence count of each item, we can use a heap to maintain the n items with the highest counts; this increases the maintenance cost, and is less efficient than counting everything first and then taking the top n.

If the data cannot fit into memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation; the change would be to store the dictionary on disk rather than in memory, for which we can refer to how databases store data.

Of course, a better way is to use distributed computing, essentially the map-reduce process. First, the data is partitioned across different machines by value range, or by ranges of the hash (e.g. MD5) of the values, ideally so that each partition can fit into memory at once; each machine is then responsible for one range of values, which is effectively the map step. After obtaining the results, each machine only needs to produce its own n most frequent items, and these are then combined to select the n most frequent items over all the data, which is effectively the reduce step.

In fact, you might want to distribute the data directly to different machines at random for processing, but then you cannot obtain a correct result. The reason is that one item may be spread across different machines, while another may be entirely concentrated on a single machine, and different items can have equal counts. For example, suppose we want the top 100 and we scatter 10 million items over 10 machines, then take the top 100 on each machine. Merging these does not guarantee we find the true 100th item: the item really ranked 100th overall might have 10,000 occurrences, but split across the 10 machines there may be only 1,000 on each, while other items with, say, 1,001 occurrences happen to be concentrated on single machines, so the item with 10,000 occurrences would be eliminated. Even if we let each machine pick its 1,000 most frequent items and then merge, errors are still possible, because a large number of items with 1,001 occurrences may be clustered on one machine. Therefore, the data must not be divided across machines at random; instead, the items must be mapped to machines by the hash of their values, so that each machine processes its own range of values.
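The partitioning rule this argument leads to can be stated in a few lines of C: route every record by a hash of its key so that all occurrences of the same key land on the same machine (or the same sub-file in the single-machine variant discussed next). The hash function and the machine count here are illustrative.

#include <stdint.h>

#define NUM_MACHINES 10         /* or number of sub-files (illustrative) */

/* Simple string hash (djb2); any reasonably uniform hash, or MD5, works. */
static uint64_t hash_key(const char *key) {
    uint64_t h = 5381;
    while (*key)
        h = h * 33 + (uint8_t)*key++;
    return h;
}

/* All records sharing a key get the same partition, so per-partition
 * counts are complete and the per-partition top-N can be merged safely. */
int partition_for(const char *key) {
    return (int)(hash_key(key) % NUM_MACHINES);
}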

The external-sorting approach, by contrast, consumes a lot of IO and is not very efficient. The distributed method above can also be used in a single-machine version: divide the total data into several sub-files by value range, process each one individually, and after processing merge the words and their occurrence frequencies. This merge can in fact reuse the merging process of an external sort.

We can also consider approximate computation: by exploiting properties of natural language, we can take only the words that actually occur most often in practice as the dictionary, so that this scale fits into memory.
