1. Bloom Filter
Scope of application: can be used to implement a data dictionary, to check data for duplicates, or to find the intersection of sets.
Basic principles and key points:
The principle is simple: a bit array plus k independent hash functions. To insert an element, set the bit at each hash function's value to 1. To look up an element, check whether the bits at all k hash values are 1; if so, the element is probably present. Obviously this process does not guarantee that the lookup result is 100% correct. It also does not support deleting a keyword that has already been inserted, because the bits belonging to that keyword may be shared with other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and thus supports deletion.
Another important question is how to determine the size m of the bit array and the number of hash functions from the number of input elements n. The error rate is minimized when the number of hash functions is k = (ln 2) * (m/n). For the error rate to be no greater than E, m must be at least n * lg(1/E) to represent a set of any n elements (lg denotes the logarithm base 2). But m should be even larger, because at least half of the bit array should remain 0; then m should be >= n * lg(1/E) * lg(e), which is roughly 1.44 times n * lg(1/E).
For example, assuming an error rate of 0.01, m should be about 13 times n, and k is then about 8.
Note that m and n use different units: m is measured in bits, while n counts elements (more precisely, distinct elements). Since a single element is usually many bits long, using a Bloom filter normally saves memory.
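To make the mechanics above concrete, here is a minimal sketch of a standard Bloom filter in Python. The sizing follows the m and k formulas given above; deriving the k hash positions from md5 and sha1 via double hashing is an assumption of this sketch, not something prescribed by the text.

    import hashlib
    import math

    class BloomFilter:
        def __init__(self, n, error_rate=0.01):
            # m >= 1.44 * n * lg(1/E) bits, k = (ln 2) * (m / n) hash functions
            self.m = int(math.ceil(1.44 * n * math.log2(1 / error_rate)))
            self.k = max(1, int(round(math.log(2) * self.m / n)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            # Derive k hash positions from two base hashes (double hashing).
            data = item.encode("utf-8")
            h1 = int(hashlib.md5(data).hexdigest(), 16)
            h2 = int(hashlib.sha1(data).hexdigest(), 16)
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # True means "probably present"; False is always correct.
            return all(self.bits[pos // 8] >> (pos % 8) & 1
                       for pos in self._positions(item))

For example, bf = BloomFilter(n=1000); bf.add("http://example.com"); "http://example.com" in bf returns True, with a small chance of false positives for URLs never added.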
Extended:
The Bloom filter maps the elements of a set into a bit array and uses whether all k mapped bits (k being the number of hash functions) are 1 to indicate whether an element is in the set. The counting Bloom filter (CBF) expands each bit of the bit array into a counter, thereby supporting the deletion of elements. The Spectral Bloom Filter (SBF) associates it with the number of occurrences of each set element; SBF uses the minimum value among an element's counters to approximate its frequency of occurrence.
Problem Example: you are given two files, a and b, each storing 5 billion URLs, each URL taking 64 bytes, with a memory limit of 4 GB. Find the URLs common to files a and b. What if there are three or even n files?
Based on this problem let us work out the memory budget: 4 GB = 2^32 bytes, roughly 4 billion bytes, which is about 34 billion bits. With n = 5 billion and a required error rate of 0.01, we need roughly 65 billion bits. Only 34 billion are available, which is not far off, so the error rate may rise somewhat. In addition, if these URLs correspond one-to-one with IPs, they can be converted to IPs, which makes the problem much simpler.
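A quick back-of-the-envelope check of those numbers (the 13-bits-per-element figure is the m ≈ 13n rule of thumb for a 0.01 error rate stated above):

    n = 5_000_000_000                  # URLs in one file
    bits_needed = 13 * n               # ~13 bits per element at a 0.01 error rate
    bits_available = 4 * 2**30 * 8     # 4 GB of memory, expressed in bits
    print(f"{bits_needed:,}")          # 65,000,000,000
    print(f"{bits_available:,}")       # 34,359,738,368 (~34 billion)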
2. Hashing
Scope of application: a basic data structure for fast lookup and deletion; usually the total amount of data must fit in memory.
Basic principles and key points:
Choice of hash function: for strings, integers, permutations, there are specific corresponding hash methods.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.
Extended:
The d in d-left hashing means multiple. Let us first simplify the problem and look at 2-left hashing. 2-left hashing means splitting a hash table into two halves of equal length, called T1 and T2, and giving T1 and T2 each its own hash function, h1 and h2. When storing a new key, both hash functions are computed, giving two addresses h1[key] and h2[key]. You then check the position h1[key] in T1 and the position h2[key] in T2 to see which position already stores more (colliding) keys, and store the new key in the less loaded position. If the two are equally loaded, for example both positions are empty or both store one key, the new key is stored in the left sub-table T1, which is where the name 2-left comes from. When looking up a key, you must hash twice and check both positions.
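A minimal sketch of the 2-left insertion and lookup rule just described, assuming each slot is a small bucket (list) and using Python's built-in hash salted with the table name as the two hash functions (both are assumptions of this sketch):

    class TwoLeftHashTable:
        def __init__(self, half_size=1024):
            self.size = half_size
            self.t1 = [[] for _ in range(half_size)]  # left half
            self.t2 = [[] for _ in range(half_size)]  # right half

        def _addr(self, key):
            h1 = hash(("T1", key)) % self.size
            h2 = hash(("T2", key)) % self.size
            return h1, h2

        def insert(self, key):
            h1, h2 = self._addr(key)
            # Put the key in the less loaded bucket; ties go to the left table T1.
            if len(self.t2[h2]) < len(self.t1[h1]):
                self.t2[h2].append(key)
            else:
                self.t1[h1].append(key)

        def __contains__(self, key):
            h1, h2 = self._addr(key)
            # A lookup always hashes twice and checks both positions.
            return key in self.t1[h1] or key in self.t2[h2]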
Problem Example:
1). From massive log data, extract the IP that visited Baidu the most times on a given day.
The number of possible IPs is still limited, at most 2^32, so you can consider using a hash to put the IPs directly into memory and then do the counting.
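A minimal sketch of that counting step, assuming the log has already been reduced to one IP string per line in a file named access.log (a hypothetical name):

    from collections import Counter

    counts = Counter()
    with open("access.log") as f:
        for line in f:
            counts[line.strip()] += 1    # hash map from IP to visit count

    top_ip, top_hits = counts.most_common(1)[0]
    print(top_ip, top_hits)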
3. Bit-map
Scope of application: fast lookup, duplicate checking, and deletion of data; generally the data range is within about 10 times the range of an int.
Basic principles and key points: use a bit array to record whether certain elements exist, for example 8-digit phone numbers.
Extension: the Bloom filter can be seen as an extension of the bit-map.
Problem Example:
1) A file is known to contain some phone numbers, each number being 8 digits; count the number of distinct numbers.
8 digits go up to 99,999,999, so roughly 99M bits are needed, which is a bit more than 10 MB of memory.
2) Among 250 million integers, find the integers that appear only once (the non-repeated integers); there is not enough memory to hold all 250 million integers.
Extend the bit-map: use 2 bits to represent each number, with 0 meaning not seen, 1 meaning seen once, and 2 meaning seen twice or more. Alternatively, instead of 2 bits we can use two ordinary bit-maps to simulate this 2-bit-map.
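A minimal sketch of the 2-bit-per-value scheme, here sized for a 100-million-value range (about 25 MB; covering the full 2^32 int range would take 1 GB); the sample data is purely illustrative:

    RANGE = 100_000_000          # e.g. the 8-digit phone-number range

    # Two bits per value: 0 = not seen, 1 = seen once, 2 = seen twice or more.
    bitmap = bytearray((2 * RANGE + 7) // 8)

    def get(x):
        byte, shift = (2 * x) // 8, (2 * x) % 8
        return (bitmap[byte] >> shift) & 0b11

    def bump(x):
        v = get(x)
        if v < 2:                # saturate at 2 ("twice or more")
            byte, shift = (2 * x) // 8, (2 * x) % 8
            bitmap[byte] &= ~(0b11 << shift) & 0xFF
            bitmap[byte] |= (v + 1) << shift

    for x in [3, 7, 3, 9, 7, 7]:
        bump(x)

    appear_once = [x for x in (3, 7, 9) if get(x) == 1]   # -> [9]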
4. Heap
Scope of application: finding the top n items in massive data, where n is small enough that the heap fits in memory.
Basic principles and key points: a max-heap finds the n smallest, a min-heap finds the n largest. For example, to find the n smallest, we compare the current element with the largest element in the max-heap, and if it is smaller than that largest element, it replaces it. In this way the n elements we end up with are the smallest n. This suits a large data volume where we want the top n and n is relatively small, because a single scan yields all of the top n elements, which is very efficient.
Extension: a double heap, a max-heap combined with a min-heap, can be used to maintain the median.
Problem Example:
1) Find the largest 100 numbers among 1,000,000 numbers.
Use a min-heap of size 100.
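A minimal sketch of the min-heap-for-top-n-largest approach in Python; the random input is just for illustration, and heapq.heappushpop keeps the heap at exactly 100 elements:

    import heapq
    import random

    numbers = (random.randint(0, 10**9) for _ in range(1_000_000))

    heap = []                      # min-heap holding the 100 largest seen so far
    for x in numbers:
        if len(heap) < 100:
            heapq.heappush(heap, x)
        elif x > heap[0]:          # larger than the smallest of the current top 100
            heapq.heappushpop(heap, x)

    top100 = sorted(heap, reverse=True)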
5. Double-layer bucket division
Scope of application: the k-th largest element, the median, non-repeated or repeated numbers.
Basic principles and key points: because the element range is too large to use a direct addressing table, we narrow the range step by step through several rounds of partitioning until it finally falls within an acceptable range. The range can be reduced several times; two layers is just one example.
Extended:
Problem Example:
1). Among 250 million integers, find the integers that appear only once; memory is not enough to hold all 250 million integers.
This is a bit like the pigeonhole principle. There are 2^32 possible integers, so we can divide these 2^32 numbers into 2^8 regions (for example, a single file represents one region), separate the data into the different regions, and then each region can be solved directly with a bitmap. In other words, as long as there is enough disk space, this is easily solved.
2). 500 million ints; find their median.
This example is even more obvious than the one above. First we divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From the counts we can determine which region the median falls into, and also which rank within that region is exactly the median. Then on a second scan we only need to count the numbers that fall into that region.
In fact, if the values are not int but int64, we can reduce the problem to an acceptable size through three such rounds of partitioning. That is, first divide the int64 range into 2^24 regions and determine which rank within which region we need, then divide that region into 2^20 sub-regions and determine the rank within the right sub-region; the sub-region then contains only about 2^20 numbers, so a direct addressing table can be used for the final count.
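A minimal sketch of the two-pass median search for 32-bit unsigned ints described in example 2), assuming the data can be streamed twice via a caller-supplied read_ints() generator (a hypothetical helper):

    def median_two_pass(read_ints, total):
        # Pass 1: count how many values fall into each of 2^16 regions
        # (region = top 16 bits of the 32-bit value).
        buckets = [0] * (1 << 16)
        for x in read_ints():
            buckets[x >> 16] += 1

        # Find the region containing the median, and the median's rank inside it.
        target = (total - 1) // 2
        seen = 0
        for region, count in enumerate(buckets):
            if seen + count > target:
                rank_in_region = target - seen
                break
            seen += count

        # Pass 2: only count values that land in that region,
        # using the low 16 bits as a direct addressing table.
        inner = [0] * (1 << 16)
        for x in read_ints():
            if x >> 16 == region:
                inner[x & 0xFFFF] += 1

        seen = 0
        for low, count in enumerate(inner):
            seen += count
            if seen > rank_in_region:
                return (region << 16) | low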
6. Database indexing
Scope of application: insertion, deletion, update, and query over large data volumes.
Basic principles and key points: use the design and implementation methods of databases to handle insertion, deletion, update, and query over massive data.
Extended:
Problem Example:
7. Inverted index
Scope of application: search engines, keyword queries.
Basic principles and key points: why is it called an inverted index? It is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents.
Taking English as an example, here is the text to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We get the following inverted file index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
A retrieval with the conditions "what", "is" and "it" corresponds to the intersection of the sets.
The forward index was developed to store a list of the words of each document. Queries against a forward index tend to serve needs such as ordered, frequent full-text queries over each document and verifying each word within a document being checked. In a forward index the document occupies the central position; each document points to a sequence of the index entries it contains. That is, the document points to the words it contains, whereas in the inverted index the word points to the documents that contain it, so the inverted relationship is easy to see.
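A minimal sketch of building that inverted index and answering the intersection query above in Python; the lower-casing and whitespace tokenization are assumptions of this sketch:

    from collections import defaultdict

    docs = ["it is what it is", "what is it", "it is a banana"]

    index = defaultdict(set)            # word -> set of document ids
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)

    # Query "what", "is", "it": intersect the posting sets.
    result = index["what"] & index["is"] & index["it"]
    print(result)                       # {0, 1}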
Extended:
Problem Example: a document retrieval system that queries which files contain a given word, such as the common keyword search over academic papers.
8. External sorting
Scope of application: sorting and deduplicating big data.
Basic principles and key points: the merge step of external sorting, the principle of replacement selection with a loser tree, the optimal merge tree.
Extended:
Problem Example:
1). There is a 1 GB file in which each line is a word; a word is at most 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.
This data has a very obvious characteristic: words are up to 16 bytes, but with only 1 MB of memory a hash table is somewhat insufficient, so sorting can be used instead, with the memory serving as the input buffer.
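A minimal sketch of that approach: sort fixed-size chunks in memory, merge the sorted runs with heapq.merge, and count equal words as they stream past, keeping the top 100 in a small heap. The input path and the chunk size are assumptions of this sketch.

    import heapq
    import itertools
    import os
    import tempfile

    CHUNK_LINES = 50_000                # ~50k words of <=16 bytes stays near the 1 MB budget

    def sorted_runs(path):
        run_names = []
        with open(path) as f:
            while True:
                chunk = list(itertools.islice(f, CHUNK_LINES))
                if not chunk:
                    break
                chunk = [w if w.endswith("\n") else w + "\n" for w in chunk]
                chunk.sort()
                fd, name = tempfile.mkstemp(text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_names.append(name)
        return run_names

    def top_100(path):
        run_names = sorted_runs(path)
        readers = [open(name) for name in run_names]
        heap = []                                       # min-heap of (count, word), size <= 100
        for word, group in itertools.groupby(heapq.merge(*readers)):
            entry = (sum(1 for _ in group), word.strip())
            if len(heap) < 100:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heappushpop(heap, entry)
        for r in readers:
            r.close()
        for name in run_names:
            os.unlink(name)
        return sorted(heap, reverse=True)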
9. Trie tree
Scope of application: large data volume with many repeats, but the number of distinct items is small enough to fit in memory.
Basic principles and key points: the implementation approach, and how a node represents its children.
Extension: compressed implementations.
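A minimal sketch of a counting trie in Python, where each node represents its children with a dict (one possible child representation alluded to above); the count at a terminal node supports the frequency problems listed next.

    class TrieNode:
        __slots__ = ("children", "count")

        def __init__(self):
            self.children = {}   # character -> child node
            self.count = 0       # how many times this exact string was inserted

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1

        def count(self, word):
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return 0
            return node.count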
Problem Example:
1). There are 10 files of 1 GB each; every line of every file stores a user query, and the queries in each file may repeat. Sort the queries by their frequency.
2). 10 million strings, some of which are identical (duplicates); remove all duplicates and keep only the non-duplicated strings. How would you design and implement this?
3). Finding popular queries: the query strings have a high degree of repetition; although the total is 10 million, after removing duplicates there are no more than 3 million, each at most 255 bytes.
10. Distributed processing (MapReduce)
Scope of application: large data volume, but the number of data types is small enough to fit in memory.
Basic principles and key points: hand the data to different machines to process, partition the data, and reduce the results.
Extended:
Problem Example:
1). The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:
void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, 1);

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each v in partialCounts:
    result += ParseInt(v);
  Emit(result);
Here, each document is split into words, and each word is counted initially with a "1" value by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce, so this function just needs to sum all of its input values to find the total appearances of that word.
2). Massive data is distributed across 100 computers; think of a way to efficiently compute the TOP 10 of this data.
3). There are N machines in total, with N numbers on each machine. Each machine can store and operate on at most O(N) numbers. How do you find the median of the N^2 numbers?
Classic problem analysis
Tens of millions or billions of data items (with duplicates); count the top N items that occur most often, in two cases: the data can be read into memory at once, or it cannot.
Available ideas: trie tree + heap, database index, partitioning into subsets and counting separately, hashing, distributed computation, approximate counting, external sorting.
Whether the data can be read into memory at once should actually refer to the amount of data left after removing duplicates. If the deduplicated data fits in memory, we can build a dictionary for the data, for example via a map, hashmap, or trie, and then simply count. Of course, while updating each item's occurrence count, we can use a heap to maintain the top N items with the most occurrences, but this increases the maintenance cost and is less efficient than counting everything first and then taking the top N.
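A minimal sketch of the dictionary-then-top-N approach for the in-memory case, using a hash map for the counts and a heap-based selection for the top N (items is assumed to be any iterable of the data):

    from collections import Counter
    import heapq

    def top_n(items, n):
        counts = Counter(items)                       # dictionary: item -> occurrences
        # heapq.nlargest maintains a small n-element heap internally.
        return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

    print(top_n(["a", "b", "a", "c", "a", "b"], 2))   # [('a', 3), ('b', 2)]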
If the data cannot fit in memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation: the change would be to store the dictionary on disk instead of in memory, for which one can refer to the storage methods of databases.
Of course, there is a better way: use distributed computation, essentially a map-reduce process. First, based on the data values, or on the hash (MD5) of the data, partition the data by range onto different machines, ideally so that each partition can be read into memory at once; each machine is then responsible for processing its own range of values, which is in effect the map step. Once the results are obtained, each machine only needs to hand over its own top N items by occurrence count; these are then aggregated and the top N over all the data are selected, which is in effect the reduce step.
You might be tempted to simply spread the data evenly across the machines and process it there, but that cannot yield the correct answer. One item may be spread evenly over different machines while another may be entirely gathered on one machine, and items with equal counts may also exist. For example, suppose we want the 100 most frequent items and we spread 10 million items over 10 machines and find the top 100 on each machine. Merging these does not guarantee finding the true 100th item: the 100th most frequent item might occur 10,000 times but be split over the 10 machines, leaving only 1,000 occurrences on each; if the items ranked above it on each machine happen to be concentrated on single machines (say there are 1,001 of them), then the item that really has 10,000 occurrences gets eliminated. Even if we let each machine pick its 1,000 most frequent items and then merge, errors remain, because large numbers of items with, say, 1,001 occurrences may cluster together. Therefore the data must not be divided arbitrarily among the machines; instead, the items must be mapped to machines according to their hash values, so that each machine processes one range of values.
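A minimal sketch of the partition-by-hash idea, simulated in a single process with a list of partitions standing in for the machines; md5 is used as the hash as mentioned above, and the machine count and the merge step are assumptions of this sketch.

    import hashlib
    import heapq
    from collections import Counter

    NUM_MACHINES = 10

    def machine_for(item):
        # The same item always hashes to the same "machine", so its count is never split.
        digest = hashlib.md5(item.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_MACHINES

    def distributed_top_n(items, n):
        partitions = [[] for _ in range(NUM_MACHINES)]
        for item in items:                       # "map": route each item by hash of its value
            partitions[machine_for(item)].append(item)

        candidates = []
        for part in partitions:                  # each machine reports its local top n
            candidates.extend(Counter(part).most_common(n))

        return heapq.nlargest(n, candidates, key=lambda kv: kv[1])   # "reduce": merge and pick top n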
The external-sorting approach consumes a lot of IO and is not very efficient. The distributed method above can also be used on a single machine: divide the total data into a number of sub-files by value range and process them one by one. After processing, merge the words and their occurrence frequencies; in fact the merge pass of an external sort can be used for this.
It is also possible to consider approximate counting: by exploiting the properties of natural language, we can keep only the words that actually occur most often in practice as the dictionary, so that its size fits in memory.
Ideas for handling large-data-volume tasks in Java (unverified version; the concrete implementation method still needs to be tried out in practice).