A Summary of Algorithms for Mass Data Processing

Source: Internet
Author: User
Tags: arrays, bit set, hash, modulus, sort


1. Bloom Filter

"Bloom Filter"
A Bloom filter (BF) is a space-efficient probabilistic data structure that uses a bit array to represent a set very compactly and to test whether an element belongs to that set. It is a fast probabilistic algorithm for set membership. A Bloom filter can produce false positives but never false negatives: if it reports that an element is not in the set, the element is definitely not there; if it reports that the element is in the set, there is some probability that the answer is wrong. Bloom filters are therefore unsuitable for applications that require zero errors.

In applications that can tolerate a low error rate, a Bloom filter saves a great deal of space compared with other common approaches such as hash tables or binary search.

Bloom Filter Details: Bloom filter in massive data processing

"Scope of Application"
Can be used to implement a data dictionary, to deduplicate data, or to compute the intersection of sets.


"Fundamentals and Essentials"

Key points: a bit array and k independent hash functions.

1) bit array:

Assume the Bloom filter uses an array of m bits to hold its information. In the initial state every one of the m bits is set to 0, so the whole array is zero.


2) k independent hash functions

To represent a set S = {x1, x2, ..., xn} of n elements, the Bloom filter uses k independent hash functions, each of which maps every element of the set into the range {1, ..., m}.

When an element x is added to the Bloom filter, the k hash functions produce k hash values, and the corresponding bits in the array are set to 1. In other words, position hash_i(x) is set to 1 for each i (1 ≤ i ≤ k). If a position is set to 1 more than once, only the first time has any effect. In the accompanying figure, k = 3, and two of the hash functions select the same position (the fifth bit from the left, which is the second "1").


3) Determine whether an element is in the set

To determine whether y belongs to the set, we apply the k hash functions to y to obtain k hash values. If every position hash_i(y) (1 ≤ i ≤ k) is 1, that is, all k positions are set to 1, we consider y to be an element of the set; otherwise we consider y not to be an element of the set. In the accompanying figure, y1 is not in the set, because one of its positions points to a "0" bit, while y2 either belongs to the set or is a false positive.



Obviously this judgment does not guarantee that the result of the search is 100% correct.
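The three steps above (bit array, k hash positions, membership test) can be sketched in C as follows. This is a minimal illustration under stated assumptions, not a reference implementation: all names (`bloom_t`, `bloom_add`, and so on) are hypothetical, the k positions are derived from two FNV-1a hashes by double hashing instead of k truly independent functions, and positions run from 0 to m-1 rather than 1 to m.

```c
#include <stdint.h>
#include <stdlib.h>

/* All names below are hypothetical, chosen for this sketch. */
typedef struct {
    uint8_t *bits;  /* bit array of m bits, all 0 initially */
    size_t   m;     /* number of bits */
    unsigned k;     /* number of hash positions per element */
} bloom_t;

/* FNV-1a string hash; the seed perturbs the starting value. */
static uint64_t bloom_fnv1a(const char *s, uint64_t seed)
{
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Position of the i-th hash: (h1 + i*h2) mod m. This double-hashing
 * shortcut is an assumption standing in for k independent functions. */
static size_t bloom_pos(const bloom_t *bf, const char *key, unsigned i)
{
    uint64_t h1 = bloom_fnv1a(key, 0);
    uint64_t h2 = bloom_fnv1a(key, 0x9e3779b97f4a7c15ULL) | 1u;
    return (size_t)((h1 + i * h2) % bf->m);
}

/* Allocate a zeroed array of m bits (m > 0). Returns 0 on success. */
int bloom_init(bloom_t *bf, size_t m, unsigned k)
{
    bf->bits = calloc((m + 7) / 8, 1);
    bf->m = m;
    bf->k = k;
    return bf->bits != NULL ? 0 : -1;
}

/* Set the k bits selected by the key. */
void bloom_add(bloom_t *bf, const char *key)
{
    for (unsigned i = 0; i < bf->k; i++) {
        size_t p = bloom_pos(bf, key, i);
        bf->bits[p / 8] |= (uint8_t)(1u << (p % 8));
    }
}

/* Returns 0 if the key is definitely absent, 1 if it is possibly present. */
int bloom_contains(const bloom_t *bf, const char *key)
{
    for (unsigned i = 0; i < bf->k; i++) {
        size_t p = bloom_pos(bf, key, i);
        if ((bf->bits[p / 8] & (1u << (p % 8))) == 0)
            return 0;
    }
    return 1;
}

void bloom_free(bloom_t *bf)
{
    free(bf->bits);
    bf->bits = NULL;
}
```

A caller would run `bloom_init(&bf, m, k)`, call `bloom_add` for every element, and treat a return value of 1 from `bloom_contains` as "probably in the set".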

Disadvantages of Bloom Filter:

1) An element cannot be deleted from a Bloom filter, because the bits that belong to it may also be shared by other elements. A simple improvement is the counting Bloom filter, which supports deletion by replacing the bit array with an array of counters. In addition, the choice of hash functions affects how well the algorithm performs.

2) A more important question is how to choose the size m of the bit array and the number k of hash functions for a given number n of input elements. The error rate is minimized when k = ln(2) * (m/n). For the error rate to be no greater than E, m must be at least n * log2(1/E) to represent any set of n elements. In practice m should be larger still, because at least half of the bit array must remain 0: m >= n * log2(1/E) * log2(e), which is about 1.44 * n * log2(1/E) (log2 denotes the base-2 logarithm and e is the base of the natural logarithm).

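For example, with an error rate of 0.01 the formula above gives m ≈ 1.44 * log2(100) * n ≈ 9.6n and k ≈ 0.69 * 9.6 ≈ 7. Budgeting about 13 bits per element with k around 8, as this article does, leaves extra headroom.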

Note:

Here m and n have different units: m is a number of bits, while n is a number of elements (more precisely, the number of distinct elements). A single element is usually many bits long, so using a Bloom filter usually saves memory.

A Bloom filter is commonly used together with a key-value database to speed up queries. Because the filter takes very little space, it can reside entirely in memory. For most elements that do not exist, the in-memory filter alone is enough to give the answer; only for a small fraction of queries do we need to access the key-value database on disk. This greatly improves efficiency.

Extension
A Bloom filter maps each element of the set to k positions in the bit array (k being the number of hash functions); whether all k bits are 1 indicates whether the element may be in the set. The Counting Bloom Filter (CBF) expands each bit of the array into a counter, which makes it possible to delete elements. The Spectral Bloom Filter (SBF) associates the filter with the number of occurrences of each element: it uses the minimum value among the element's counters to approximate how often the element appears.
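The counter-based variants can be sketched as below. This is an illustrative sketch only: the names, the table size `CBF_M`, the count `CBF_K`, the 8-bit saturating counters, and the double-hashing scheme are all assumptions made for the example. `cbf_estimate` returns the smallest of the k counters, which is the idea the text attributes to SBF.

```c
#include <stddef.h>
#include <stdint.h>

#define CBF_M 1024u  /* number of counters (assumed size) */
#define CBF_K 3u     /* number of hash positions per key (assumed) */

/* Hypothetical type; the counters must start out zeroed. */
typedef struct { uint8_t count[CBF_M]; } cbf_t;

/* FNV-1a string hash with a seed. */
static uint64_t cbf_hash(const char *s, uint64_t seed)
{
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* i-th counter index via double hashing (an assumed shortcut). */
static size_t cbf_pos(const char *key, unsigned i)
{
    uint64_t h1 = cbf_hash(key, 0);
    uint64_t h2 = cbf_hash(key, 0x9e3779b97f4a7c15ULL) | 1u;
    return (size_t)((h1 + i * h2) % CBF_M);
}

/* Increment the k counters, saturating instead of wrapping around. */
void cbf_add(cbf_t *f, const char *key)
{
    for (unsigned i = 0; i < CBF_K; i++) {
        uint8_t *c = &f->count[cbf_pos(key, i)];
        if (*c < UINT8_MAX)
            (*c)++;
    }
}

/* Decrement the k counters. Only valid for keys that were added before;
 * saturated counters are left alone because their true value is unknown. */
void cbf_remove(cbf_t *f, const char *key)
{
    for (unsigned i = 0; i < CBF_K; i++) {
        uint8_t *c = &f->count[cbf_pos(key, i)];
        if (*c > 0 && *c < UINT8_MAX)
            (*c)--;
    }
}

/* Smallest of the k counters: 0 means definitely absent, otherwise an
 * estimate (never too low, barring saturation) of the occurrence count. */
unsigned cbf_estimate(const cbf_t *f, const char *key)
{
    unsigned min = UINT8_MAX;
    for (unsigned i = 0; i < CBF_K; i++) {
        unsigned c = f->count[cbf_pos(key, i)];
        if (c < min)
            min = c;
    }
    return min;
}

int cbf_contains(const cbf_t *f, const char *key)
{
    return cbf_estimate(f, key) > 0;
}
```

The price of deletion is space: each position now costs a whole counter (8 bits in this sketch) instead of a single bit.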


"Problem instance"
Given two files A and B, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, find the URLs common to A and B. What if there are three files, or even n files?
Let us work out the memory budget for this problem. 4 GB = 2^32 bytes, roughly 4 billion bytes; multiplied by 8 this is about 34 billion bits. With n = 5 billion and a required error rate of 0.01, about 65 billion bits are needed (at roughly 13 bits per element). Only 34 billion are available, which is not far off, so the error rate would rise somewhat. In addition, if the URLs correspond one-to-one with IP addresses, they can be converted to IPs, which makes the problem much simpler.
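The budget arithmetic for this problem can be checked with a few lines of integer code. The function names are hypothetical, and the figure of 13 bits per element is simply the ratio this article uses for a 0.01 error rate.

```c
#include <stdint.h>

/* Bits available in a memory budget given in bytes. */
uint64_t bits_available(uint64_t memory_bytes)
{
    return memory_bytes * 8u;
}

/* Bits a Bloom filter needs for n elements at a chosen bits-per-element ratio. */
uint64_t bits_required(uint64_t n, uint64_t bits_per_element)
{
    return n * bits_per_element;
}

/* Whole bits per element that the memory budget can actually afford. */
uint64_t bits_per_element_affordable(uint64_t memory_bytes, uint64_t n)
{
    return bits_available(memory_bytes) / n;
}
```

With 4 GB (`4ULL << 30` bytes) and 5 billion URLs, `bits_available` gives 34,359,738,368 bits, `bits_required(5000000000ULL, 13)` gives 65,000,000,000 bits, and the budget affords only 6 whole bits per element, which is why the error rate would be higher than 0.01.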


2. Hash

"What is a hash"  
Hash (also called a "hash function" or "hashing") transforms an input of arbitrary length (also known as the pre-image) into a fixed-length output through a hash algorithm; that output is the hash value. The transformation is a compression mapping: the space of hash values is usually much smaller than the input space, so different inputs may hash to the same output, and the input cannot be uniquely determined from the hash value. Put simply, a hash is a function that compresses a message of any length into a fixed-length message digest.
Hashing is widely used in information security, where information of varying length is turned into a scrambled fixed-length code (128 bits, for example) called the hash value. One can also say that hashing establishes a mapping between the content of the data and the address where the data is stored.
Arrays are easy to address but hard to insert into and delete from, while linked lists are hard to address but easy to insert into and delete from. Can we combine the strengths of both and build a data structure that is easy to address and also easy to insert into and delete from? The answer is yes: this is the hash table. Hash tables can be implemented in several ways; the most commonly used one is described here, the chaining method (literally the "zipper method"), which can be understood as an "array of linked lists", as shown in the figure:


                                                    
On the left is the array. Each member of the array holds a pointer to the head of a linked list, which may be empty or may contain many elements. Elements are assigned to different linked lists according to certain features of the elements; to look an element up, we use the same features to find the right linked list and then search that list for the element.
The method that turns an element's features into an array subscript is the hashing method.
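The "array of linked lists" described above might look like this in C. It is a minimal sketch: the type and function names are hypothetical, the table has a fixed 16 buckets, and the bucket is chosen with a simple modulus.

```c
#include <stdlib.h>

#define NBUCKETS 16u  /* fixed number of buckets (assumed) */

typedef struct node {
    int          value;
    struct node *next;
} node_t;

/* The array on the left of the figure: one list head per bucket.
 * All heads must start out as NULL. */
typedef struct { node_t *bucket[NBUCKETS]; } chain_table_t;

/* Turn the element into an array subscript. */
static unsigned bucket_of(int value)
{
    return (unsigned)value % NBUCKETS;
}

/* Insert at the head of the matching list. Returns 0 on success. */
int chain_insert(chain_table_t *t, int value)
{
    node_t *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    unsigned b = bucket_of(value);
    n->value = value;
    n->next = t->bucket[b];
    t->bucket[b] = n;
    return 0;
}

/* Find the right list, then walk it. Returns 1 if found, 0 otherwise. */
int chain_find(const chain_table_t *t, int value)
{
    for (const node_t *n = t->bucket[bucket_of(value)]; n != NULL; n = n->next)
        if (n->value == value)
            return 1;
    return 0;
}

/* Unlink and free the first matching node. Returns 1 if one was removed. */
int chain_remove(chain_table_t *t, int value)
{
    node_t **link = &t->bucket[bucket_of(value)];
    while (*link != NULL) {
        if ((*link)->value == value) {
            node_t *dead = *link;
            *link = dead->next;
            free(dead);
            return 1;
        }
        link = &(*link)->next;
    }
    return 0;
}
```

Values 1 and 17 both land in bucket 1 here, so they share one linked list; lookup cost depends on how long each list grows.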

There is more than one hashing method. Three of the more commonly used ones are listed below:
1) Division hashing (modulus)
The most intuitive method, and the one used in the figure above. The formula is:
index = value % 16
Anyone who has studied assembly knows that the modulus is actually obtained through a division operation, hence the name "division hashing".
2) Square hashing
Computing the index is a very frequent operation, and multiplication is cheaper than division (although on current CPUs the difference is barely noticeable), so we replace the division with a multiplication and a shift. The formula is:
index = (value * value) >> 28
This method gives good results when the values are fairly evenly distributed, but for the small element values in the figure above every index comes out as 0, which is a complete failure. You may wonder whether value * value overflows when value is large. It does, but the overflow does not matter here, because the goal is not the product itself but the index.
3) Fibonacci hashing
The weakness of square hashing is obvious, so can we find an ideal multiplier instead of using value itself as the multiplier? The answer is yes.
For 16-bit integers, the multiplier is 40503
For 32-bit integers, the multiplier is 2654435769
For 64-bit integers, the multiplier is 11400714819323198485
Where do these "ideal multipliers" come from? They are related to the golden ratio, whose most classical expression is the famous Fibonacci sequence: each multiplier is approximately 2^w divided by the golden ratio (about 1.618), where w is the word size in bits. If you are interested, search online for keywords such as "Fibonacci sequence"; my mathematics is limited and I cannot explain exactly why this works. The Fibonacci numbers are also said to match the orbital radii of the eight planets of the solar system surprisingly well, which is remarkable.
For the common case of 32-bit integers, the formula is:
index = (value * 2654435769) >> 28
With Fibonacci hashing, the figure above would look like this:


Fibonacci hashing clearly distributes the values much better than the original modulus hashing.
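The three index formulas can be compared directly in code. The sketch below assumes 32-bit unsigned values and a 16-slot table, so that a shift by 28 keeps the top 4 bits; the function names are hypothetical.

```c
#include <stdint.h>

/* Division hashing: index = value % 16. */
uint32_t division_index(uint32_t value)
{
    return value % 16u;
}

/* Square hashing: index = (value * value) >> 28.
 * The 32-bit product wraps around on overflow, which is harmless here. */
uint32_t square_index(uint32_t value)
{
    return (value * value) >> 28;
}

/* Fibonacci hashing for 32-bit integers:
 * index = (value * 2654435769) >> 28. */
uint32_t fibonacci_index(uint32_t value)
{
    return (value * 2654435769u) >> 28;
}
```

For the small values 1, 2, and 3, `square_index` returns 0 every time, while `fibonacci_index` returns 9, 3, and 13, which illustrates why the squared method fails on small inputs and the Fibonacci multiplier spreads them out.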
"Scope of Application"
A basic data structure for fast lookup and deletion; it usually requires that the whole data set fits in memory.
"Fundamentals and Essentials"
Choice of hash function: strings, integers, and permutations each have their own suitable hashing methods.
Collision handling:

One approach is open hashing, also known as the chaining (zipper) method;

The other is closed hashing, also known as open addressing.
Extension
The d in d-left hashing means "multiple". To simplify the problem, let us first look at 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, called T1 and T2, and gives each its own hash function, h1 and h2. When a new key is stored, both hash functions are evaluated, yielding two addresses h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which one already holds more keys (that is, has more collisions), and store the new key in the less loaded position. If both are equally loaded, for example both are empty or both hold one key, the new key is stored in the left sub-table T1, which is where the name 2-left comes from. To look up a key, both hashes must be computed and both positions checked.
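A small sketch of 2-left hashing as just described. Everything here is an assumption made for illustration: the names, the sizes `DL_BUCKETS` and `DL_CELLS`, and the two multiplicative hash functions standing in for h1 and h2.

```c
#include <stdint.h>

#define DL_BUCKETS 8u  /* positions per sub-table (assumed) */
#define DL_CELLS   4u  /* keys each position can hold (assumed) */

typedef struct {
    uint32_t key[DL_CELLS];
    unsigned used;       /* how many keys this position already holds */
} dl_bucket_t;

/* Two sub-tables of equal length, T1 (left) and T2. Must start zeroed. */
typedef struct {
    dl_bucket_t t1[DL_BUCKETS];
    dl_bucket_t t2[DL_BUCKETS];
} dleft_t;

/* Illustrative hash functions, one per sub-table; both return 0..7. */
static unsigned dl_h1(uint32_t key)
{
    return (key * 2654435769u) >> 29;
}

static unsigned dl_h2(uint32_t key)
{
    return ((key ^ (key >> 15)) * 2246822519u) >> 29;
}

/* Store key in the less loaded of its two candidate positions; ties go to
 * the left table T1. Returns 1 or 2 for the sub-table used, or -1 if both
 * candidate positions are full. */
int dleft_insert(dleft_t *d, uint32_t key)
{
    dl_bucket_t *b1 = &d->t1[dl_h1(key)];
    dl_bucket_t *b2 = &d->t2[dl_h2(key)];
    dl_bucket_t *target = (b2->used < b1->used) ? b2 : b1;
    if (target->used == DL_CELLS)
        return -1;
    target->key[target->used++] = key;
    return target == b1 ? 1 : 2;
}

/* Lookup must hash twice and check both positions. */
int dleft_find(const dleft_t *d, uint32_t key)
{
    const dl_bucket_t *b[2] = { &d->t1[dl_h1(key)], &d->t2[dl_h2(key)] };
    for (int t = 0; t < 2; t++)
        for (unsigned i = 0; i < b[t]->used; i++)
            if (b[t]->key[i] == key)
                return 1;
    return 0;
}
```

Inserting the same key repeatedly into an empty table shows the rule at work: the first copy goes to T1 (tie), the second to T2 (T1 is now more loaded), and the third to T1 again (tie).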
"Problem instance"
1) From massive log data, extract the IP address that visited Baidu the most times on a given day.
The number of distinct IPs is limited, at most 2^32, so one option is to hash the IPs directly into memory and then count them.
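The counting step can be sketched with a small open-addressing table keyed by the 32-bit IP. This is a toy version under loud assumptions: the names are hypothetical, the table has only 1024 slots (a real log would need far more, or a split into smaller files first), and a count of 0 is used to mark an empty slot.

```c
#include <stddef.h>
#include <stdint.h>

#define IP_SLOTS 1024u  /* power of two; assumed larger than the number of distinct IPs */

typedef struct {
    uint32_t ip;
    uint32_t count;  /* 0 marks an empty slot */
} ip_slot_t;

/* Must start zeroed. */
typedef struct { ip_slot_t slot[IP_SLOTS]; } ip_counter_t;

/* Fibonacci hashing: keep the top 10 bits, giving an index in 0..1023. */
static uint32_t ip_slot_index(uint32_t ip)
{
    return (ip * 2654435769u) >> 22;
}

/* Count one visit from ip, probing linearly on collisions.
 * Returns 0 on success, -1 if the table is full. */
int ip_count_add(ip_counter_t *c, uint32_t ip)
{
    uint32_t i = ip_slot_index(ip);
    for (uint32_t probes = 0; probes < IP_SLOTS; probes++) {
        ip_slot_t *s = &c->slot[i];
        if (s->count == 0) {          /* empty slot: first visit from this IP */
            s->ip = ip;
            s->count = 1;
            return 0;
        }
        if (s->ip == ip) {            /* already present: bump its count */
            s->count++;
            return 0;
        }
        i = (i + 1) & (IP_SLOTS - 1); /* try the next slot, wrapping around */
    }
    return -1;
}

/* Scan the table for the IP with the highest count.
 * On an empty table, returns 0 and stores 0 in *top_count. */
uint32_t ip_count_top(const ip_counter_t *c, uint32_t *top_count)
{
    uint32_t best_ip = 0;
    *top_count = 0;
    for (size_t i = 0; i < IP_SLOTS; i++) {
        if (c->slot[i].count > *top_count) {
            *top_count = c->slot[i].count;
            best_ip = c->slot[i].ip;
        }
    }
    return best_ip;
}
```

Feeding every log line's IP to `ip_count_add` and then calling `ip_count_top` once yields the most frequent visitor in a single pass plus one table scan.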

3. Bit-map

"What is Bit-map?"
A bit-map uses a single bit to mark the value corresponding to an element, with the element itself serving as the key. Because data is stored in units of bits, the storage space can be greatly reduced.

If that is still not clear, consider a concrete example. Suppose we want to sort 5 elements (4, 7, 2, 5, 3) that lie in the range 0 to 7 (assuming no duplicates). We can do the sorting with a bit-map. To represent 8 numbers we need only 8 bits (1 byte). First we allocate 1 byte of space and set all of its bits to 0 (as shown below):


We then traverse the 5 elements. The first element is 4, so the bit corresponding to 4 is set to 1 (this can be done with *(p + i/8) |= (0x01 << (i % 8)); the operation depends on the bit order within the byte, i.e. big-endian versus little-endian, and big-endian is assumed here). Because numbering starts from zero, this sets the fifth bit to 1 (as shown below):


Next the second element, 7, is processed and the eighth bit is set to 1; then the third element, and so on, until all elements have been processed and their corresponding bits set to 1. The memory bits then look like this:



Finally we traverse the bit region and output the index of every bit that is set (2, 3, 4, 5, 7), which yields the sorted order. The following code shows this use of a bitmap for sorting.

C code

    #include <string.h>

    /* Each byte holds 8 bits. */
    #define BYTESIZE 8

    /* Set bit number posi (counting from 0) in the bitmap that starts at p. */
    void setbit(char *p, int posi)
    {
        p += posi / BYTESIZE;                    /* move to the byte that holds the bit */
        *p = *p | (0x01 << (posi % BYTESIZE));   /* set that bit to 1 */
    }

    void bitmapsortdemo()
    {
        /* For simplicity, negative numbers are not considered. */
        /* ... */
    }
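The sorting procedure walked through in this section (clear the bits, set one bit per element, then scan the bits in order) can be sketched end to end as follows. This is an independent illustration, not the remainder of the listing above: the function names are hypothetical, values are assumed to be distinct and non-negative, and bit i of a byte is taken to be `1 << i`.

```c
#include <stddef.h>
#include <string.h>

#define BITS_PER_BYTE 8u

/* Set the bit that represents value v. */
static void bitmap_set(unsigned char *map, unsigned v)
{
    map[v / BITS_PER_BYTE] =
        (unsigned char)(map[v / BITS_PER_BYTE] | (1u << (v % BITS_PER_BYTE)));
}

/* Test the bit that represents value v. */
static int bitmap_test(const unsigned char *map, unsigned v)
{
    return (map[v / BITS_PER_BYTE] >> (v % BITS_PER_BYTE)) & 1u;
}

/* Sort n distinct values from in[] that lie in [0, max_value].
 * map must provide max_value / 8 + 1 bytes and out[] room for n values.
 * max_value is assumed to be smaller than UINT_MAX.
 * Returns the number of values written to out[]. */
size_t bitmap_sort(const unsigned *in, size_t n, unsigned max_value,
                   unsigned char *map, unsigned *out)
{
    size_t count = 0;

    /* Step 1: all bits start at 0. */
    memset(map, 0, max_value / BITS_PER_BYTE + 1);

    /* Step 2: mark each element; out-of-range values are skipped. */
    for (size_t i = 0; i < n; i++)
        if (in[i] <= max_value)
            bitmap_set(map, in[i]);

    /* Step 3: scan the bits in order and emit the index of every set bit. */
    for (unsigned v = 0; v <= max_value; v++)
        if (bitmap_test(map, v))
            out[count++] = v;

    return count;
}
```

For the example input (4, 7, 2, 5, 3) with `max_value` 7, a single byte of bitmap is enough and the output is 2, 3, 4, 5, 7.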
