[Algorithm series, part 10] Big Data processing tools: Bloom filter

Source: Internet
Author: User

[Introduction] In daily life and in software design, we often need to determine whether an element is in a set. For example, a word processor needs to check whether an English word is spelled correctly (that is, whether it is in a known dictionary); the FBI needs to check whether a suspect's name is already on the suspect list; a web crawler needs to check whether a URL has already been visited; and so on. The most direct method is to store all elements of the set in the computer and compare each new element against them. A set is generally stored in a computer as a hash table. Its advantage is that it is fast and accurate; its disadvantage is that it wastes storage space. This is not a significant problem when the set is small, but when the set is large, the low storage efficiency of the hash table becomes apparent. For example, a public email provider such as Yahoo, Hotmail, or Gmail always needs to filter out spam sent by spammers. One way is to record the email addresses that send spam. Because spammers keep registering new addresses, there are at least several billion spam addresses worldwide, and storing them all requires a large number of servers. With a hash table, every 100 million email addresses require 1.6 GB of memory. (A typical implementation converts each email address into an eight-byte information fingerprint (see the discussion of information fingerprints in The Beauty of Mathematics) and stores the fingerprint in the hash table. Because the storage efficiency of a hash table is generally only about 50%, each email address occupies 16 bytes, so 100 million addresses take about 1.6 GB, that is, 1.6 billion bytes of memory.) Storing several billion email addresses may therefore require on the order of a hundred GB of memory, which an ordinary server cannot hold unless it is a supercomputer.
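The memory estimate above can be checked with a few lines of arithmetic. This is only a rough sketch under the assumptions stated in the text (an 8-byte fingerprint per address and a table that is about 50% full); the constant and function names are illustrative.

```python
# Back-of-the-envelope estimate of the hash table memory described in the text.
FINGERPRINT_BYTES = 8   # each address is reduced to an 8-byte fingerprint
LOAD_FACTOR = 0.5       # assumption from the text: the table is only about 50% full


def hash_table_bytes(num_addresses: int) -> int:
    """Approximate bytes needed to store one fingerprint per address."""
    bytes_per_address = FINGERPRINT_BYTES / LOAD_FACTOR  # 16 bytes per address
    return int(num_addresses * bytes_per_address)


hundred_million = hash_table_bytes(100_000_000)
print(hundred_million)  # 1600000000 bytes, about 1.6 GB
```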
Today, we introduce a mathematical tool called the Bloom filter. It needs only 1/8 to 1/4 of the space of a hash table to solve the same problem. (The Beauty of Mathematics)

[Overview]

The Bloom filter was proposed by Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions.

The Bloom filter can be used to test whether an element is in a set.

Its advantage is that its space efficiency and query time are far better than those of ordinary algorithms. Its disadvantages are a certain false positive rate and the difficulty of deleting elements.

[Working principle]

We use the example of the email above to describe how it works.

Suppose we need to store 100 million email addresses. First, create a vector of 1.6 billion binary bits, that is, 200 million bytes, and clear all 1.6 billion bits to 0.
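As a minimal sketch of such a bit vector, the snippet below packs 8 bits into each byte of a Python `bytearray`. The helper names are hypothetical, and the demo is scaled down to 1,600 bits so that it runs instantly.

```python
# A bit vector backed by a bytearray: 8 bits are packed into each byte.
def make_bit_vector(num_bits: int) -> bytearray:
    """Allocate num_bits bits, all cleared to 0."""
    return bytearray((num_bits + 7) // 8)


def set_bit(bits: bytearray, i: int) -> None:
    """Set bit i to 1."""
    bits[i // 8] |= 1 << (i % 8)


def get_bit(bits: bytearray, i: int) -> bool:
    """Return True if bit i is 1."""
    return bool(bits[i // 8] & (1 << (i % 8)))


# Scaled-down demo: 1,600 bits instead of 1.6 billion.
bits = make_bit_vector(1_600)
print(len(bits))                             # 200 bytes, the same 8:1 ratio as in the text
set_bit(bits, 42)
print(get_bit(bits, 42), get_bit(bits, 43))  # True False
```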

For each email address X, use eight different random number generators (F1, F2, ..., F8) to generate eight information fingerprints (f1, f2, ..., f8).

Then use a random number generator G to map these eight information fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion, and set the bits at all eight positions to 1. After all 100 million email addresses have been processed in this way, the Bloom filter for these email addresses is built.


Now let's see how to use the Bloom filter to check whether a suspicious email address Y is in the blacklist. Use the same eight random number generators (F1, F2, ..., F8) to generate eight information fingerprints (s1, s2, ..., s8) for this address, and then map these eight fingerprints to eight bits of the Bloom filter, t1, t2, ..., t8.

If Y is in the blacklist, the eight bits at t1, t2, ..., t8 must all be 1. In this way, any email address that is in the blacklist is always found.


Put simply, the principle is straightforward: a bit array plus k different hash functions. To add an element, set the bits at the positions given by each hash function to 1. To query an element, check those same positions: if all of them are 1, the element is considered to be in the set.
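The procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the implementation from the source: the class name, the method names, and the choice of SHA-256 salted with an index to play the role of the k hash functions are all assumptions, and the filter is scaled down to 1,600 bits for 100 addresses (the same 16 bits per address as in the email example).

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: the k "hash functions" are SHA-256 salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at every position chosen by the k hash functions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True only if every one of the k positions is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


blacklist = BloomFilter(m=1_600, k=8)  # scaled down: 16 bits per address
for n in range(100):
    blacklist.add(f"spammer{n}@example.com")

print(blacklist.might_contain("spammer7@example.com"))  # True: added items are always found
```

A `True` answer for an address that was never added is possible (a false positive), which is why the method is named `might_contain` rather than `contains`.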

[Set expression and element query]

Next, let's take a look at how the bloom filter uses a bit array to represent a set. In the initial state, the bloom filter is an array containing m bits, each of which is set to 0.

To represent a set S = {x1, x2, ..., xn} of n elements, the Bloom filter uses k mutually independent hash functions, each of which maps every element in the set into the range {1, ..., m}.

For any element x, the position hi(x) given by the i-th hash function is set to 1 (1 ≤ i ≤ k).

Note: If a position is set to 1 multiple times, only the first time has any effect; later attempts change nothing.

In the original figure (not reproduced here), k = 3, and two hash functions select the same position (the fifth bit from the left).

To determine whether y belongs to the set, we apply the k hash functions to y. If all positions hi(y) (1 ≤ i ≤ k) are 1, we consider y to be an element of the set; otherwise, y is not an element of the set.

In the original figure, y1 is not an element of the set; y2 either belongs to the set or is a false positive.
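The insert and query rules can be traced by hand on a toy example. In the sketch below, the 12-bit array and the hash positions for x1, x2, y1, and y2 are made-up values chosen for illustration (they are not taken from the original figure); positions are 0-based, so position 4 is the fifth bit from the left.

```python
M = 12  # hypothetical tiny bit array
bits = [0] * M

# Hypothetical outputs of k = 3 hash functions for each element.
POSITIONS = {
    "x1": (1, 4, 8),
    "x2": (4, 6, 10),  # shares position 4 with x1
    "y1": (1, 3, 8),   # position 3 is never set by x1 or x2
    "y2": (4, 6, 8),   # every position was set by x1 or x2
}


def insert(name: str) -> None:
    for pos in POSITIONS[name]:
        bits[pos] = 1  # setting an already-set bit changes nothing


def query(name: str) -> bool:
    return all(bits[pos] == 1 for pos in POSITIONS[name])


insert("x1")
insert("x2")
print(bits)         # [0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(query("y1"))  # False: y1 is definitely not in the set
print(query("y2"))  # True: a false positive, since y2 was never inserted
```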

[Error rate estimation]

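(The Beauty of Mathematics)

For a bit array of m bits, n inserted elements, and k hash functions, the false positive rate p is approximately:

$$p \approx \left(1 - e^{-kn/m}\right)^{k}$$

From this formula, we can see that p is minimized when:

$$k = \ln 2 \cdot \frac{m}{n}$$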
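The effect of k on the error rate can be checked numerically. The sketch below assumes the standard approximation p ≈ (1 - e^(-kn/m))^k and uses the sizes from the email example (16 bits per element); the function name is illustrative.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive rate (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n = 1_600_000_000, 100_000_000  # 16 bits per element, as in the email example
rates = {k: false_positive_rate(m, n, k) for k in range(1, 21)}
best_k = min(rates, key=rates.get)

print(math.log(2) * m / n)  # about 11.09, the real-valued optimum
print(best_k)               # 11, the best integer k
print(rates[8])             # rate with the 8 hash functions used in the email example
```

Under this approximation, the 8 hash functions of the email example give a rate slightly above the minimum, which is reached at the integer k closest to ln 2 · m/n.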


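Given the number of input elements n, we need to determine the size m of the bit array and the number of hash functions k.

The error rate is minimized when the number of hash functions is k = ln 2 · (m/n). With this choice, each bit of the array is 0 with probability 1/2.

When the error rate p must not exceed E:

$$p = \left(\frac{1}{2}\right)^{k} = 2^{-\ln 2 \cdot m/n} \le E$$

From this we can derive:

$$m \ge n \log_2(1/E) \cdot \log_2 e \approx 1.44\, n \log_2(1/E)$$

If the error rate must not exceed E, m must be at least n log2(1/E) to represent any set of n elements.

However, m should be larger, because at least half of the bit array must remain 0. Therefore m should be greater than or equal to n log2(1/E) · log2(e), which is approximately 1.44 times n log2(1/E).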
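These sizing rules can be turned into a small helper. This is a sketch under the formulas above (m ≥ 1.44 · n · log2(1/E) and k = ln 2 · m/n, rounded to an integer); the function name and the 0.1% target error rate are assumptions chosen for illustration.

```python
import math


def bloom_parameters(n: int, max_error: float) -> tuple:
    """Return (m, k): bits needed and hash function count for n elements."""
    # m >= n * log2(1/E) * log2(e), i.e. about 1.44 * n * log2(1/E)
    m = math.ceil(n * math.log2(1.0 / max_error) / math.log(2))
    # k = ln 2 * m / n, rounded to the nearest integer (at least 1)
    k = max(1, round(math.log(2) * m / n))
    return m, k


m, k = bloom_parameters(n=100_000_000, max_error=0.001)
print(m, k)         # roughly 1.44 billion bits and k = 10
print(m / 8 / 1e6)  # size in MB, roughly 180
```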

The mathematical principle behind the Bloom filter is that the probability of two completely random numbers colliding is very small. Therefore, a large amount of information can be stored in a small space with a very low false positive rate.

[Applicability]

It can be used to implement a data dictionary, to detect duplicate data, or to compute the intersection of sets.










