The Beauty of Mathematics Series 21: Bloom Filter

Source: Internet
Author: User

Original: http://googlechinablog.blogspot.com/2007/07/bloom-filter_7469.html

In daily life, and in the design of computer software, we often have to determine whether an element is in a set. For example, word processing software needs to check whether an English word is spelled correctly (that is, whether it is in a known dictionary); the FBI needs to check whether a suspect's name is already on a list of suspects; a web crawler needs to check whether a URL has already been visited; and so on. The most straightforward approach is to store all the elements of the set in the computer and, when a new element arrives, compare it directly against the elements in the set. In computers, a set is generally stored in a hash table. Its advantage is that it is fast and accurate; its disadvantage is that it costs storage space. This is not a problem when the set is relatively small, but when the set is huge, the low storage efficiency of the hash table becomes apparent.

For example, a public email provider such as Yahoo, Hotmail, or Gmail always needs to filter out mail from people who send spam (spammers). One way to do this is to keep a record of the email addresses that send spam. Since those senders are constantly registering new addresses, there are at least billions of spam addresses around the world, and it takes a lot of network servers to store them all. With hash tables, every 100 million email addresses stored requires 1.6GB of memory. (The specific way to implement this with a hash table is to map each email address to an eight-byte information fingerprint (see googlechinablog.com/2006/08/blog-post.html) and then store that fingerprint in the hash table. Because the storage efficiency of a hash table is generally only about 50%, each email address occupies 16 bytes, so 100 million addresses take approximately 1.6GB, that is, 1.6 billion bytes of memory.) Therefore, storing billions of email addresses may require hundreds of gigabytes of memory. An ordinary server cannot hold that much in memory; only something like a supercomputer can.
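The memory estimate above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and the 10 billion figure is an assumed stand-in for "billions of addresses", not a number from the article.

```python
ADDRESSES = 100_000_000        # 100 million addresses, as in the text
FINGERPRINT_BYTES = 8          # one eight-byte fingerprint per address
STORAGE_EFFICIENCY = 0.5       # hash table is only about 50% full

# Effective cost per address once the empty slots are counted.
bytes_per_address = FINGERPRINT_BYTES / STORAGE_EFFICIENCY


def hash_table_gb(n_addresses):
    """Approximate hash-table memory in GB (1 GB = 10**9 bytes here)."""
    return n_addresses * bytes_per_address / 1e9


print(bytes_per_address)                 # bytes per address
print(hash_table_gb(ADDRESSES))          # GB for 100 million addresses
print(hash_table_gb(10_000_000_000))     # GB for a hypothetical 10 billion
```

Under these assumptions each address costs 16 bytes, 100 million addresses cost about 1.6GB, and 10 billion addresses cost about 160GB, which is the "hundreds of gigabytes" range the text mentions.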

Today, we introduce a mathematical tool called the Bloom filter, which needs only 1/8 to 1/4 of the space of a hash table to solve the same problem.

The Bloom filter was proposed by Burton Bloom in 1970. It is actually a very long binary vector together with a series of random mapping functions. We use the example above to illustrate how it works.

Suppose we want to store 100 million email addresses. We first set up a vector of 1.6 billion binary digits (bits), that is, 200 million bytes, and set all 1.6 billion bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to generate eight information fingerprints (f1, f2, ..., f8). We then use a random number generator G to map these eight fingerprints to eight natural numbers g1, g2, ..., g8 between 1 and 1.6 billion. Now we set the bits at all eight of these positions to one. Once we have done this for all 100 million email addresses, a Bloom filter for these addresses has been built.
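The construction step can be sketched in Python as follows. This is a minimal illustration, not the article's implementation: the function names are hypothetical, salted SHA-256 digests stand in for the eight generators F1..F8, a modulo stands in for G, and bit positions are 0-based rather than counted from 1. The demo uses a tiny vector so it does not allocate the full 200 million bytes.

```python
import hashlib

M_BITS = 1_600_000_000   # vector length from the text (1.6 billion bits)
K = 8                    # number of fingerprints per address


def fingerprints(address, k=K):
    """Stand-in for F1..F8: k different 8-byte fingerprints of one address."""
    data = address.encode("utf-8")
    # Prefixing a different byte for each i gives k different hash functions.
    return [hashlib.sha256(bytes([i]) + data).digest()[:8] for i in range(k)]


def positions(address, m=M_BITS, k=K):
    """Stand-in for G: map each fingerprint to a bit index in [0, m)."""
    return [int.from_bytes(fp, "big") % m for fp in fingerprints(address, k)]


def build_filter(addresses, m, k=K):
    """Return a bit vector (as a bytearray) with k bits set per address."""
    bits = bytearray((m + 7) // 8)          # all bits start at zero
    for address in addresses:
        for p in positions(address, m, k):
            bits[p // 8] |= 1 << (p % 8)    # set bit p to one
    return bits


# Small demo with made-up addresses and a 16,000-bit vector.
demo_bits = build_filter(["spam1@example.com", "spam2@example.com"], m=16_000)
print(positions("spam1@example.com", m=16_000))
```

With the article's parameters the vector takes 200 million bytes, which is 1/8 of the 1.6GB hash table for the same 100 million addresses.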

Now, let's see how to use the Bloom filter to check whether a suspicious email address Y is on the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to generate eight information fingerprints s1, s2, ..., s8 for this address, and then map these eight fingerprints to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is on the blacklist, the bits at t1, t2, ..., t8 must obviously all be one. In this way, any email address that is on the blacklist is detected without fail.
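The lookup can be sketched with a small self-contained class. As before, this is an illustration under assumptions: the class and method names are hypothetical, salted SHA-256 digests play the role of the eight generators, and the demo filter is far smaller than the 1.6 billion bits in the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit vector plus k hash functions."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)   # all zeros

    def _positions(self, item):
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # True only if every one of the k bits is set. A single zero bit
        # proves the item was never added.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


blacklist = BloomFilter(m_bits=16_000, k=8)
for addr in ("spam1@example.com", "spam2@example.com"):
    blacklist.add(addr)

print(blacklist.might_contain("spam1@example.com"))   # always True once added
```

The method is named `might_contain` rather than `contains` on purpose: a True answer means "possibly on the blacklist", while a False answer is definite.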

The Bloom filter never misses any suspicious address that is on the blacklist. However, it has one shortcoming: there is a very small chance that it identifies an email address that is not on the blacklist as blacklisted, because a good email address may happen to correspond to eight bits that have all been set to one by other addresses. Fortunately, this possibility is very small. We call it the false positive rate. In the above example, the standard estimate (1 - e^(-kn/m))^k, with n = 100 million addresses, m = 1.6 billion bits, and k = 8, puts the false positive rate at roughly 6 in 10,000.
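The false positive rate for the example can be estimated with the widely used approximation p = (1 - e^(-kn/m))^k, where n is the number of stored items, m the number of bits, and k the number of hash functions. The sketch below assumes the hash functions behave like independent, uniformly random mappings; the function name is made up for illustration.

```python
import math


def false_positive_rate(n_items, m_bits, k_hashes):
    """Approximate Bloom filter false positive rate: (1 - e^(-kn/m))^k."""
    fraction_of_ones = 1.0 - math.exp(-k_hashes * n_items / m_bits)
    return fraction_of_ones ** k_hashes


# Parameters from the example: 100 million addresses, 1.6 billion bits, 8 hashes.
p = false_positive_rate(100_000_000, 1_600_000_000, 8)
print(f"{p:.2e}")
```

For these parameters the estimate comes out at roughly 5.7 in 10,000. The rate depends only on the ratio m/n and on k, so doubling both the vector length and the number of addresses leaves it unchanged.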

The advantage of the Bloom filter is that it is fast and saves space, but it has a certain false positive rate. A common remedy is to build a small whitelist that stores the email addresses that might be misjudged.
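The whitelist remedy amounts to checking the exceptions first and consulting the filter only afterwards. The following sketch is hypothetical: `is_spam`, the addresses, and the stand-in filter function are all made up for illustration, and a plain Python set plays the part of a filter that produces one false positive.

```python
def is_spam(address, might_be_blacklisted, whitelist):
    """Return True if the address should be treated as a spam sender.

    might_be_blacklisted: a function that answers like a Bloom filter
        lookup (it may return True for an address that was never added).
    whitelist: a small set of known-good addresses the filter misjudges.
    """
    if address in whitelist:
        return False                      # known false positive, let it through
    return might_be_blacklisted(address)  # otherwise trust the filter


# Stand-in for a Bloom filter that wrongly flags friend@example.com.
flagged = {"spammer@example.com", "friend@example.com"}


def fake_filter(address):
    return address in flagged


whitelist = {"friend@example.com"}

print(is_spam("spammer@example.com", fake_filter, whitelist))
print(is_spam("friend@example.com", fake_filter, whitelist))
```

Because false positives are rare, the whitelist stays small and can be kept in an exact structure such as a hash table without giving back the space the filter saved.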
