Bloom Filter

Source: Internet
Author: User

In daily life, including when designing computer software, we often need to determine whether an element is in a set. For example, word processing software needs to check whether an English word is spelled correctly (that is, whether it is in a known dictionary); the FBI needs to check whether a suspect's name is already on the suspect list; a web crawler needs to check whether a URL has already been visited; and so on. The most direct method is to store all the elements of the set in the computer and compare each new element against them. In general, a set is stored in a computer as a hash table. Its advantage is that it is fast and accurate; its disadvantage is that it wastes storage space. This is not a significant problem when the set is small, but when the set is large, the low storage efficiency of the hash table becomes apparent.

For example, a public email provider such as Yahoo, Hotmail, or Gmail always needs to filter out spam sent by spammers. One way is to record the email addresses that send spam. Because the senders keep registering new addresses, there are at least billions of spam addresses worldwide, and storing them all would require a large number of network servers. With a hash table, every 100 million email addresses require 1.6 GB of memory. (A typical implementation converts each email address into an eight-byte information fingerprint and stores it in the hash table; because the storage efficiency of a hash table is generally only about 50%, each email address occupies 16 bytes, so 100 million addresses take about 1.6 GB, that is, 1.6 billion bytes of memory.) Storing billions of email addresses may therefore require hundreds of GB of memory, which ordinary servers cannot hold unless they are supercomputers.
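The hash table estimate can be checked with a few lines of arithmetic. This is only a sketch of the calculation described in the text, assuming an 8-byte fingerprint per address and 50% storage efficiency; the constant names are illustrative.

```python
# Back-of-the-envelope memory estimate for the hash table approach.
FINGERPRINT_BYTES = 8          # fingerprint size per address (assumed, per the text)
STORAGE_EFFICIENCY = 0.5       # fraction of hash table slots actually used (assumed)
ADDRESSES = 100_000_000        # 100 million email addresses

# Each stored fingerprint effectively costs twice its size at 50% efficiency.
bytes_per_address = FINGERPRINT_BYTES / STORAGE_EFFICIENCY
hash_table_bytes = ADDRESSES * bytes_per_address

print(f"bytes per address: {bytes_per_address:.0f}")
print(f"total: {hash_table_bytes / 1e9:.1f} GB")
```

Scaling `ADDRESSES` up to several billion shows why the total quickly reaches tens to hundreds of GB.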

Today, we will introduce a mathematical tool called the Bloom filter. It needs only 1/8 to 1/4 of the space of a hash table to solve the same problem.

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions. We use the preceding example to show how it works.

Suppose we store 100 million email addresses. First we create a vector of 1.6 billion binary digits (bits), that is, a 200-million-byte vector, and set all 1.6 billion bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to generate eight information fingerprints (f1, f2, ..., f8). We then use a random number generator G to map these eight fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion, and set the bits at these eight positions to one. After all 100 million email addresses have been processed in this way, the Bloom filter for these addresses is built.
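The construction step can be sketched in Python as follows. This is a minimal illustration, not the method of any particular mail provider: salted SHA-256 stands in for the eight random number generators, a modulo operation stands in for the mapping G, positions are 0-based rather than 1-based, and the vector is scaled down to 1,600 bits so the example runs instantly. The names `positions` and `build_filter` are hypothetical.

```python
import hashlib
from typing import Iterable, List

M_BITS = 1_600   # scaled down from 1.6 billion bits for demonstration
K = 8            # number of fingerprints per address, as in the text


def positions(address: str, m: int = M_BITS, k: int = K) -> List[int]:
    """Return k bit positions in [0, m) for the given address."""
    result = []
    for i in range(k):
        # Salting SHA-256 with i is an assumed stand-in for generator F_i.
        digest = hashlib.sha256(f"{i}:{address}".encode("utf-8")).digest()
        fingerprint = int.from_bytes(digest[:8], "big")  # 8-byte fingerprint f_i
        result.append(fingerprint % m)                   # stand-in for G: f_i -> g_i
    return result


def build_filter(addresses: Iterable[str], m: int = M_BITS, k: int = K) -> bytearray:
    """Create an all-zero bit vector and set k bits for every address."""
    bits = bytearray((m + 7) // 8)  # bytearray starts out as all zeros
    for address in addresses:
        for pos in positions(address, m, k):
            bits[pos // 8] |= 1 << (pos % 8)
    return bits


bits = build_filter(["spam1@example.com", "spam2@example.com"])
print(len(bits), "bytes in the bit vector")
```

Because the positions depend only on the address, the same address always maps to the same eight bits, which is what makes later lookups possible.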

Now let's see how to use the Bloom filter to check whether a suspicious email address Y is on the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to generate eight information fingerprints for this address, s1, s2, ..., s8, and then map these eight fingerprints to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is on the blacklist, the eight bits at t1, t2, ..., t8 must all be one. In this way, any email address on the blacklist is guaranteed to be found.
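Putting construction and lookup together, a small Bloom filter might look like the sketch below. It assumes salted SHA-256 hashes in place of the eight random number generators and a scaled-down bit vector; the class and method names (`BloomFilter`, `add`, `might_contain`) are illustrative and not taken from the original article.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, m_bits: int, k: int = 8) -> None:
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)  # all bits start at zero

    def _positions(self, item: str) -> Iterator[int]:
        for i in range(self.k):
            # Assumed stand-in for the i-th random number generator.
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True only if every one of the k bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


blacklist = BloomFilter(m_bits=16_000)
for address in ("spam1@example.com", "spam2@example.com"):
    blacklist.add(address)

print(blacklist.might_contain("spam1@example.com"))  # True: added items are always found
```

The method is named `might_contain` rather than `contains` on purpose: a `True` result means "possibly present", while `False` means "definitely absent".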

The Bloom filter will never miss any suspicious address that is on the blacklist. However, it has one disadvantage: there is a very small chance that it will judge an email address that is not on the blacklist to be on it, because the eight bits corresponding to a good email address may all happen to have been set to one by other addresses. Fortunately, this possibility is very small. We call it the false recognition (false positive) probability. In the preceding example, the false recognition probability is less than one in a thousand.
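The figure quoted for this example can be checked against the standard approximation for the false positive rate of a Bloom filter, p ≈ (1 − e^(−kn/m))^k, where n is the number of stored items, m the number of bits, and k the number of hash functions. The formula is not given in the text above; it is the usual textbook estimate and assumes the hash functions behave independently and uniformly.

```python
import math


def false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Parameters from the example: 100 million addresses, 1.6 billion bits, 8 hashes.
p = false_positive_rate(100_000_000, 1_600_000_000, 8)
print(f"estimated false positive rate: {p:.2e}")  # roughly 6 in 10,000
```

With 16 bits per address and eight hash functions, the estimate comes out below one in a thousand, consistent with the claim in the text.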

The advantage of the Bloom filter is that it is fast and saves space; its drawback is that it has a certain false recognition rate. A common remedy is to create a small whitelist that stores the email addresses that may be misjudged.
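One possible way to combine the filter with a whitelist is sketched below. The function `is_blocked` and the stand-in query `fake_filter_query` are hypothetical: the stand-in simply simulates a Bloom filter that wrongly reports one good address as blacklisted, so the example does not depend on any particular filter implementation.

```python
from typing import Callable, Set


def is_blocked(
    address: str,
    might_be_blacklisted: Callable[[str], bool],
    whitelist: Set[str],
) -> bool:
    """Consult the whitelist first, then fall back to the filter query."""
    if address in whitelist:
        return False  # known false positive: always let it through
    return might_be_blacklisted(address)


def fake_filter_query(address: str) -> bool:
    # Stand-in for a real Bloom filter lookup. It reports a genuine spam
    # address and, to simulate a false positive, one good address as well.
    return address in {"spam@example.com", "friend@example.com"}


whitelist = {"friend@example.com"}

print(is_blocked("spam@example.com", fake_filter_query, whitelist))    # True
print(is_blocked("friend@example.com", fake_filter_query, whitelist))  # False
```

The whitelist stays small because it only needs to hold the few good addresses the filter is known to misjudge, so it can be stored exactly (for example, in an ordinary set).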
