Background
When designing computer software, we often need to determine whether an element is in a set. For example, a word processor must check whether an English word is spelled correctly (that is, whether it is in a known dictionary); the FBI must check whether a suspect's name is already on the suspect list; a web crawler must check whether a URL has already been visited; and so on. The most direct method is to store all the elements of the set in the computer and, when a new element arrives, compare it directly with the elements in the set.
In a computer, a set is generally stored as a hash table. Its advantage is that it is fast and accurate; its disadvantage is that it wastes storage space. This is not a significant problem when the set is relatively small, but when the set is large, the low storage efficiency of the hash table becomes apparent.
For example, a public email provider such as Yahoo, Hotmail, or Gmail always needs to filter out junk mail sent by spammers. One way is to record the email addresses that send spam. Because spammers keep registering new addresses, there are at least several billion spam addresses worldwide, and storing them all requires a large number of network servers. If a hash table is used, every 100 million email addresses need 1.6 GB of memory. (A hash table is typically implemented by converting each email address into an eight-byte information fingerprint and storing that fingerprint in the table. Because the storage efficiency of a hash table is generally only about 50%, each email address occupies 16 bytes, so 100 million addresses take about 1.6 GB, that is, 1.6 billion bytes of memory.) Storing several billion email addresses may therefore require hundreds of GB of memory, which ordinary servers cannot hold unless they are supercomputers.
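The arithmetic behind these figures can be checked with a short sketch. The function name and the 10-billion example are illustrative assumptions; the 8-byte fingerprint and the 50% efficiency come from the text above.

```python
# Back-of-the-envelope memory estimate using the figures quoted in the text.
FINGERPRINT_BYTES = 8         # each address is reduced to an 8-byte fingerprint
HASH_TABLE_EFFICIENCY = 0.5   # only about half of the table's slots hold data
GB = 10**9                    # decimal gigabyte, matching "1.6 billion bytes"


def hash_table_bytes(num_addresses: int) -> float:
    """Memory needed to keep `num_addresses` fingerprints in a hash table."""
    bytes_per_address = FINGERPRINT_BYTES / HASH_TABLE_EFFICIENCY  # 16 bytes
    return num_addresses * bytes_per_address


# 100 million addresses, as in the text.
print(hash_table_bytes(100_000_000) / GB, "GB")
# An assumed 10 billion addresses, to illustrate "hundreds of GB".
print(hash_table_bytes(10_000_000_000) / GB, "GB")
```

At 16 bytes per address, 100 million addresses come to 1.6 billion bytes, and every additional billion addresses adds another 16 GB.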
Application scenarios of the Bloom filter
We now introduce a mathematical tool called the Bloom filter, which solves the same problem using only 1/8 to 1/4 of the space a hash table needs.
The Bloom filter never misses a suspicious address that is in the blacklist. However, it has one disadvantage: there is a very small chance that it judges an email address that is not in the blacklist to be in it, because the eight binary digits that a good email address corresponds to may all happen to have been set to 1 already. Fortunately, this probability is very small. We call it the false recognition probability (also known as the false positive rate).
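The behaviour described here can be illustrated with a minimal sketch of a Bloom filter. It is only one possible construction, not the implementation any particular mail provider uses. The class name, the use of SHA-256, and the double-hashing trick for deriving eight bit positions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k bit positions per item."""

    def __init__(self, num_bits: int, num_hashes: int = 8):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # every bit starts at 0

    def _positions(self, item: str):
        # Derive k positions from two 64-bit halves of one SHA-256 digest
        # (double hashing). This is an illustrative choice, not the only one.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        # Set all k bits that the item maps to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # "Present" only if all k bits are set; a single 0 bit proves absence.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


blacklist = BloomFilter(num_bits=1 << 20)
blacklist.add("spammer@example.com")
print("spammer@example.com" in blacklist)  # True: an added address is never missed
print("friend@example.com" in blacklist)   # normally False; True would be a false positive
```

Because adding an address only ever turns bits on, a lookup for an address that was added always finds all of its bits set. A false positive happens when other addresses have, between them, already set every bit that an unlisted address maps to.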
In short, the Bloom filter is fast and saves space, at the cost of a certain false recognition rate.
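To get a feel for how small the false recognition rate can be, the sketch below uses the standard textbook approximation p ≈ (1 - e^(-k/b))^k, where b is the number of filter bits per stored address and k is the number of bits each address sets. Neither the formula nor the function name comes from the original text, and the estimate assumes ideal, uniformly random hash functions. Taking the hash table's 16 bytes (128 bits) per address, 1/8 to 1/4 of that space is 16 to 32 bits per address, and k = 8 matches the eight binary digits mentioned above.

```python
import math


def false_positive_rate(bits_per_item: float, num_hashes: int) -> float:
    """Textbook estimate of a Bloom filter's false positive rate.

    Assumes ideal hash functions; real filters will deviate somewhat.
    """
    return (1.0 - math.exp(-num_hashes / bits_per_item)) ** num_hashes


# 16 and 32 bits per address correspond to 1/8 and 1/4 of a 16-byte entry.
for bits_per_item in (16, 32):
    rate = false_positive_rate(bits_per_item, num_hashes=8)
    print(f"{bits_per_item} bits per address: about {rate:.1e}")
```

Under these assumptions the estimate is on the order of a few in ten thousand at 16 bits per address and a few in a million at 32 bits.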
Use of the Bloom filter
When false positives cannot be accepted
Take user registration as an example: we use a Bloom filter built from the list of registered user names to decide whether a user name can be registered. The procedure is as follows:
1. Take the user name being registered and check whether it exists in the Bloom filter of registered user names.
2. If the user name is not in the Bloom filter, it is definitely not in the set: a Bloom filter never gives a wrong answer when it reports that an element is absent. You can safely return the result that the user name can be registered.
3. If the user name is in the Bloom filter, the result may be a false positive, because the Bloom filter can wrongly report an element as being in the set. We therefore query the real database to check whether the user name is actually registered.
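The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Registry` class, its `database` set (a stand-in for the real user database), and the compact `BloomFilter` helper are all hypothetical names, not part of any real framework.

```python
import hashlib


class BloomFilter:
    """Compact Bloom filter, included only to keep the example self-contained."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class Registry:
    def __init__(self):
        self.database = set()             # hypothetical stand-in for the user table
        self.name_filter = BloomFilter()  # filter over registered user names
        self.database_lookups = 0         # counts how often the database is queried

    def register(self, user_name: str) -> bool:
        """Return True if the name was registered, False if it is already taken."""
        # Steps 1 and 2: an "absent" answer is always right, so skip the database.
        if user_name in self.name_filter:
            # Step 3: "present" may be a false positive, so confirm in the database.
            self.database_lookups += 1
            if user_name in self.database:
                return False
        self.database.add(user_name)
        self.name_filter.add(user_name)
        return True


registry = Registry()
print(registry.register("alice"))  # True: first registration
print(registry.register("alice"))  # False: the name is already taken
print(registry.database_lookups)   # 1: only the second call queried the database
```

The point of the design is that most new names are rejected by the filter as "absent" and never reach the database; the database is queried only on the comparatively rare "present" answers.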
When false positives can be accepted
In blacklist filtering of spam, there is a very small chance that an email address that is not in the blacklist is judged to be in it.
A common remedy is to create a small whitelist that stores the mail addresses likely to be misjudged.
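One way to picture the whitelist remedy is sketched below. The `is_spam` function, the compact `BloomFilter` helper, and the example addresses are hypothetical. The filter is deliberately made far too small (8 bits) so that a false positive is easy to provoke; a real filter would be much larger.

```python
import hashlib


class BloomFilter:
    """Compact Bloom filter, included only to keep the example self-contained."""

    def __init__(self, num_bits: int, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def is_spam(address: str, blacklist_filter, whitelist) -> bool:
    """Consult the small exact whitelist first, then the Bloom filter."""
    if address in whitelist:
        return False  # known-good address that the filter would misjudge
    return address in blacklist_filter


# Deliberately undersized filter: 200 addresses crammed into 8 bits.
blacklist = BloomFilter(num_bits=8)
for i in range(200):
    blacklist.add(f"spammer{i}@example.com")

whitelist = set()
friend = "friend@example.com"
print(is_spam(friend, blacklist, whitelist))  # misjudged when the filter is saturated
whitelist.add(friend)
print(is_spam(friend, blacklist, whitelist))  # False: the whitelist overrides the filter
```

The whitelist stays small because it only needs to hold the good addresses that have actually been observed to collide with the blacklist, not every good address.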
Reference http://hi.baidu.com/godduty/item/eb60342e743f710772863e74