July 3, 2007 09:35:00 Publisher: Google researcher Wu
In daily life, and in the design of computer software, we often have to decide whether an element belongs to a set. For example, word processing software has to check whether an English word is spelled correctly (that is, whether it is in a known dictionary), the FBI has to check whether a suspect's name is already on its suspect list, a web crawler has to check whether a URL has already been visited, and so on. The most straightforward approach is to store every element of the set in the computer and, when a new element arrives, compare it directly against the stored elements. Sets are generally stored in a computer as hash tables. A hash table is fast and accurate, but it costs storage space. The cost hardly matters when the set is small, but when the set is large, the storage inefficiency of a hash table becomes apparent.
For example, public email providers such as Yahoo, Hotmail, and Gmail always need to filter out mail from people who send spam (spammers). One way to do that is to keep a record of the email addresses that send spam. Because those senders keep registering new addresses, there are at least several billion spamming addresses worldwide, and storing all of them takes a large number of network servers. With a hash table, every 100 million email addresses require 1.6 GB of memory. (The specific implementation converts each email address into an eight-byte information fingerprint, see googlechinablog.com/2006/08/ Blog-post.html, and then stores that fingerprint in the hash table. Because the storage efficiency of a hash table is generally only 50%, each email address occupies 16 bytes, so 100 million addresses take about 1.6 GB, or 1.6 billion bytes, of memory.) Storing several billion email addresses may therefore require hundreds of gigabytes of memory. Unless it is a supercomputer, an ordinary server cannot hold that much.
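The storage arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, and the numbers are the ones quoted in the text.

```python
# Back-of-the-envelope storage cost of the hash table approach described above.
ADDRESSES = 100_000_000   # 100 million email addresses
FINGERPRINT_BYTES = 8     # one eight-byte information fingerprint per address
STORAGE_EFFICIENCY = 0.5  # a hash table is assumed to be only about 50% full

bytes_per_address = FINGERPRINT_BYTES / STORAGE_EFFICIENCY
total_bytes = ADDRESSES * bytes_per_address

print(f"{bytes_per_address:.0f} bytes per address")
print(f"{total_bytes / 1e9:.1f} GB per 100 million addresses")
```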
Today we introduce a mathematical tool called the Bloom filter, which solves the same problem using only 1/8 to 1/4 of the space a hash table needs.
The Bloom filter was proposed by Burton Bloom in 1970. It is really just a very long binary vector together with a series of random mapping functions. We use the example above to explain how it works.
Suppose we want to store 100 million email addresses. We first set up a vector of 1.6 billion bits, that is, 200 million bytes, and set all 1.6 billion bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to produce eight information fingerprints (f1, f2, ..., f8). We then use a random number generator G to map these eight fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion, and set the bits at those eight positions to one. Once all 100 million email addresses have been processed this way, a Bloom filter for these addresses has been built. (See the chart below.)
Now let's see how to use the Bloom filter to check whether a suspicious email address Y is in the blacklist. We use the same eight random number generators (F1, F2, ..., F8) to produce eight information fingerprints s1, s2, ..., s8 for this address, and then map these eight fingerprints to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is in the blacklist, the eight bits at t1, t2, ..., t8 must obviously all be one. In this way, any email address that is in the blacklist is detected.
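The build and lookup steps described above can be sketched in Python as follows. This is a minimal illustration, not the implementation behind the original post: the class name `BloomFilter`, the use of salted SHA-256 as a stand-in for the generators F1 to F8 and G, and the tiny filter size are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit vector plus k hash-derived bit positions."""

    def __init__(self, num_bits: int, num_hashes: int = 8) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # every bit starts at zero

    def _positions(self, item: str):
        # Assumed stand-in for F1..F8 and G: salt SHA-256 with the index i,
        # then reduce the digest to a position in [0, num_bits).
        # (The text counts positions from 1; this sketch counts from 0.)
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set the bit at each of the k positions to one.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # The item can only be present if all k bits are one.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# 1,600 bits is room for about 100 addresses at 16 bits per address,
# the same ratio as the example in the text.
blacklist = BloomFilter(num_bits=1600)
for address in ("spam1@example.com", "spam2@example.com", "spam3@example.com"):
    blacklist.add(address)

print("spam1@example.com" in blacklist)  # True: a stored address is always found
```

A lookup that returns False is definitive, because at least one of the eight bits was never set. A lookup that returns True only means that all eight bits happen to be one.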
The Bloom filter never misses a suspicious address that is in the blacklist. However, it has one shortcoming: there is a very small chance that it judges an email address that is not in the blacklist to be in it, because the eight bits corresponding to a good email address may all happen to have been set to one by other addresses. Fortunately, this possibility is very small. We call it the false positive rate. In the example above, the false positive rate is below one in a thousand (roughly six in ten thousand by the standard estimate).
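For the parameters of the example, the false positive rate can be estimated with the standard textbook approximation. The formula is not given in the original post and assumes the eight mapping functions behave like independent, uniformly distributed hashes. The symbols m (number of bits), n (number of stored addresses), and k (number of mapping functions) are notation introduced here.

```latex
% Assumed notation: m = 1.6 \times 10^{9} bits, n = 10^{8} addresses, k = 8 functions
p \approx \left(1 - e^{-kn/m}\right)^{k}
  = \left(1 - e^{-8 \cdot 10^{8} / (1.6 \times 10^{9})}\right)^{8}
  = \left(1 - e^{-0.5}\right)^{8}
  \approx 0.3935^{8}
  \approx 5.7 \times 10^{-4}
```

Under this approximation, a filter with 16 bits per address and eight mapping functions misjudges roughly 6 out of every 10,000 addresses that are not in the set.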
The advantages of the Bloom filter are that it is fast and saves space, but it does have a certain false positive rate. A common remedy is to keep a small whitelist that stores the email addresses that may be misjudged.
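The whitelist remedy might look like the sketch below. The function name `is_blacklisted` and the set that stands in for the filter lookup are hypothetical, chosen only to show the order of the two checks.

```python
def is_blacklisted(address, filter_says_spam, whitelist):
    """Consult the whitelist first, then fall back to the filter's answer."""
    if address in whitelist:
        return False  # known good address that the filter misjudges
    return filter_says_spam(address)


# Hypothetical stand-in for a Bloom filter lookup that wrongly flags one good address.
flagged = {"spammer@example.com", "friend@example.com"}
whitelist = {"friend@example.com"}

print(is_blacklisted("friend@example.com", flagged.__contains__, whitelist))   # False
print(is_blacklisted("spammer@example.com", flagged.__contains__, whitelist))  # True
```

Checking the whitelist first keeps the correction cheap: the list only has to hold the few good addresses that the filter is known to misjudge.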
From:http://www.google.com.hk/ggblog/googlechinablog/2007/07/bloom-filter_7469.html