"Introduction"In daily life, including the design of computer software, we often have to determine whether an element is in a set. For example, in word processing software, you need to check whether an English word is spelled correctly (that is, to determine if it is in a known dictionary); at the FBI, whether a suspect's name is on the list of suspects, or whether a Web site has been visited in a web crawler, and so on. The most straightforward approach is to have all the elements of the collection present on the computer, and when you encounter a new element, compare it directly to the elements in the collection. Generally speaking, a collection in a computer ishash table (hash table)to store the. The advantage of it is thatFast and accurate, the disadvantage isFee Storage Space。 This problem is not significant when the set is relatively small, but when the collection is large, the problem of low storage efficiency of the hash table becomes apparent. For example, a public email provider, like Yahoo,hotmail and Gmai, always needs to filter spam from people who send spam (Spamer). One way to do this is to keep a record of the email addresses that were sent to spam. Since those senders are constantly registering new addresses, there are billions of more spam addresses around the world, and it takes a lot of Web servers to save them all. If you use a hash table, each store 100 million e-mail addresses, you need 1.6GB of memory (the specific way to do this is to map each email address into a eight-byte information fingerprint (see: Mathematical beauty of the information fingerprint), and then put the information fingerprint into a hash table, Since the storage efficiency of a hash table is typically only 50%, an email address needs to occupy 16 bytes. 100 million addresses are approximately 1.6GB, or 1.6 billion bytes of memory. Therefore, storing billions of e-mail addresses may require hundreds of gigabytes of memory. 
The general server cannot be stored unless it is a supercomputer.
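The memory estimate above can be checked with a few lines of arithmetic. This is only a sketch of the calculation in the text; the constant and function names are made up for illustration, and the figure of 10 billion addresses is an assumed stand-in for "billions".

```python
FINGERPRINT_BYTES = 8      # size of one information fingerprint (from the text)
STORAGE_EFFICIENCY = 0.5   # typical hash table storage efficiency (from the text)

# Effective bytes per stored address: 8 / 0.5 = 16
BYTES_PER_ADDRESS = int(FINGERPRINT_BYTES / STORAGE_EFFICIENCY)


def hash_table_bytes(num_addresses):
    """Approximate memory needed to keep the fingerprints in a hash table."""
    return num_addresses * BYTES_PER_ADDRESS


per_100_million = hash_table_bytes(100_000_000)        # 1.6 billion bytes
assumed_total = hash_table_bytes(10_000_000_000)       # assumed 10 billion addresses

print(per_100_million / 1e9, "GB per 100 million addresses")
print(assumed_total / 1e9, "GB for an assumed 10 billion addresses")
```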
Today we introduce a mathematical tool called the Bloom filter, which needs only 1/8 to 1/4 of the space of a hash table to solve the same problem. (From The Beauty of Mathematics)
"Overview"
The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions.
A Bloom filter can be used to test whether an element is in a set.
Its advantages are that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements.
"How it Works"
Let's explain how it works using the email example above.
Suppose we want to store 100 million email addresses. First create a vector of 1.6 billion bits, that is, 200 million bytes, and set all 1.6 billion bits to zero.
For each email address x, use 8 different random number generators (F1, F2, ..., F8) to produce 8 information fingerprints (f1, f2, ..., f8).
Then use a random number generator G to map these 8 fingerprints to 8 natural numbers g1, g2, ..., g8 between 1 and 1.6 billion, and set the bits at all 8 of these positions to 1. After all 100 million email addresses have been processed this way, a Bloom filter for these email addresses has been built.
Now let's look at how to use the Bloom filter to check whether a suspicious email address y is on the blacklist. Use the same 8 random number generators (F1, F2, ..., F8) to produce 8 information fingerprints (s1, s2, ..., s8) for this address, then map these 8 fingerprints to 8 bits of the Bloom filter, t1, t2, ..., t8.
If y is on the blacklist, the 8 bits at t1, t2, ..., t8 must all be 1. In this way, any email address that is on the blacklist is guaranteed to be detected.
Put plainly, the principle is simple: a bit array plus k different hash functions. To add an element, set the bits at the positions given by the hash functions to 1; to query, if all the bits at the positions given by the hash functions are 1, the element is reported as present.
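The build and lookup steps for the email example can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the class name `BloomFilter`, the use of salted SHA-256 as a stand-in for the 8 random number generators, and the modulo step standing in for the mapping G are all assumptions made for the example. Positions here are 0-based.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k salted hash functions."""

    def __init__(self, num_bits, num_hashes=8):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # All bits start at zero; 8 bits are packed into each byte.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            # Stand-in for generator F_i: an 8-byte fingerprint from a salted hash.
            digest = hashlib.sha256(bytes([i]) + data).digest()
            fingerprint = int.from_bytes(digest[:8], "big")
            # Stand-in for G: map the fingerprint to a bit position.
            yield fingerprint % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Scaled-down usage: 1,000 addresses at 16 bits each, the same ratio as
# 1.6 billion bits for 100 million addresses.
blacklist = BloomFilter(num_bits=16_000, num_hashes=8)
spam_addresses = [f"spammer{i}@example.com" for i in range(1000)]
for address in spam_addresses:
    blacklist.add(address)

print("spammer42@example.com" in blacklist)  # a stored address is always reported
```

Every address that was added is reported as present; an address that was never added is usually reported as absent, but may occasionally be reported as present.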
"collection Representations and element queries"
Let's look at how a Bloom filter uses a bit array to represent a set. In its initial state, a Bloom filter is a bit array of m bits, each set to 0.
To represent a set S = {x1, x2, ..., xn} of n elements, the Bloom filter uses k independent hash functions, each of which maps every element of the set into the range {1, ..., m}.
For any element x, the position h_i(x) given by the i-th hash function is set to 1 (1 ≤ i ≤ k).
Note that if a position is set to 1 more than once, only the first time has any effect; later ones change nothing.
In the figure, k = 3, and two of the hash functions select the same position (the fifth bit from the left).
To decide whether y belongs to the set, we apply the k hash functions to y. If all positions h_i(y) are 1 (1 ≤ i ≤ k), we consider y to be an element of the set; otherwise we consider y not to be an element of the set.
In the figure, y1 is not an element of the set, while y2 either belongs to the set or is a false positive.
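A toy example makes the set representation and the two query outcomes concrete. The sketch below assumes m = 16 and k = 3, with three hypothetical arithmetic hash functions chosen only so the positions are easy to follow (a real filter would use well-mixed hashes). Positions are 0-based here, whereas the text numbers them 1 to m.

```python
M = 16  # number of bits in the array

# Three toy hash functions, purely illustrative.
HASHES = [
    lambda x: x % M,
    lambda x: (3 * x + 1) % M,
    lambda x: (7 * x + 5) % M,
]


def positions(x):
    """Bit positions selected for x by each of the k hash functions."""
    return [h(x) for h in HASHES]


def build(elements):
    bits = [0] * M
    for x in elements:
        for pos in positions(x):
            bits[pos] = 1  # setting a bit that is already 1 changes nothing
    return bits


def might_contain(bits, y):
    """False: y is definitely not in the set. True: y is in the set or a false positive."""
    return all(bits[pos] == 1 for pos in positions(y))


S = {3, 5, 10}
bits = build(S)

print(positions(3))            # two of the three hash functions pick the same position
print(might_contain(bits, 4))  # at least one of its bits is 0, so 4 is definitely absent

# Search for the first non-member that the filter nevertheless reports as present.
first_false_positive = next(
    y for y in range(100) if y not in S and might_contain(bits, y)
)
print(first_false_positive)
```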
"Mistaken identification problem"
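(From The Beauty of Mathematics)
For a bit array of m bits holding n elements with k hash functions, the false positive rate is approximately $p \approx (1 - e^{-kn/m})^k$. From this formula you can see:
when $k = \ln 2 \cdot m/n$, p is at its minimum.
How to choose the size m of the bit array and the number k of hash functions based on the number n of input elements:
The error rate is minimal when the number of hash functions is $k = \ln 2 \cdot m/n$.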
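The effect of k on the error rate can be checked numerically. The sketch below assumes the standard approximation p ≈ (1 − e^(−kn/m))^k for the false positive rate, and uses the sizes from the email example (1.6 billion bits, 100 million addresses); the function name is made up for illustration.

```python
import math


def false_positive_rate(m, n, k):
    """Standard approximation for m bits, n stored elements, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k


m, n = 1_600_000_000, 100_000_000   # sizes from the email example
k_opt = math.log(2) * m / n         # real-valued optimum, ln 2 * m / n

# Evaluate integer choices of k and pick the one with the lowest rate.
rates = {k: false_positive_rate(m, n, k) for k in range(1, 21)}
best_k = min(rates, key=rates.get)

print(f"k_opt = {k_opt:.2f}, best integer k = {best_k}")
print(f"p at k = 8:      {rates[8]:.2e}")
print(f"p at k = best_k: {rates[best_k]:.2e}")
```

With 16 bits per element, the 8 hash functions used in the email example are close to, but slightly below, the optimum.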
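With $k = \ln 2 \cdot m/n$, the error rate is $p = (1/2)^k = 2^{-\ln 2 \cdot m/n}$. Requiring the error rate p to be no greater than $\epsilon$:
$2^{-\ln 2 \cdot m/n} \le \epsilon$
This gives:
$m \ge n \log_2(1/\epsilon) / \ln 2 = n \log_2 e \cdot \log_2(1/\epsilon)$
For the error rate to be no greater than $\epsilon$, m must be at least $n \log_2(1/\epsilon)$ in order to represent any set of n elements.
But m should be larger still, because at least half of the bit array must also remain 0; m should then be greater than or equal to about 1.44 times $n \log_2(1/\epsilon)$, since $\log_2 e \approx 1.44$.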
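As a rough sizing aid, the bound can be turned into a small helper. This is a sketch under the relations m ≥ n · log2(1/ε) / ln 2 and k = ln 2 · m/n; the function name `bloom_parameters` and the target error rate of 0.1% are assumptions for illustration.

```python
import math


def bloom_parameters(n, epsilon):
    """Return (m, k): bit array size and hash count for n elements at error rate epsilon."""
    # m >= n * log2(1/epsilon) / ln 2, about 1.44 * n * log2(1/epsilon)
    m = math.ceil(n * math.log2(1 / epsilon) / math.log(2))
    # k = ln 2 * m / n, rounded to a whole number of hash functions
    k = max(1, round(math.log(2) * m / n))
    return m, k


n = 100_000_000   # number of email addresses in the running example
epsilon = 0.001   # assumed target false positive rate

m, k = bloom_parameters(n, epsilon)
print(f"m = {m} bits (about {m / 8 / 1e6:.0f} MB), k = {k}")
print(f"m / (n * log2(1/epsilon)) = {m / (n * math.log2(1 / epsilon)):.3f}")
```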
The mathematical principle behind the Bloom filter is that the probability of two completely random numbers colliding is very small, so a large amount of information can be stored in very little space at the cost of a very small false positive rate.
"Scope of application"
It can be used to implement a data dictionary, to deduplicate data, or to compute the intersection of sets.
[Algorithm Series, Part 10] A powerful tool for massive data processing: the Bloom filter