[Repost] Massive data processing algorithms: Bloom filter

Last Update:2014-08-03 Source: Internet

Author: User

Tags internet cache

1. Introduction to the Bloom Filter Algorithm

A Bloom filter (BF) is a space-efficient probabilistic data structure. It uses a bit array to represent a set compactly and answers whether an element belongs to that set. A Bloom filter can produce false positives but never false negatives: when it reports that an element is not in the set, the element is definitely not in the set; when it reports that an element is in the set, there is some probability that the answer is wrong. Bloom filters are therefore unsuitable for applications that require zero errors. In applications that can tolerate a low error rate, a Bloom filter saves a great deal of space compared with other common approaches such as hash tables and binary search.

Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms. Its disadvantages are a certain false positive rate and the difficulty of deleting elements.

2. Basic Idea of the Bloom Filter

The core idea of the Bloom filter algorithm is to use several different hash functions to resolve "conflicts" (collisions).

The most direct way to decide whether an element x is in a collection is to store all known elements in a set R and compare x against the elements of R. Data structures such as linked lists can do this. However, memory usage grows with the number of elements in R. Imagine tens of millions of distinct web pages to download: the memory required could fill the entire address space of the process. Even if MD5 or UUID-style methods are used to convert each URL into a short fixed-length string, memory usage is still considerable.

We therefore turn to the hash table idea: use a good enough hash function to map each URL to one bit in a bit array (bitmap). If that bit has already been set to 1, the URL has been seen before.

Hashing has a collision problem: two different URLs may produce the same hash value. To reduce collisions, we can use several hash functions. If any one of them indicates that an element is not in the collection, the element is definitely not in the collection. Only when all the hash functions indicate that the element is in the set do we conclude that it is in the set. This is the basic idea of the Bloom filter.

Key components: a bit array and k independent hash functions.

1) Bit array:

Assume that the Bloom filter uses an array of m bits to store information. In the initial state, every bit of the array is set to 0.

2) Add elements using k independent hash functions

Let S = {x1, x2, ..., xn} be a set of n elements. The Bloom filter uses k hash functions to map each element of the set into the range {1, ..., m}.

To add an element x to the Bloom filter, apply the k hash functions to obtain k hash values and set the corresponding bits of the array to 1. That is, for each hash function i (1 ≤ i ≤ k), the position hash_i(x) is set to 1.

Note: If a position is set to 1 several times, only the first time has any effect; later writes change nothing. In the original illustration, k = 3 and two of the hash functions select the same position (the fifth bit from the left, which is the second "1").

3) Determine whether an element is in the set

To judge whether y belongs to the set, apply the k hash functions to y to obtain k hash values. If every position hash_i(y) (1 ≤ i ≤ k) holds a 1, that is, all k bits are set, y is considered an element of the set; otherwise y is considered not to be an element of the set. For example, an element y1 that maps to at least one 0 bit is definitely not in the set, while an element y2 whose k bits are all 1 either belongs to the set or is a false positive.

Obviously, this judgment does not guarantee that the search result is 100% correct.
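The three steps above (bit array, adding, and membership testing) can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the use of salted SHA-256 as a stand-in for k independent hash functions, and the sizes m = 1024 and k = 3 are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array plus k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # number of bits in the array
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # every bit starts at 0

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index i.
        # Illustrative stand-in for k independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            # Setting a bit that is already 1 changes nothing.
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Any 0 bit proves absence; all 1 bits means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for url in ("http://example.com/a", "http://example.com/b"):
    bf.add(url)

print("http://example.com/a" in bf)     # True: added elements are never missed
print("http://example.com/zzz" in bf)   # usually False; a false positive is possible
```

Positions here run from 0 to m - 1 rather than 1 to m, which is only an indexing convention.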

Disadvantages of the Bloom filter:

1) A Bloom filter cannot delete an element from the set it represents, because the bits for that element may also be shared by other elements. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and thereby supports deletion. In addition, the choice of hash functions affects the performance of the algorithm.

2) Another important question is how to choose the size m of the bit array and the number of hash functions k for a given number of input elements n. The error rate is lowest when k = (ln 2) * (m/n). To keep the error rate no greater than E, m must be at least n * lg(1/E) to represent any set of n elements. In practice m should be larger, because at least half of the bit array must remain 0: m >= n * lg(1/E) * lg(e), which is about 1.44 times n * lg(1/E) (lg denotes the base-2 logarithm).

For example, if the error rate is 0.01, then lg(1/0.01) * lg(e) is about 6.64 * 1.44, so m should be about 9.6 times n. In this case, k = (ln 2) * (m/n) is about 7.
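The sizing rule can be checked numerically. The sketch below evaluates m >= n * lg(1/E) * lg(e) and k = (ln 2) * (m/n) for a given n and E; the function names `bloom_parameters` and `false_positive_rate` are hypothetical, and the error estimate uses the standard approximation (1 - e^(-kn/m))^k, which is not stated in the text above.

```python
import math


def bloom_parameters(n: int, error_rate: float) -> tuple[int, int]:
    """Return (m, k): bits needed and hash-function count for n elements."""
    # m >= n * lg(1/E) * lg(e), with lg the base-2 logarithm
    m = math.ceil(n * math.log2(1 / error_rate) * math.log2(math.e))
    # k = ln(2) * (m / n), rounded to a whole number of functions, at least 1
    k = max(1, round(math.log(2) * m / n))
    return m, k


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k after inserting n elements."""
    return (1 - math.exp(-k * n / m)) ** k


n = 1_000_000
m, k = bloom_parameters(n, error_rate=0.01)
print(m / n, k)                         # bits per element and number of hash functions
print(false_positive_rate(m, n, k))     # close to the 0.01 target
```

Because k has to be a whole number, the achieved rate lands near the target rather than exactly on it.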

Note:

Here m and n have different units: m counts bits, while n counts elements (strictly speaking, distinct elements). A single element is usually many bits long, so a Bloom filter usually saves memory.

A Bloom filter is often used alongside a key-value database to speed up queries. Because the filter is very small, it can be kept entirely in memory. For most keys that do not exist, a lookup in the in-memory filter is enough to give the answer; only a small fraction of queries need to reach the key-value database on the hard disk. This greatly improves efficiency.

A bloom filter has the following parameters:

m	Width of the bit array (number of bits)
n	Number of keys added
k	Number of hash functions used
f	False positive rate

3. Extensions

Counting Bloom filter

A plain Bloom filter does not support deletion because it cannot tell which elements a given bit belongs to. We can therefore attach a counter to each position: increment the counters when adding an element and decrement them when deleting one.

However, the width of these counters has to be considered. If the same element is inserted many times, narrow counters may overflow. If the counter is capped at an upper limit, lookups may occasionally miss, but for some applications, such as web cache sharing, this is not a problem.
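A minimal sketch of the counter idea follows. The class name `CountingBloomFilter`, the salted SHA-256 hashing, and the cap of 15 (modelling a 4-bit counter) are assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant with one small counter per slot instead of one bit."""

    def __init__(self, m: int, k: int, max_count: int = 15) -> None:
        self.m = m
        self.k = k
        self.max_count = max_count   # 15 models a 4-bit counter (an assumption)
        self.counters = [0] * m

    def _positions(self, item: str):
        # Salted SHA-256 stands in for k independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:   # saturate instead of overflowing
                self.counters[pos] += 1

    def remove(self, item: str) -> None:
        # Only call this for items that were actually added.
        for pos in self._positions(item):
            if 0 < self.counters[pos] < self.max_count:   # saturated counters stay put
                self.counters[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("alice@example.com")
cbf.add("bob@example.com")
cbf.remove("alice@example.com")
print("bob@example.com" in cbf)     # True: deleting another element does not remove it
print("alice@example.com" in cbf)   # normally False after the deletion
```

In this sketch a saturated counter is never decremented, which trades possible misses for a few stale positives; decrementing a capped counter instead would allow misses.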

Compressed Bloom filter

To transmit a Bloom filter between servers over the network more quickly, the filter can be compressed based on its actual parameters once it has been fully built.

After all elements have been added to the Bloom filter, the actual space usage is known. Substitute this value into the formula to obtain a size smaller than m, then rebuild the Bloom filter at that size by taking the original hash values modulo the new size. This gives a more suitable memory size while the false positive rate remains unchanged.

4. Applications of the Bloom Filter

A Bloom filter is used to determine whether an element exists in a very large collection, for example in the spam filter of a mail server. In the search engine field, Bloom filters are most commonly used for URL filtering by web spiders. A spider usually keeps a URL list that stores the URLs of pages waiting to be downloaded and pages already downloaded. After the spider downloads a page and extracts new URLs from it, it must determine whether each URL is already in the list. A Bloom filter is a natural choice for this check.

1. Speeding up key-value queries

A Bloom filter can be used with a key-value database to speed up queries.

In a key-value storage system the values usually reside on hard disk, so queries are slow. Insert all stored keys into the filter; if a queried key is not in the filter, there is no need to query the storage at all. A false positive only costs one extra storage query.

Because the Bloom filter takes very little space, it can be kept entirely in memory. For most keys that do not exist, checking the in-memory filter is enough; only a small fraction of queries have to reach the key-value database on the hard disk. This greatly improves efficiency.
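The query path described here can be sketched as a thin wrapper around a slow store. Everything named below is hypothetical: `FilteredStore` is an illustrative wrapper, a Python dict stands in for the on-disk database, and `disk_reads` only counts how often that stand-in is consulted.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k, self.bits = m, k, 0   # a Python int serves as the bit array

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class FilteredStore:
    """Hypothetical wrapper: an in-memory filter in front of a slow store."""

    def __init__(self, slow_store: dict) -> None:
        self.slow_store = slow_store        # stands in for the on-disk database
        self.filter = BloomFilter(m=8192, k=5)
        self.disk_reads = 0
        for key in slow_store:              # insert every stored key into the filter
            self.filter.add(key)

    def get(self, key: str):
        if key not in self.filter:          # definite miss: skip the disk entirely
            return None
        self.disk_reads += 1                # real hit or false positive: one disk query
        return self.slow_store.get(key)


store = FilteredStore({f"user:{i}": i for i in range(100)})
print(store.get("user:7"))                  # an existing key costs one disk read
misses = [store.get(f"ghost:{i}") for i in range(1000)]
print(store.disk_reads)                     # 1 plus the number of false positives
```

A false positive still returns `None` here, because the underlying store is the final authority; the filter only decides whether the expensive lookup can be skipped.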

2. Google's Bigtable

Google's Bigtable also uses Bloom filters to reduce disk lookups for non-existent rows or columns, which greatly improves the performance of database queries.

3. Proxy cache

Among proxy caches that work with the Internet Cache Protocol, many use Bloom filters to store URLs. Besides efficient lookups, this makes it easy for proxy caches to transmit and exchange cache information.

4. Network applications

1) Resource lookup in P2P networks: keep a Bloom filter for each network path, and when a lookup hits a filter, choose that path for access.

2) When broadcasting a message, you can check whether a given IP address has already sent the packet.

3) Loop detection for broadcast packets: store a Bloom filter in the packet and have each node add itself to the filter.

4) Message queue management: use a counting Bloom filter to manage message traffic.

5. Spam address filtering

Public email providers such as NetEase and QQ constantly need to filter spam sent by spammers.

One approach is to record the email addresses that send spam. Because senders keep registering new addresses, there are at least several billion spam addresses worldwide, and storing them all would take a large number of network servers.

If a hash table is used, every 100 million email addresses require 1.6 GB of memory. (In a concrete hash table implementation, each email address is converted into an eight-byte information fingerprint, which is then stored in the hash table. Because the storage efficiency of a hash table is generally only 50%, each email address occupies 16 bytes, so 100 million addresses take about 1.6 GB, that is, 1.6 billion bytes of memory.) Storing several billion email addresses may therefore require hundreds of GB of memory.

A Bloom filter can solve the same problem using only 1/8 to 1/4 of the space of the hash table.
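The arithmetic behind these figures is easy to reproduce. The sketch below simply restates the numbers from the text (100 million addresses, eight-byte fingerprints, 50% storage efficiency, and the 1/8 to 1/4 ratio); the variable names are chosen for the example.

```python
addresses = 100_000_000        # 100 million email addresses
fingerprint_bytes = 8          # eight-byte information fingerprint
storage_efficiency = 0.5       # a hash table is typically about half full

bytes_per_address = fingerprint_bytes / storage_efficiency   # 16 bytes each
hash_table_bytes = addresses * bytes_per_address             # 1.6 billion bytes

# A Bloom filter needs 1/8 to 1/4 of the hash table's space.
bloom_low_bytes = hash_table_bytes / 8
bloom_high_bytes = hash_table_bytes / 4

print(hash_table_bytes / 1e9)                        # hash table size in GB
print(bloom_low_bytes / 1e9, bloom_high_bytes / 1e9) # Bloom filter size range in GB
print(bloom_low_bytes * 8 / addresses,
      bloom_high_bytes * 8 / addresses)              # Bloom filter bits per address
```

In other words, the Bloom filter budget works out to between 16 and 32 bits per address, against 128 bits per address for the hash table.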

A Bloom filter never misses a suspicious address that is on the blacklist. As for misjudgments, the common remedy is to keep a small whitelist of legitimate addresses that might otherwise be misjudged.

5. A Concrete Bloom Filter Problem

Problem example: you are given two files, A and B, each containing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB. Find the URLs common to files A and B. What if there are three or even n files?

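For this problem, first work out the memory budget. 4 GB = 2^32 bytes, about 4 billion bytes, which is about 34 billion bits. With n = 5 billion and an error rate of 0.01, the sizing formula m >= n * lg(1/E) * lg(e) calls for about 9.6 bits per element, roughly 48 billion bits. Only 34 billion bits are available, which is not far short, so the approach still works but the error rate rises. Build a Bloom filter from the URLs in file A, then test each URL in file B against it; a URL that hits is probably common to both files. In addition, if these URLs correspond one-to-one to IP addresses, you can convert them into IP addresses, which is much simpler.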
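The memory estimate can be worked through step by step. This sketch assumes the sizing formula m >= n * lg(1/E) * lg(e) from section 2 and the standard approximation (1 - e^(-kn/m))^k for the error rate; the variable names are chosen for the example.

```python
import math

memory_bytes = 4 * 2**30                 # 4 GB memory limit
available_bits = memory_bytes * 8        # about 34 billion bits
n = 5_000_000_000                        # URLs in one file
target_error = 0.01

# Bits needed to reach the target error rate: n * lg(1/E) * lg(e)
required_bits = math.ceil(n * math.log2(1 / target_error) * math.log2(math.e))

# With only the available bits, pick the best k and estimate the error rate.
bits_per_url = available_bits / n
k = max(1, round(math.log(2) * bits_per_url))
achieved_error = (1 - math.exp(-k / bits_per_url)) ** k

print(available_bits, required_bits)     # budget versus requirement, in bits
print(k, achieved_error)                 # hash count and resulting error rate
```

The shortfall shows up as a higher error rate rather than a failure: the filter still fits in memory, but more of the URLs it reports as common would need to be verified.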

Reposted from:

Http://blog.csdn.net/hguisu/article/details/7866173

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
