Massive Data Processing: Bloom Filter

Source: Internet
Author: User
Tags: bitset, internet, cache

The Bloom filter was proposed by Burton Bloom in 1970 and was initially used widely in spell checkers and database systems. In recent years, as computer and Internet technologies have developed and data sets have kept growing, Bloom filters have attracted renewed attention, and many new applications and variants have appeared. A Bloom filter is a space-efficient probabilistic data structure consisting of a bit array and a set of hash functions. It can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms. Its disadvantage is that it has a certain false positive rate, so a Bloom filter is not suitable for applications that require zero errors. In applications that can tolerate a low error rate, a Bloom filter trades a small number of errors for a huge saving in storage space.

(1) Comparison by example

Suppose you want to write a web crawler (spider). Because the links between web pages are intricate, a spider crawling the web is likely to end up in a "loop". To avoid loops, you need to know which URLs the spider has already visited. How do you know whether a spider has visited a given URL? Consider the following solutions:

1. Save the visited URLs to a database.
2. Save the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).
3. Hash each URL with MD5 or SHA-1, then save the digest in a HashSet or a database.
4. Bit-map method. Create a BitSet and map each URL to one bit through a hash function.

Methods 1 to 3 keep a complete record of every visited URL, while method 4 only sets the one bit that a URL maps to. All of these methods work well when the data volume is small, but problems arise when the data volume becomes very large:

Disadvantage of method 1: relational database queries become very inefficient when the data volume is very large. Besides, launching a database query for every single URL is heavy-handed.
Disadvantage of method 2: too much memory consumption. Memory usage grows with the number of URLs. Even with only 100 million URLs of 50 characters each, about 5 GB of memory is needed.
Method 3: an MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so method 3 uses several times less memory than method 2.
Method 4: memory consumption is relatively small, but the drawback is that the collision probability of a single hash function is too high. Remember the various ways of resolving hash table collisions from your data structures course? To bring the collision probability down to 1%, the BitSet must be 100 times as long as the number of URLs.
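
The memory figures above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes 1 GB = 10^9 bytes and one byte per URL character, and it ignores container overhead. The collision estimate 1 - e^(-n/m) for a single hash function is a standard approximation and is not taken from the original text.

```python
import math

n = 100_000_000          # 100 million URLs
GB = 10**9               # assumption: decimal gigabytes

raw_gb = n * 50 / GB             # method 2: 50 one-byte characters per URL
md5_gb = n * (128 // 8) / GB     # method 3 with MD5: 128-bit digest
sha1_gb = n * (160 // 8) / GB    # method 3 with SHA-1: 160-bit digest

m = 100 * n                      # method 4: BitSet 100 times the URL count
bitset_gb = m / 8 / GB
# Estimated chance that a new URL maps to a bit that is already set.
collision = 1 - math.exp(-n / m)

print(raw_gb, md5_gb, sha1_gb, bitset_gb, collision)
```

Under these assumptions the raw URLs take 5 GB, the MD5 digests 1.6 GB, and the BitSet 1.25 GB, with a collision probability just under 1%.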

In essence, the methods above overlook an important implicit condition: a small probability of error is acceptable, and 100% accuracy is not required. In other words, if a small number of URLs that the crawler has not actually visited are misjudged as visited, the cost is very low: at worst, the crawler fetches a few fewer pages.

(2) Bloom filter definition

A Bloom filter is an array of m bits, all initially 0, together with k independent hash functions.

(Note: a Bloom filter differs from the single-hash-function bit-map method in that it uses k hash functions, so each string corresponds to k bits. This reduces the probability of collisions.)

Add operation:
For each element, the k hash functions are used to compute a hash vector of size k (h1, h2, ..., hk), and the bit corresponding to each hash value in the vector is set to 1. Treating k as a constant, the time complexity is O(n), where n is the length of the string, because string hash functions generally run in O(n) time.

Query operation: similar to the add operation, the hash vector is computed first. If the bits for all of the hash values are 1, the element is reported as present; if any bit is 0, the element is definitely not in the set. The time complexity is the same as that of the add operation.
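
The add and query operations described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `BloomFilter`, the parameter values, and the use of salted SHA-256 to stand in for k separate hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array plus k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Assumption: k hash functions are simulated by prefixing the
        # index i to the data before hashing with SHA-256.
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every one of the k bits is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


visited = BloomFilter(m=10_000, k=7)
visited.add("http://example.com/a")
print("http://example.com/a" in visited)   # True: added items are always found
```

A query for an item that was never added usually returns False, but it can return True when other items happen to have set all k of its bits.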

(3) False positives

If an element is not in the Bloom filter but all the bits for its hash values happen to be set to 1, the filter wrongly reports it as present. This is a false positive, that is, a misjudgment. Bloom filters allow this, and it is unavoidable. What matters is the probability of a false positive and how to minimize it.

  1) Selection of Hash Functions
The choice of hash functions has a great impact on performance. A good hash function maps strings to every bit with approximately equal probability. Choosing k different hash functions is cumbersome; a simple alternative is to pick one hash function and feed it k different parameters.
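
One way to read "one hash function with k different parameters" is to vary a salt. The sketch below uses the `salt` argument of `hashlib.blake2b` as that parameter; the helper name `make_hash_functions` is hypothetical, and BLAKE2b is simply one convenient choice, not necessarily what the original author had in mind.

```python
import hashlib


def make_hash_functions(k, m):
    """Return k functions, each mapping a string to a bit index in [0, m)."""
    def make(i):
        salt = i.to_bytes(8, "big")  # the parameter that differs per function

        def h(text):
            d = hashlib.blake2b(text.encode("utf-8"), digest_size=8, salt=salt)
            return int.from_bytes(d.digest(), "big") % m
        return h
    return [make(i) for i in range(k)]


hashes = make_hash_functions(k=4, m=1000)
indexes = [h("http://example.com/a") for h in hashes]
print(indexes)  # four bit positions for the same URL
```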

  2) Bit Array Size Selection
The number of hash functions k, the size of the bit array m, and the number of inserted strings n are related. The literature proves that, for given m and n, the error probability is minimized when k = ln(2) * m/n.
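
To see the effect of k, the sketch below evaluates the commonly used approximation for the false positive rate, (1 - e^(-k*n/m))^k. That formula is the standard estimate from the literature and is not stated in the text above; the function names and the choice of 10 bits per string are assumptions made for illustration.

```python
import math


def false_positive_rate(m, n, k):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """k = ln(2) * m / n, rounded to a whole number of hash functions."""
    return max(1, round(math.log(2) * m / n))


n = 1_000_000
m = 10 * n                     # assumption: 10 bits per stored string
k = optimal_k(m, n)            # ln(2) * 10 is about 6.93, so k = 7
for candidate in (1, 4, k, 12):
    print(candidate, false_positive_rate(m, n, candidate))
```

With these numbers the rate at k = 7 is a little under 1%, and both fewer and more hash functions give a higher rate.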

(4) Advantages and disadvantages

Advantages:

Efficient query operations
Space saving
Easy to parallelize
Convenient set operations
Easy to implement in code

Disadvantages:

There is a probability of misjudgment, that is, false positives occur
The elements stored in the set cannot be retrieved
Deletion is not supported (deleting one string would affect other strings)

Note: to work around the lack of deletion, the counting Bloom filter (CBF) was introduced. It is a variant of the basic Bloom filter that replaces each bit with a counter, so a CBF can delete strings.
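
A counting Bloom filter can be sketched like this. It is a minimal illustration under assumptions: the class and method names are hypothetical, counters are unbounded Python integers rather than the small fixed-width counters a real implementation would use, and salted SHA-256 again stands in for k separate hash functions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant with a counter per slot, so items can be removed."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: salted SHA-256 stands in for k separate hash functions.
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only items that were really added should be removed; removing a
        # false positive would corrupt the counters of other items.
        if item not in self:
            raise KeyError(item)
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=1000, k=3)
cbf.add("a@example.com")
cbf.add("b@example.com")
cbf.remove("a@example.com")
print("b@example.com" in cbf)   # True: removing one item leaves the other
```

The price of deletion is space: each slot now holds a counter instead of a single bit.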

(5) Practical applications

1) Accelerating queries

This applies to key-value storage systems. When the values are stored on disk, a query is time-consuming. Insert all stored keys into the filter; if a key is not found in the filter, there is no need to query the storage at all. A false positive only costs one extra storage query. For example:

Google's Bigtable uses Bloom filters to reduce disk lookups for non-existent rows or columns, which greatly improves database query performance.
Many proxy caches that use the Internet Cache Protocol store URLs in Bloom filters. Besides efficient queries, this also makes cache information convenient to transmit and exchange.
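
The query-acceleration idea can be sketched as a store that consults a filter before touching slow storage. Everything here is hypothetical and for illustration only: `TinyBloom` and `FilteredStore` are made-up names, a dictionary stands in for the disk, and this is not how Bigtable or any particular proxy cache is implemented.

```python
import hashlib


class TinyBloom:
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k, self.bits = m, k, 0   # a Python int used as the bit array

    def _positions(self, key):
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class FilteredStore:
    """Hypothetical key-value store that checks the filter before 'disk'."""

    def __init__(self):
        self.disk = {}          # stands in for slow on-disk storage
        self.disk_reads = 0
        self.filter = TinyBloom()

    def put(self, key, value):
        self.disk[key] = value
        self.filter.add(key)

    def get(self, key):
        if not self.filter.might_contain(key):
            return None         # definitely absent: skip the disk read
        self.disk_reads += 1    # present, or a false positive
        return self.disk.get(key)


store = FilteredStore()
store.put("row1", "value1")
print(store.get("row1"))        # value1
```

Lookups for keys that were never stored return None without incrementing `disk_reads`, except in the rare false positive case, where they cost one wasted read.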

2) Network applications

Resource lookup in a P2P network: keep a Bloom filter for each network path, and when a lookup hits a filter, take that path.
Broadcasting messages: check whether a given IP address has already been sent the packet.
Loop detection for broadcast packets: store a Bloom filter in the packet and have each node add itself to the filter as the packet passes through.
Message queue management: use a counting Bloom filter to manage message traffic.

3) Spam address filtering

Public email providers such as NetEase and QQ constantly need to filter spam sent by spammers. One approach is to record the email addresses that send spam. Because spammers keep registering new addresses, there are at least several billion spam addresses worldwide, and storing them all would take a large number of servers. With a hash table, every 100 million email addresses require 1.6 GB of memory. (A typical implementation converts each email address into an eight-byte information fingerprint and stores the fingerprint in the hash table. Because the storage efficiency of a hash table is generally only 50%, each email address occupies 16 bytes, so 100 million addresses take about 1.6 GB, that is, 1.6 billion bytes of memory.) Storing several billion email addresses could therefore require hundreds of GB of memory. A Bloom filter solves the same problem using only 1/8 to 1/4 of the space of the hash table. A Bloom filter never misses a suspicious address that is on the blacklist. As for false positives, the common remedy is to keep a small whitelist of email addresses that might be misjudged.

(6) Code implementation

For a code implementation, see http://blog.csdn.net/forestlight/article/details/6839180, which provides a fairly detailed implementation. Two points in it can be tricky:

1) bloom *bloom_create(size_t size, size_t nfuncs, ...); is a variadic function. It relies on va_list, va_start, va_arg, and va_end; see the earlier article "Introduction to variable parameters in C and C++ (va_list, va_start, va_arg, va_end)".

2) typedef unsigned int (*hashfunc_t)(const char *); defines hashfunc_t as a function pointer type. See the earlier article "pointer function vs function pointer".
