Bloom Filter--a data structure with high space efficiency

Source: Internet
Author: User

First, hashing

1.1 Principle

Hash functions (hashing) are widely used in computer science, especially for fast data lookup.

A hash function maps data from a large input set onto a much smaller set of values (these values are called hashes, or hash values).

1.2 A typical hash function
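
As one illustration of what a typical hash function looks like, here is a minimal sketch of 32-bit FNV-1a, a simple and widely used non-cryptographic hash. The choice of FNV-1a and the helper name `bucket` are assumptions made for this example; the original text does not name a specific function.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime."""
    h = 0x811C9DC5  # 32-bit FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # 32-bit FNV prime, truncated to 32 bits
    return h


def bucket(key: str, n_buckets: int) -> int:
    """Hypothetical helper: map a string of any length to one of n_buckets slots."""
    return fnv1a_32(key.encode("utf-8")) % n_buckets
```

Whatever the input length, the result always fits in 32 bits, which is the "large set onto small set" mapping described in 1.1.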

1.3 Features

If two hash values differ (under the same hash function), then the two original inputs must differ.
The mapping from input to output is not one-to-one: if two hash values are the same, the two inputs are probably the same, but they may also be different. That case is known as a "hash collision" (or "hash conflict").
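
A collision is easy to provoke when the output range is tiny. The sketch below hashes five distinct keys into only four possible values, so at least two keys must share a value. The names `small_hash` and `find_collision`, the use of SHA-256, and the sample addresses are assumptions for illustration.

```python
import hashlib


def small_hash(key: str, n_values: int = 4) -> int:
    """Reduce a SHA-256 digest to one of n_values possible hash values."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_values


def find_collision(keys, n_values: int = 4):
    """Return the first pair of distinct keys that share a hash value, or None."""
    seen = {}
    for key in keys:
        h = small_hash(key, n_values)
        if h in seen:
            return seen[h], key
        seen[h] = key
    return None


keys = ["a@example.com", "b@example.com", "c@example.com",
        "d@example.com", "e@example.com"]
pair = find_collision(keys)  # five keys, four values: a collision is guaranteed
```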

1.4 Disadvantages

As Dr. Wu points out in "The Beauty of Mathematics", the space efficiency of a hash table is still not high enough. Suppose a hash table stores 100 million spam e-mail addresses, each as an 8-byte fingerprint. A hash table is generally only about 50% full, so each address effectively occupies 16 bytes. 100 million addresses therefore take up about 1.6 GB, and storing several billion addresses could require on the order of a hundred gigabytes of memory. An ordinary server cannot hold that much; it would take a supercomputer.
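
The arithmetic can be checked in a few lines. This is only a back-of-the-envelope sketch; the function name `hash_table_bytes` and the decimal definition of a gigabyte (10^9 bytes) are assumptions.

```python
def hash_table_bytes(n_items: int, bytes_per_item: int = 8,
                     load_factor: float = 0.5) -> float:
    """Memory needed when only load_factor of the table's slots are occupied."""
    return n_items * bytes_per_item / load_factor


GB = 10**9  # decimal gigabyte, assumed here

print(hash_table_bytes(100_000_000) / GB)  # 1.6
```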

Second, the Bloom filter

2.1 Principle

To determine whether an element is in a set, the usual approach is to store all the elements of the set and then compare against them. Linked lists, trees, hash tables and similar data structures all work this way. But as the number of elements in the set grows, more and more storage is needed, and lookups also become slower.

A Bloom filter is a space-efficient probabilistic data structure that can be regarded as an extension of the bitmap. Its principle is:
When an element is added to the set, k hash functions map it to k positions in a bit array, and those positions are set to 1. To look an element up, we only need to check whether all of these positions are 1 to know (approximately) whether it is in the set:
If any of these positions is 0, the element is definitely not in the set;
If all of them are 1, the element is probably in the set.
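
The add and lookup steps can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class and method names are hypothetical, and deriving the k positions from two halves of one SHA-256 digest (double hashing) is an assumption made to keep the example short, standing in for k separate hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        """Derive k bit positions from one digest (double hashing, an assumption)."""
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m_bits=1024, k_hashes=4)
bf.add("alice@example.com")
```

An added element always tests as present, so there are no false negatives. An element that was never added can still test as present if other elements happened to set all of its positions.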

2.2 Advantages

Its advantage is that its space efficiency and query time are far better than those of general-purpose approaches. A Bloom filter occupies a fixed amount of storage chosen up front, and both insertion and query take O(k) time no matter how many elements have been stored. In addition, the hash functions are independent of one another, which makes a parallel hardware implementation convenient. A Bloom filter does not need to store the elements themselves, which is an advantage in some settings with very strict confidentiality requirements.

2.3 Disadvantages

But the disadvantages of the Bloom filter are just as obvious as its advantages. The false positive rate is one of them: as the number of stored elements increases, the false positive rate increases. And if the number of elements is small, a hash table is sufficient.
(A common remedy for false positives is to keep a small whitelist of the items that are known to be misjudged.)
In addition, it is generally not possible to remove elements from a Bloom filter. An obvious idea is to turn the bit array into an array of integer counters, incrementing the corresponding counters each time an element is inserted and decrementing them when an element is deleted. However, removing elements safely is not that simple. First, we must be sure that the element being deleted really is in the Bloom filter, and the filter alone cannot guarantee this. In addition, counter overflow (wraparound) can also cause problems.
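
The counter idea can be sketched as a counting Bloom filter. All names below are hypothetical, and two design choices are assumptions rather than part of the original text: positions come from double hashing over a SHA-256 digest, and counters saturate at `max_count` and are then never decremented, which trades some accuracy for avoiding wraparound.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m_counters: int, k_hashes: int, max_count: int = 15):
        self.m = m_counters
        self.k = k_hashes
        self.max_count = max_count      # e.g. what a 4-bit counter can hold
        self.counters = [0] * m_counters

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:  # saturate instead of wrapping
                self.counters[pos] += 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item: str) -> bool:
        """Decrement the item's counters; refuse if it is definitely absent.

        Caveat: a false positive passes this check, so removing an item that
        was never added can still corrupt the filter.
        """
        if not self.might_contain(item):
            return False
        for pos in self._positions(item):
            if 0 < self.counters[pos] < self.max_count:  # saturated counters stay put
                self.counters[pos] -= 1
        return True
```

The price is memory: each position now needs several bits instead of one.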

2.4 Usage scenarios for the Bloom filter

Google Chrome has used a Bloom filter to identify malicious links

Detecting junk e-mail
Suppose we need to store 100 million e-mail addresses. We first set up a vector of 1.6 billion binary digits (bits), that is, 200 million bytes, and set all 1.6 billion bits to zero. For each e-mail address X, we use eight different random number generators (F1, F2, ..., F8) to produce eight information fingerprints (f1, f2, ..., f8). A random number generator G then maps these eight fingerprints to eight natural numbers g1, g2, ..., g8 in the range 1 to 1.6 billion. We now set the bits at all eight of these positions to 1. Once this has been done for all 100 million e-mail addresses, the Bloom filter for these addresses is built.
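
The numbers in this example can be sanity-checked with the commonly used approximation for the false positive rate of a Bloom filter, p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, n the number of stored items and k the number of hash functions. The approximation and the function name are not from the original text; they are brought in here as an assumption to estimate how the filter would behave.

```python
import math


def false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


m = 1_600_000_000   # bits in the vector
n = 100_000_000     # e-mail addresses
k = 8               # fingerprints per address

print(m // 8)       # 200000000 bytes
print(m // n)       # 16 bits per address
rate = false_positive_rate(m, n, k)
```

Under this approximation the rate comes out at roughly 0.06%, while the vector occupies 200 MB instead of the 1.6 GB estimated for a hash table in 1.4. The same function shows the rate rising as n grows with m and k held fixed.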

Given two files a and b, each storing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB, find the URLs common to both files. What if there are three files, or even n files?

Analysis: if a certain error rate is acceptable, a Bloom filter can be used. 4 GB of memory can represent roughly 34 billion bits. Map the URLs of one file onto these 34 billion bits with the Bloom filter, then read the URLs of the other file one at a time and check each against the filter. If a URL matches, it is probably a common URL (note that there will be a certain error rate).
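
The procedure in the analysis can be sketched at toy scale. The function names, the small default filter size, and the double-hashing scheme over SHA-256 are all assumptions for illustration; a real run would stream both files from disk and size the bit array near the 34 billion bits mentioned above.

```python
import hashlib


def positions(url: str, m_bits: int, k: int):
    """k bit positions for a URL (double hashing over one digest, an assumption)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m_bits for i in range(k)]


def build_filter(urls, m_bits: int, k: int) -> bytearray:
    """Set the k bits of every URL from the first file."""
    bits = bytearray((m_bits + 7) // 8)
    for url in urls:
        for pos in positions(url, m_bits, k):
            bits[pos // 8] |= 1 << (pos % 8)
    return bits


def probable_common(urls_a, urls_b, m_bits: int = 1 << 16, k: int = 5):
    """Yield URLs of b that are probably also in a (false positives possible)."""
    bits = build_filter(urls_a, m_bits, k)
    for url in urls_b:
        if all(bits[p // 8] & (1 << (p % 8)) for p in positions(url, m_bits, k)):
            yield url


file_a = ["http://example.com/1", "http://example.com/2"]
file_b = ["http://example.com/2", "http://example.com/3"]
common = list(probable_common(file_a, file_b))
```

Every URL that really is in both files appears in the result; a few URLs that are only in b may appear as well. For more than two files, one possible extension is to repeat the check against each further file, keeping only URLs that pass every filter, at the cost of accumulating error.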
