Bloom filter deduplication based on Redis

Source: Internet
Author: User
Tags: bitset, redis, cluster

The Bloom filter algorithm and its application scenarios

  A Bloom filter stores data in a bitmap-like (bit set) structure: a bit array represents a set compactly and lets you quickly determine whether an element is already in the set. Because an element's positions are computed by hashing, add and query operations take O(k) time for k hash functions, which is effectively O(1) since k is a small constant. Because only bits are stored, the structure can hold massive amounts of data in very little memory. So is this algorithm perfect in both time and space? Of course not. The hashing that makes a Bloom filter efficient also means its answers are not always correct: the accuracy is not 100%. Even a good hash function has collisions, so the same position may be set to 1 by several different elements. As a result, an element that was never added may be mistaken for one that was (a false positive). An element that was added, however, is always reported as present (there are no false negatives).

  Note that this hashing differs from a HashMap. A HashMap can resolve collisions with open addressing or separate chaining because it is a key-value structure that stores the keys themselves, so an entry can be located and compared. A Bloom filter keeps only bits, the hash is irreversible, and collisions therefore cannot be resolved. Although a Bloom filter is not 100% accurate, the error rate can be lowered by tuning two parameters: the number of hash functions and the size of the bit array. With suitable values the error rate can be pushed close to 0, which is good enough for most scenarios.
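  To make the mechanics concrete, here is a minimal single-machine sketch built on the JDK BitSet. It is not the code of either library referenced in this article. The class name SimpleBloomFilter, the seeded FNV-1a hash, and the second seed value are assumptions chosen for illustration. The sizing uses the standard formulas m = -n ln p / (ln 2)^2 for the bit count and k = (m / n) ln 2 for the number of hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal single-machine Bloom filter backed by java.util.BitSet (illustrative sketch). */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of hash functions

    /** Sizes the filter for n expected elements and a target false positive rate p. */
    SimpleBloomFilter(int expectedElements, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // m = -n ln p / (ln 2)^2
        this.m = (int) Math.ceil(-expectedElements * Math.log(falsePositiveRate) / (ln2 * ln2));
        // k = (m / n) ln 2, at least one hash function
        this.k = Math.max(1, (int) Math.round((double) m / expectedElements * ln2));
        this.bits = new BitSet(m);
    }

    /** Sets the k bits that the value hashes to. */
    void add(String value) {
        for (int index : positions(value)) {
            bits.set(index);
        }
    }

    /** Returns false if the value was definitely never added; true means "probably added". */
    boolean mightContain(String value) {
        for (int index : positions(value)) {
            if (!bits.get(index)) {
                return false;
            }
        }
        return true;
    }

    int bitCount() { return m; }

    int hashCount() { return k; }

    /** Derives k positions from two base hashes (double hashing): h1 + i * h2 mod m. */
    private int[] positions(String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);     // standard FNV-1a offset basis
        int h2 = fnv1a(data, 0x9E3779B9) | 1; // arbitrary second seed (assumption), forced odd
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = Math.floorMod(h1 + i * h2, m);
        }
        return result;
    }

    /** 32-bit FNV-1a over the bytes, starting from the given seed. */
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 16777619; // FNV prime
        }
        return hash;
    }
}
```

  A caller would create the filter with the expected element count and target error rate, for example new SimpleBloomFilter(1000, 0.01), call add for each element, and call mightContain before processing a new one. A false result is always reliable; a true result can occasionally be a false positive.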

  For the theory behind Bloom filters, see:

http://blog.csdn.net/jiaomeng/article/details/1495500

https://en.wikipedia.org/wiki/Bloom_filter

  Applicable scenarios: a Bloom filter suits deduplication over large data volumes where 100% accuracy is not required.

Crawler link deduplication: a large crawler system has thousands upon thousands of links to crawl and must ensure that no link is crawled repeatedly, so the link list has to be deduplicated. Store each link's hash positions in the bit set, and check whether the link is already present before crawling it.

Website UV statistics: repeat visits by the same user are generally filtered out. The UV (unique visitor) count of a large website is huge, so a Bloom filter makes this more efficient to implement.

Combined with Redis

  The Bloom filter described above runs on a single machine and can be implemented with the BitSet that ships with the JDK. But a system handling large volumes of data never runs on just one server, so the bit data has to be shared across multiple servers. Redis bitmaps meet this requirement well. Thanks to Redis's high performance, and by submitting multiple bit operation commands in batches through a pipeline, several machines can share the bits of one Bloom filter. The one thing to note is that a Redis bitmap supports at most 2^32 bits, which corresponds to 512MB of memory, so the largest bit index is 2^32-1. This limit can be worked around by creating multiple bitmaps and distributing elements across them by hash. At a false positive rate of 1 in 10,000, a single 512MB bitmap can hold about 200 million elements.
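  The arithmetic behind these figures can be checked with a short sketch. The class name RedisBitmapPlan, the "bloom:" key prefix, and the choice to pick a bitmap by the element's hashCode are all assumptions made for illustration; the article does not say how its bitmaps were named or distributed. The capacity formula n = m (ln 2)^2 / (-ln p) assumes the optimal number of hash functions, and it gives a little over 220 million elements for 2^32 bits at p = 0.0001, consistent with the figure above.

```java
/** Arithmetic sketch for fitting a Bloom filter into Redis bitmaps of at most 2^32 bits each. */
class RedisBitmapPlan {
    /** Maximum bits in one Redis bitmap: 2^32 bits = 512 MB. */
    static final long MAX_BITS_PER_KEY = 1L << 32;

    /** Elements that fit in the given number of bits at false positive rate p (optimal k assumed). */
    static long capacity(long bits, double p) {
        double ln2 = Math.log(2);
        return (long) Math.floor(bits * ln2 * ln2 / -Math.log(p));
    }

    /** Number of hash functions for false positive rate p: k = -log2(p), rounded up. */
    static int hashCount(double p) {
        return (int) Math.ceil(-Math.log(p) / Math.log(2));
    }

    /**
     * Hypothetical sharding rule: every element is assigned to one of shardCount bitmaps,
     * and all of its bits are then set inside that single bitmap.
     */
    static String shardKey(String element, int shardCount) {
        return "bloom:" + Math.floorMod(element.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        System.out.println("MB per bitmap: " + (MAX_BITS_PER_KEY / 8 / 1024 / 1024));
        System.out.println("capacity at p=0.0001: " + capacity(MAX_BITS_PER_KEY, 0.0001));
        System.out.println("hash functions: " + hashCount(0.0001));
        System.out.println("key for example element: " + shardKey("http://example.com/", 4));
    }
}
```

  Keeping all bits of one element in a single bitmap means an add or query touches only one key, which fits well with sending the bit commands as one pipelined batch. With N bitmaps the total capacity grows roughly N-fold at the same error rate.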

Practice

  Two open source implementations from GitHub, built on the JDK BitSet, were used for testing.

  Open source code: https://github.com/MagnusS/Java-BloomFilter

https://github.com/Baqend/Orestes-Bloomfilter

  Test results (local testing; the time shown is the time spent per element):

  

Then the Java-BloomFilter source code was modified to work with Redis, and a test was run on a Redis cluster with 5 nodes.
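  The modified source is not reproduced here. Purely as an illustration of the general idea, and not as the author's actual change, the sketch below separates the Bloom filter logic from where the bits are stored. BitStore, InMemoryBitStore, and StoreBackedBloomFilter are hypothetical names, and the hashing is deliberately crude. A Redis-backed BitStore would issue the SETBIT and GETBIT commands instead of touching a local BitSet; the in-memory version only exists so the sketch runs without a server.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical abstraction over where the bits live; not taken from Java-BloomFilter. */
interface BitStore {
    void setBit(String key, long offset);    // Redis equivalent: SETBIT key offset 1
    boolean getBit(String key, long offset); // Redis equivalent: GETBIT key offset
}

/** In-memory stand-in for Redis so the sketch is runnable on its own. */
class InMemoryBitStore implements BitStore {
    private final Map<String, BitSet> bitmaps = new HashMap<>();

    public void setBit(String key, long offset) {
        bitmaps.computeIfAbsent(key, name -> new BitSet()).set((int) offset);
    }

    public boolean getBit(String key, long offset) {
        BitSet bitmap = bitmaps.get(key);
        return bitmap != null && bitmap.get((int) offset);
    }
}

/** Bloom filter logic that only talks to a BitStore, so the storage can be swapped. */
class StoreBackedBloomFilter {
    private final BitStore store;
    private final String key;
    private final int bitCount;
    private final int hashCount;

    StoreBackedBloomFilter(BitStore store, String key, int bitCount, int hashCount) {
        this.store = store;
        this.key = key;
        this.bitCount = bitCount;
        this.hashCount = hashCount;
    }

    void add(String value) {
        for (int i = 0; i < hashCount; i++) {
            store.setBit(key, position(value, i));
        }
    }

    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!store.getBit(key, position(value, i))) {
                return false;
            }
        }
        return true;
    }

    /** Crude double hashing for illustration only; a real filter would use stronger hashes. */
    private long position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = ((h1 * 0x9E3779B9) >>> 15) | 1;
        return Math.floorMod(h1 + i * h2, bitCount);
    }
}
```

  Two filter instances that share one store and one key see each other's additions, which is the property that lets several servers deduplicate against the same data. Against a real Redis server, the calls made for one element would typically be sent as a single pipelined batch.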

  Test results:

Initialization: 173070
Inserting data: 173070
Query data: 173070
Time: 350261ns
Memory: 326KB
Error Rate: 0%

As you can see, the Bloom filter combined with Redis still performs reasonably well.

  Redis + Bloom filter test source code: https://github.com/wxisme/redis-bloomFilter

  
