Use the bloom filter algorithm to process large-scale data filtering.

The Bloom filter is a fast membership-test algorithm proposed by Burton H. Bloom in 1970. It uses several hash functions together to decide whether an element is in a set, and is commonly used for URL deduplication in web crawlers and for spam filtering.

Unlike a hash container, it does not store the actual elements and compare them one by one; it only marks the corresponding bits to record whether an element has been seen. This makes it especially suitable for massive data sets where memory must be conserved, and because no elements are stored or compared, it is also considerably faster than a hash container.

However, because a Bloom filter never compares the elements themselves and relies only on multiple hashes to decide uniqueness, hash collisions are possible and lead to false positives. The false positive rate depends on the number of hash functions, the quality of those hash functions, and the size of the bit array.

Deleting elements is also difficult. The usual solution is a variant, the counting Bloom filter, which is not covered here.

The implementation of the Bloom filter algorithm is quite simple:
First, allocate a fixed contiguous space of m bits (m/8 + 1 bytes) and provide k different hash functions. For each element, compute k bit indices. To insert an element, set the bit at each of its k indices to 1. To query, check those same bits: if the bit at every index is 1, the element is considered present; otherwise it is absent.

It follows that if the filter says an element does not exist, it certainly does not exist; a wrong answer (a false positive) can only occur when the filter says an element exists.
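
To make this concrete, here is a minimal sketch in C, independent of any particular library; the bit-array size, the number of hash functions, and the FNV-style mixing constants are illustrative assumptions:

#include <stdint.h>

#define M_BITS (1u << 20)   /* m: size of the bit array (assumed for this sketch) */
#define K_HASH 3            /* k: number of hash functions */

static uint8_t bits[M_BITS / 8];    /* m bits take m/8 bytes */

/* illustrative string hash: a different seed per hash-function index */
static uint32_t hash_i(const char *s, uint32_t i)
{
    uint32_t h = 2166136261u + i * 16777619u;   /* FNV-style constants, assumed */
    while (*s) h = (h ^ (uint8_t)*s++) * 16777619u;
    return h % M_BITS;
}

/* insertion: set all k bits for the element */
static void bloom_add(const char *s)
{
    for (uint32_t i = 0; i < K_HASH; i++) {
        uint32_t idx = hash_i(s, i);
        bits[idx >> 3] |= (uint8_t)(1u << (idx & 7));
    }
}

/* lookup: present only if every one of the k bits is set (may be a false positive) */
static int bloom_maybe_contains(const char *s)
{
    for (uint32_t i = 0; i < K_HASH; i++) {
        uint32_t idx = hash_i(s, i);
        if (!(bits[idx >> 3] & (1u << (idx & 7))))
            return 0;   /* a clear bit means the element was definitely never added */
    }
    return 1;
}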

The main difficulty with a Bloom filter lies in the estimation: how many hash functions and how much storage space are required for a given false positive rate?

First, let's look at the formula for the false positive rate of a Bloom filter.

Assuming k hash functions, m bits of storage, and n elements in the set, the false positive rate p is:

p = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k

From this, the standard formula for the optimal k, which minimizes the storage space for a given p, is:

k = (m/n) * ln 2

Substituting it back into the probability formula gives:

p = (1 - e^(-(m/n * ln 2) * (n/m)))^(m/n * ln 2)

which simplifies to:

ln p = -(m/n) * (ln 2)^2

Therefore, for a given p, the optimal solution is:

s = m/n = -ln p / (ln 2)^2 = -log2(p) / ln 2
k = s * ln 2 = -log2(p)

Theoretical values:

p < 0.1:    k = 3.321928,  m/n = 4.79
p < 0.01:   k = 6.643856,  m/n = 9.58
p < 0.001:  k = 9.965784,  m/n = 14.37
p < 0.0001: k = 13.287712, m/n = 19.170117
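
As a quick check of these numbers, here is a small sketch in C that evaluates the two formulas above; the list of target false positive rates is just an example:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double targets[] = { 0.1, 0.01, 0.001, 0.0001 };    /* example target rates */
    for (int i = 0; i < 4; i++) {
        double p = targets[i];
        double k = -log2(p);                        /* optimal number of hash functions */
        double s = -log(p) / (log(2) * log(2));     /* m/n: bits per element */
        printf("p < %-6g: k = %f, m/n = %f\n", p, k, s);
    }
    return 0;
}

It reproduces the values listed above.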

It can be seen that this minimizes the storage space while guaranteeing the false positive rate, but the required number of hash functions k is relatively large: at least 4, and 10 just to reach p < 0.001. For string hashing that is a considerable performance cost, which is unacceptable in practice.

Therefore, we need another formula that computes the space size s = m/n when both p and k are specified, which makes the trade-off much more flexible.

Next, let's look at the formula I derived. Start again from the false positive rate formula:

p = (1 - e^(-kn/m))^k

Let s = m/n; then:

p = (1 - e^(-k/s))^k

Take the logarithm of both sides:

ln p = k * ln(1 - e^(-k/s))

Divide both sides by k:

(ln p) / k = ln(1 - e^(-k/s))

Exponentiate both sides:

e^((ln p)/k) = 1 - e^(-k/s)

Rearranging:

e^(-k/s) = 1 - e^((ln p)/k) = 1 - (e^(ln p))^(1/k) = 1 - p^(1/k)

Taking the logarithm again:

-k/s = ln(1 - p^(1/k))

which gives:

s = -k / ln(1 - p^(1/k))

Let c = p^(1/k); then:

s = -k / ln(1 - c)

Using the Taylor expansion ln(1 + x) ≈ x - 0.5x^2 for |x| < 1, with x = -c, this simplifies to:

s ≈ -k / (-c - 0.5c^2) = 2k / (2c + c*c)

Finally, the formula is:

c = p^(1/k)
s = m/n = 2k / (2c + c*c)
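
The derived formula is easy to evaluate directly. Here is a small sketch in C; the candidate values of p and k are just examples:

#include <stdio.h>
#include <math.h>

/* bits per element for a target false positive rate p and k hash functions,
   using the approximation s = 2k / (2c + c*c) with c = p^(1/k) */
static double bits_per_element(double p, int k)
{
    double c = pow(p, 1.0 / k);
    return 2.0 * k / (2.0 * c + c * c);
}

int main(void)
{
    double rates[] = { 0.1, 0.01, 0.001, 0.0001 };  /* example rates */
    for (int i = 0; i < 4; i++)
        for (int k = 1; k <= 4; k++)
            printf("p < %-6g, k = %d: m/n = %f\n", rates[i], k, bits_per_element(rates[i], k));
    return 0;
}

For p < 0.1 and k = 3 this prints m/n = 5.245850, matching the table below.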

Assuming a data volume of n = 10,000,000, the theoretical values are:

p < 0.1, k = 1: s = m/n = 9.523810
p < 0.1, k = 2: s = m/n = 5.461082
p < 0.1, k = 3: s = m/n = 5.245850, space ≈ 6.3 MB
p < 0.1, k = 4: s = m/n = 5.552045, space ≈ 6.6 MB

p < 0.01, k = 1: s = m/n = 99.502488
p < 0.01, k = 2: s = m/n = 19.047619
p < 0.01, k = 3: s = m/n = 12.570636, space ≈ 15 MB
p < 0.01, k = 4: s = m/n = 10.922165, space ≈ 13 MB

p < 0.001, k = 1: s = m/n = 999.500250
p < 0.001, k = 2: s = m/n = 62.261118
p < 0.001, k = 3: s = m/n = 28.571429, space ≈ 34 MB
p < 0.001, k = 4: s = m/n = 20.656961, space ≈ 24.6 MB

p < 0.0001, k = 1: s = m/n = 9999.500025
p < 0.0001, k = 2: s = m/n = 199.004975
p < 0.0001, k = 3: s = m/n = 63.167063, space ≈ 75.3 MB
p < 0.0001, k = 4: s = m/n = 38.095238, space ≈ 45.4 MB
p < 0.0001, k = 5: s = m/n = 29.231432, space ≈ 34.8 MB

We can see that k = 3 is already acceptable for typical use, so there is no need to use the optimal solution unless the space requirement is extremely tight; this formula also makes it much more flexible to tune the program's space usage.

In particular, if each hash function is well implemented and distributes values very uniformly, the actual false positive rate can be noticeably lower than the theoretical value:

Testing TBOX's Bloom filter implementation with n = 10,000,000:

p < 0.01, k = 3: actual false positive rate 0.004965
p < 0.001, k = 3: actual false positive rate 0.000967

Therefore, a good hash function algorithm is particularly important.

Next, let's look at how to use the Bloom filter provided by TBOX, which is quite convenient:

// the total number of elements
tb_size_t count = 10000000;

/* init bloom filter
 *
 * TB_BLOOM_FILTER_PROBABILITY_0_01: predefined false positive rate, close to 0.01
 * note: it is expressed internally as a shift: 1 / 2^6 = 0.015625 ≈ 0.01,
 * so the actual rate passed in may be slightly larger, but it is still quite close
 *
 * 3: the k value, the number of hash functions, up to 15
 *
 * count: the expected number of elements
 *
 * tb_item_func_long(): the element type of the container, mainly used for its built-in hash function;
 * to customize the hash function, you can replace it:
 *
 *     tb_size_t tb_xxxx_hash(tb_item_func_t* func, tb_cpointer_t data, tb_size_t mask, tb_size_t index)
 *     {
 *         // mask is the hash mask, index is the index of the hash function
 *         return compute_hash(data, index) & mask;
 *     }
 *
 *     tb_item_func_t func = tb_item_func_long();
 *     func.hash = tb_xxxx_hash;
 */
tb_bloom_filter_ref_t filter = tb_bloom_filter_init(TB_BLOOM_FILTER_PROBABILITY_0_01, 3, count, tb_item_func_long());
if (filter)
{
    tb_size_t i = 0;
    for (i = 0; i < count; i++)
    {
        // generate a random number
        tb_long_t value = tb_random();

        // set the value into the filter if it does not exist yet
        if (tb_bloom_filter_set(filter, (tb_cpointer_t)value))
        {
            // the element was added successfully, so it did not exist before
            // no false positive here
        }
        else
        {
            // failed to add: the element already exists; a false positive is possible here
        }

        // only check whether the element exists
        if (tb_bloom_filter_get(filter, (tb_cpointer_t)value))
        {
            // the element already exists; a false positive is possible here
        }
        else
        {
            // the element does not exist; no false positive
        }
    }

    // exit filter
    tb_bloom_filter_exit(filter);
}

// other predefined false positive rates can be specified
// note: the value is a shift amount, not the actual rate
typedef enum __tb_bloom_filter_probability_e
{
    TB_BLOOM_FILTER_PROBABILITY_0_1      = 3    //!< 1 / 2^3 = 0.125 ≈ 0.1
,   TB_BLOOM_FILTER_PROBABILITY_0_01     = 6    //!< 1 / 2^6 = 0.015625 ≈ 0.01
,   TB_BLOOM_FILTER_PROBABILITY_0_001    = 10   //!< 1 / 2^10 = 0.0009765625 ≈ 0.001
,   TB_BLOOM_FILTER_PROBABILITY_0_0001   = 13   //!< 1 / 2^13 = 0.0001220703125 ≈ 0.0001
,   TB_BLOOM_FILTER_PROBABILITY_0_00001  = 16   //!< 1 / 2^16 = 0.0000152587890625 ≈ 0.00001
,   TB_BLOOM_FILTER_PROBABILITY_0_000001 = 20   //!< 1 / 2^20 = 0.00000095367431640625 ≈ 0.000001

} tb_bloom_filter_probability_e;

 


Can anyone give me a complete, compilable source program that implements a Bloom filter in Java?

import java.util.BitSet;

public class SimpleBloomFilter {
    private static final int DEFAULT_SIZE = 2 << 24;
    private static final int[] seeds = new int[] { 5, 7, 11, 13, 31, 37, 61 };
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    private SimpleHash[] func = new SimpleHash[seeds.length];

    public static void main(String[] args) {
        String value = "stone2083@yahoo.cn";
        SimpleBloomFilter filter = new SimpleBloomFilter();
        System.out.println(filter.contains(value));
        filter.add(value);
        System.out.println(filter.contains(value));
    }

    public SimpleBloomFilter() {
        for (int i = 0; i < seeds.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // set the bits of all hash functions for this value
    public void add(String value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    // the value is considered present only if every hash bit is set
    public boolean contains(String value) {
        if (value == null) {
            return false;
        }
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret & bits.get(f.hash(value));
        }
        return ret;
    }

    public static class SimpleHash {
        private int cap;
        private int seed;

        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }

        public int hash(String value) {
            int result = 0;
            int len = value.length();
            for (int i = 0; i < len; i++) {
                result = seed * result + value.charAt(i);
            }
            return (cap - 1) & result;
        }
    }
}

Hope this helps...
 
Write a program in C that finds the identical substrings in two strings entered by the user.

// Use the classic big-data processing algorithm, the Bloom filter, to search for identical elements in two sets and remove duplicates
#include <stdio.h>
#include <string.h>

unsigned char mask[8] = {128, 64, 32, 16, 8, 4, 2, 1};

// simple hash algorithm
int hashfuc(char *s, int key)
{
    int i, seed[4] = {5, 7, 11, 13}, value = 0;
    if (key >= 4) key %= 4;
    for (i = 0; s[i]; i++)
        value += s[i] * seed[key];
    return value;
}

// use the Bloom filter algorithm to map the substrings of s into the bit array m, removing repeated substrings
void bloomfilter(unsigned char *m, char *s)
{
    int i, j, hvalue, brepeat;
    char substr[32];
    for (i = j = 0; ; i++)
    {
        if (s[i] != ' ' && s[i] != '\t' && s[i] != 0)
            substr[j++] = s[i];
        else
        {
            substr[j] = 0;
            brepeat = 1;
            for (j = 0; j < 4; j++)
            {
                hvalue = hashfuc(substr, j) & 0x7F;
                if ((m[hvalue >> 3] & mask[hvalue & 7]) == 0)
                {
                    m[hvalue >> 3] |= mask[hvalue & 7];
                    brepeat = 0;
                }
            }
            // if it is a duplicate substring, remove it from s
            if (brepeat == 1)
            {
                j = strlen(substr);
                strncpy(s + i - j, s + i + 1, strlen(s) - i);
                // printf("Duplicate substring %s, after deduplication: %s\n", substr, s);
                i = i - j - 1;
            }
            if (s[i] == 0) break;
            j = 0;
        }
    }
}

int main()
{
    char s1[256], s2[256], substr[32];
    int i, j, hvalue;
    unsigned char m1[16] = {0}, m2[16] = {0}, m3[16];
    printf("First string\n");
    gets(s1);
    printf("Second str ...... remaining full text>
 
