Spectral Bloom Filter (2)

Source: Internet
Author: User

The previous section described how SBF stores counters. To store them compactly, we first simplify the problem and ask how many bits are needed to hold all the counters. Suppose the SBF represents a multiset of M elements (duplicates allowed) and the counter array has length m (corresponding to the bit array of a Bloom filter). The minimum number of bits N needed for all the counters is then

$$N = \sum_{i=1}^{m} \lceil \log C_i \rceil$$

Here C_i is the value of the i-th counter in the counter array, that is, the number of times the hash functions map an element to position i. Storing the counters in N bits amounts to writing each counter as a binary string and concatenating the strings. This uses the fewest bits, but the counters now have different lengths, so locating a given counter becomes the main difficulty. Setting insertions and deletions aside for now, the goal is to keep the storage as close to N bits as possible while still supporting fast lookups.
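
To make the access problem concrete, here is a minimal Python sketch of the naive packing. The function names and the sample counter values are hypothetical, and the per-counter width used (Python's plain binary form, with a single bit for a zero counter) is an assumption for illustration that may differ slightly from the width the paper counts.

```python
def pack_counters(counters):
    """Concatenate the binary forms of the counters into one bit string.

    Returns the packed string and the start offset (in bits) of each counter.
    Assumption: a zero counter is written as the single bit "0".
    """
    pieces, offsets, pos = [], [], 0
    for c in counters:
        piece = format(c, "b")  # binary digits without the "0b" prefix
        offsets.append(pos)
        pieces.append(piece)
        pos += len(piece)
    return "".join(pieces), offsets


def read_counter(packed, offsets, i):
    """Decode counter i; this only works because every offset was kept."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(packed)
    return int(packed[offsets[i]:end], 2)


counters = [3, 0, 12, 1, 7]          # hypothetical counter values
packed, offsets = pack_counters(counters)
N = len(packed)                      # total bits used by the packed array
```

Without the `offsets` list there is no way to tell where one counter ends and the next begins, yet storing a full offset for every counter would cost more than the counters themselves. The index described next stores offsets only selectively.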

SBF does not invent any extraordinary trick here. As you may have guessed, it builds a hierarchy of index structures. SBF divides the N-bit base array into m/log N segments (substrings), each containing log N counters, and records the starting offset of each segment. Since each offset takes log N bits, the array holding these offsets (called the coarse vector in the paper) has a total length of m/log N × log N = m bits.
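
As a rough illustration of the coarse vector, the sketch below keeps one offset per group of about log N counters and uses it to find the bit range that holds a given counter. The helper names, the rounding of log N, and the sample counter widths are assumptions made for this example, not details taken from the paper.

```python
import math
from itertools import accumulate


def build_coarse_vector(offsets, N):
    """Keep the start offset of every group of about log2(N) counters.

    offsets: start bit of each counter in the base array (assumes N >= 1).
    """
    group_size = max(1, math.ceil(math.log2(N)))
    return group_size, offsets[::group_size]


def locate_group(coarse, group_size, N, i):
    """Return (group index, start bit, end bit) of the group holding counter i."""
    g = i // group_size
    start = coarse[g]
    end = coarse[g + 1] if g + 1 < len(coarse) else N
    return g, start, end


widths = [2, 1, 4, 1, 3, 5, 2, 1]              # hypothetical counter widths in bits
offsets = [0] + list(accumulate(widths))[:-1]  # start bit of each counter
N = sum(widths)
group_size, coarse = build_coarse_vector(offsets, N)
```

The coarse vector narrows the search to one substring; finding the counter inside that substring is the job of the lower levels.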

With the coarse vector we can jump directly to any substring. Within a substring there are two options: divide it further into sub-segments, or record the offset of every counter in it (an offset vector, OV). Substrings vary in length, but each contains the same number of counters, so every offset vector has the same size; it is therefore most cost-effective to spend offset vectors on the longest substrings. SBF stipulates that if a substring is longer than log^3 N bits, the counter positions are recorded directly in an offset vector; otherwise the subdivision continues. The N-bit base array contains at most N/log^3 N substrings longer than log^3 N bits, and each offset vector holds log N offsets of log N bits each, so the offset vectors at this level take at most N/log^3 N × (log N × log N) = N/log N bits in total.
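
The length test can be sketched as follows. The function name, the string labels, and the sample offsets are hypothetical; the threshold is the log^3 N rule described above, taken with base-2 logarithms as an assumption.

```python
import math


def classify_groups(coarse, N):
    """Decide, for each first-level substring, how its counters get indexed.

    coarse: start offsets of the substrings; N: total bits in the base array.
    Substrings longer than log2(N)**3 bits get a full offset vector,
    the others are subdivided further.
    """
    threshold = math.log2(N) ** 3
    ends = coarse[1:] + [N]
    plan = []
    for start, end in zip(coarse, ends):
        length = end - start
        kind = "offset_vector" if length > threshold else "subdivide"
        plan.append((start, length, kind))
    return plan


# Hypothetical layout: N = 4096 bits, so the threshold is 12**3 = 1728 bits.
plan = classify_groups([0, 100, 2000, 2050], 4096)
```

Because the long substrings are disjoint pieces of an N-bit array, there cannot be many of them, which is what keeps the total cost of their offset vectors low.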

If a substring is no longer than log^3 N bits, we divide it into log N/loglog N sub-segments, each containing loglog N counters. An offset inside such a substring takes log(log^3 N) = 3 loglog N bits, so the second-level offset array (one entry per sub-segment) for a single substring takes 3 loglog N × log N/loglog N = 3 log N bits. Over all m/log N substrings, these arrays take at most m/log N × 3 log N = 3m bits.
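
A small sketch of this second level, assuming sub-segments of about loglog N counters each (which is what the 3 log N total implies) and offsets stored relative to the start of the substring. The function name, the rounding, and the sample offsets are hypothetical.

```python
import math


def build_second_level(group_offsets, N):
    """Build the second-level offsets for one short substring.

    group_offsets: absolute start bit of each counter in the substring.
    Returns (counters per sub-segment, bits per stored offset, relative offsets).
    Assumes N >= 4 so that log2(log2(N)) is defined and positive.
    """
    group_start = group_offsets[0]
    chunk = max(1, round(math.log2(math.log2(N))))  # about loglog N counters
    entry_bits = 3 * chunk                          # log(log^3 N) = 3 loglog N
    relative = [o - group_start for o in group_offsets[::chunk]]
    return chunk, entry_bits, relative


# Hypothetical substring of 16 counters in a base array of N = 2**16 bits.
group_offsets = [1000, 1003, 1004, 1010, 1011, 1015, 1016, 1020,
                 1022, 1023, 1030, 1031, 1033, 1040, 1041, 1047]
chunk, entry_bits, relative = build_second_level(group_offsets, 2**16)
```

Storing offsets relative to the substring start is what lets each entry fit in 3 loglog N bits instead of log N bits.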

Again, not every sub-segment gets an offset vector for its counter positions; as before, only the long ones do. Let t be the length of a sub-segment in bits, and let the threshold be t0 = (loglog N)^3. When t > t0, the counter positions in the sub-segment are recorded in an offset vector. Since the sub-segment contains loglog N counters and each offset takes 3 loglog N bits, the offset vector takes at most loglog N × 3 loglog N = 3 (loglog N)^2 bits, which is ≪ t. There are at most N/t0 such sub-segments, so the offset vectors at this level take at most 3N/loglog N = o(N) bits in total.
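
To get a feel for the sizes involved, the sketch below adds up the per-level upper bounds for sample values of N and m. The function name and the sample values are hypothetical. The figures m, N/log N and 3m come from the text; the last bound, 3N/loglog N, is an assumption derived by the same counting argument (at most N/t0 long sub-segments, each costing 3 (loglog N)^2 bits).

```python
import math


def index_space_bounds(N, m):
    """Upper bounds, in bits, on each level of the index (base-2 logs assumed)."""
    log_n = math.log2(N)
    loglog_n = math.log2(log_n)
    return {
        "coarse_vector_1": m,                   # m/log N offsets of log N bits
        "offset_vectors_1": N / log_n,          # long first-level substrings
        "coarse_vector_2": 3 * m,               # 3 log N bits per substring
        "offset_vectors_2": 3 * N / loglog_n,   # long sub-segments (derived bound)
    }


bounds = index_space_bounds(N=2**16, m=4096)   # hypothetical sizes
```

For these small sample values the last bound is still a large fraction of N. The bound is asymptotic: it shrinks relative to N only as loglog N grows.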

That leaves the case t ≤ t0. Here SBF stops subdividing and instead handles all such sub-segments through a single global lookup table. I will not describe the lookup table here; if you are interested, see the original paper. In short, the goal of SBF counter storage is to use only N + o(N) + O(m) bits, with a build time of O(m), when insertions and deletions are not considered. The index hierarchy built above achieves this goal. In the next section we will see how insertions and deletions can be supported on this structure.
