A large-volume data processing tool: the Bloom filter

Source: Internet
Author: User
Tags: bitset

Problems such as deduplicating a huge amount of data, or finding the IP address that stays the longest, come up often. Fellow bloggers mentioned the Bloom filter, so I did some research; the following is an introduction.

1. Introduction to the Bloom filter

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a long binary vector (bit array) and a series of random mapping functions, and it can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of general-purpose algorithms. Its disadvantages are a certain rate of misidentification (false positives: the Bloom filter may report that an element is in the set when in fact it is not) and the difficulty of deletion. However, it never errs in the other direction (no false negatives: if an element really is in the set, the Bloom filter will never report that it is absent, so nothing is missed).

The following starts from a simple sorting problem to introduce the bitmap algorithm, then moves on to data deduplication, and finally to the heavy-duty tool for big-data processing: the Bloom filter.

    • Sorting data with no duplicates

Given the data (2, 4, 1, 12, 9, 7, 6), how do we sort it?

Method 1: basic sorting algorithms such as bubble sort or quicksort.

Method 2: use the bitmap algorithm.

Method 1 needs no introduction. The bitmap in Method 2 is a bit array; the only difference from an ordinary array is that it is operated on bit by bit. First, open a bit array of length 12, which occupies 2 bytes (the length is determined by the largest value, 12, in the data above). Then read the data: when 2 is read, the bit at index 1 is set from 0 to 1; when 4 is read, the bit at index 3 is set from 0 to 1; and so on. Finally, scan the bit array in order and read off the sorted data: (1, 2, 4, 6, 7, 9, 12).

Comparing Method 1 and Method 2: in Method 2, both the time and the space required for sorting depend on the largest value in the data. For a maximum of 12, 2 bytes of memory must be allocated, and the whole bit array must be traversed. When the data is sparse, say only the 3 values (1, 1000, 10000000), the time and space costs of Method 2 are clearly excessive; but when the data is dense, the method shows its advantage.
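The bitmap sort described above can be sketched in a few lines of Java using the standard java.util.BitSet (the class names here are illustrative, not from the original article):

```java
import java.util.Arrays;
import java.util.BitSet;

public class BitmapSort {
    // Sorts distinct non-negative integers by marking each value's bit,
    // then reading the set bits back in ascending order.
    static int[] bitmapSort(int[] data) {
        BitSet bits = new BitSet();
        for (int v : data) {
            bits.set(v);             // flip the bit for value v from 0 to 1
        }
        int[] sorted = new int[data.length];
        int idx = 0;
        // scan the bit array in order: set bits come out ascending
        for (int v = bits.nextSetBit(0); v >= 0; v = bits.nextSetBit(v + 1)) {
            sorted[idx++] = v;
        }
        return sorted;
    }

    public static void main(String[] args) {
        int[] data = {2, 4, 1, 12, 9, 7, 6};
        System.out.println(Arrays.toString(bitmapSort(data)));
        // prints [1, 2, 4, 6, 7, 9, 12]
    }
}
```

BitSet grows to cover the largest value set, which is exactly the space behavior discussed above: length is driven by the maximum, not by the count of elements.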

    • Finding duplicates in data

Given the data (2, 4, 1, 12, 2, 9, 7, 6, 1, 4), how do we find the values that occur more than once?

Again open a 2-byte bit array of length 12 (the length is determined by the largest value, 12, in the data above). After the first 2 has been read and its bit set, when the second 2 is read the corresponding bit is found to be 1 already, so 2 is judged to be a duplicate.
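The duplicate-detection variant is the same idea with one extra check: a bit that is already set when a value arrives means that value was seen before. A minimal sketch (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BitmapDedup {
    // Returns the values that appear more than once, in the order
    // their repeat occurrences are encountered.
    static List<Integer> findDuplicates(int[] data) {
        BitSet seen = new BitSet();
        List<Integer> dups = new ArrayList<>();
        for (int v : data) {
            if (seen.get(v)) {
                dups.add(v);   // bit already 1: v is a repeat
            } else {
                seen.set(v);   // first sighting: mark the bit
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates(new int[]{2, 4, 1, 12, 2, 9, 7, 6, 1, 4}));
        // prints [2, 1, 4]
    }
}
```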

2. Bloom filter principle

The Bloom filter requires a bit array (much like a bitmap) and k mapping functions (similar to the hash functions of a hash table). In the initial state, all bits of a bit array of length m are set to 0. For a set of n elements S = {s1, s2, ..., sn} and k mapping functions {f1, f2, ..., fk}, each element sj in S (1 <= j <= n) is mapped to k values {g1, g2, ..., gk}, and the corresponding bits array[g1], array[g2], ..., array[gk] are set to 1. To look up whether an element item is in S, compute its k values {g1, g2, ..., gk} through the mapping functions and check whether array[g1], array[g2], ..., array[gk] are all 1: if all are 1, item is judged to be in S; otherwise item is definitely not in S. This is the implementation principle of the Bloom filter.

Of course, some readers may ask: even if array[g1], array[g2], ..., array[gk] are all 1, does that mean item must be in the set S? Not necessarily: the mapped values of several other elements in the set may happen to cover g1, g2, ..., gk, causing a misjudgment (a false positive). This probability can be made very small by choosing the array size and the number of hash functions appropriately, as derived below.

Obviously, the false positive rate of the Bloom filter is related to the design of the k mapping functions; by now, many efficient and practical hash functions have been designed. Note in particular that a standard Bloom filter does not allow deleting an element. This is because more than one element may map to the same bit, and membership is determined by all mapped bits being set, so clearing the bits of one element could also erase other elements and cause false negatives. However, there is a variant, the counting Bloom filter, which replaces each bit with a counter and does support deletion; interested readers can consult the relevant literature.
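The counting variant mentioned above can be sketched by swapping the bit array for an int array: insertion increments the k counters, deletion decrements them, and a zero counter means "definitely absent". This is a minimal sketch, reusing the prime-seed polynomial hash family from the article's own Java example; the class name and sizes are assumptions:

```java
public class CountingBloomFilter {
    private final int[] counters;
    private final int[] seeds = {3, 5, 7, 11, 17, 31};  // k = 6 hash seeds

    CountingBloomFilter(int size) {
        counters = new int[size];
    }

    // Polynomial hash with a per-function seed, folded into [0, size).
    private int hash(String s, int seed) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = h * seed + s.charAt(i);
        return Math.floorMod(h, counters.length);
    }

    void add(String s) {
        for (int seed : seeds) counters[hash(s, seed)]++;
    }

    // Deletion decrements rather than clears, so other elements that
    // share a position are not erased (no false negatives introduced).
    void remove(String s) {
        if (contains(s)) {
            for (int seed : seeds) counters[hash(s, seed)]--;
        }
    }

    boolean contains(String s) {
        for (int seed : seeds) if (counters[hash(s, seed)] == 0) return false;
        return true;
    }
}
```

The cost of deletability is memory: each position needs a counter (often 4 bits in practice) instead of a single bit.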

3. Derivation of the Bloom filter false positive probability

Now consider testing whether an element is in the set. As above, the element is judged present when all k of its mapped positions are 1, but this method may cause the algorithm to wrongly report an element not in the set as present (a false positive). Assume each hash selects each of the m bit positions uniformly and independently. After inserting n elements with k hash functions, the probability that a particular bit is still 0 is

(1 - 1/m)^(kn) ≈ e^(-kn/m),

so the false positive probability is

p = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k.

This result assumes that the bit positions set by each hash are independent. It is not hard to see that as m (the bit array size) increases, the false positive probability falls, while as n (the number of inserted elements) increases, it rises. For given m and n, the number of hash functions k that minimizes the false positive probability is

k = (m/n) · ln 2 ≈ 0.693 · m/n,

at which point the false positive probability is

p = (1/2)^k ≈ 0.6185^(m/n).

Conversely, for a given target false positive probability p, the optimal bit array size is

m = -n · ln p / (ln 2)^2,

which shows that the bit array size should be linear in the number of inserted elements n.
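The sizing formulas above are easy to evaluate numerically. A small sketch (method names are illustrative) that computes m and k for the 100-million-URL example discussed later:

```java
public class BloomParams {
    // m = -n * ln(p) / (ln 2)^2 : bit-array size for target false-positive rate p
    static long optimalM(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = (m/n) * ln 2 : number of hash functions minimizing the false-positive rate
    static int optimalK(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 100_000_000L;  // 100 million elements
        double p = 0.01;        // 1% target false-positive rate
        long m = optimalM(n, p);
        int k = optimalK(m, n);
        // roughly 9.6 bits per element: ~959 million bits (~114 MB), k = 7
        System.out.println("bits: " + m + " (~" + m / 8 / 1024 / 1024 + " MB)");
        System.out.println("hash functions: " + k);
    }
}
```

Note the rule of thumb this gives: about 9.6 bits per element for a 1% error rate, regardless of element size, which is why the Bloom filter is so memory-friendly compared with storing the elements themselves.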

4. Bloom filter applications

The Bloom filter works well in many situations, such as web URL deduplication, spam detection, testing sets for repeated elements, and query acceleration (for example in key-value storage systems). Here are a few examples:

    • There are two URL sets, A and B, each containing about 100 million URLs at 64 bytes per URL, and 1 GB of memory. How do we find the URLs that appear in both sets?

Obviously, a plain in-memory hash table exceeds the memory limit (100 million × 64 bytes is about 6.4 GB per set). Here are two approaches:

First: if no error rate is allowed, use the idea of divide and conquer. Hash-partition the URLs of set A and set B into files {f1, f2, ..., fk} and {g1, g2, ..., gk} respectively, using the same hash function, so that any URL common to both sets lands in files with the same index. Then read the contents of f1 into an in-memory hash_map, stream through the URLs in g1, and write any URL that also appears in f1 to an output file; when g1 is exhausted, move on to g2, ..., gk against their counterparts, then read f2 into memory, and so on, until all duplicate URLs are found.
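The partitioning step above can be sketched in memory with lists standing in for the on-disk files (class and method names are illustrative; a real version would write each bucket to a file):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionDedup {
    // Splits a URL collection into k buckets by hash, so any URL common to
    // two collections is guaranteed to land in buckets with the same index.
    static List<Set<String>> partition(List<String> urls, int k) {
        List<Set<String>> buckets = new ArrayList<>();
        for (int i = 0; i < k; i++) buckets.add(new HashSet<>());
        for (String url : urls) {
            buckets.get(Math.floorMod(url.hashCode(), k)).add(url);
        }
        return buckets;
    }

    static Set<String> commonUrls(List<String> a, List<String> b, int k) {
        List<Set<String>> fa = partition(a, k);
        List<Set<String>> fb = partition(b, k);
        Set<String> common = new HashSet<>();
        for (int i = 0; i < k; i++) {
            // only bucket i of A can contain matches for bucket i of B
            for (String url : fb.get(i)) {
                if (fa.get(i).contains(url)) common.add(url);
            }
        }
        return common;
    }
}
```

The key property is that both sets use the same hash, so only k pairwise comparisons are needed, each small enough to fit in memory.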

Second: if a certain error rate is acceptable, use the idea of the Bloom filter: insert all the URLs of set A into a Bloom filter sized to fit in the 1 GB of memory, then test each URL of set B against it.

    • In a web crawler, a very important step is detecting duplicate URLs. If all URLs are stored in a database, deduplication becomes inefficient once the number of URLs grows large. A common practice is to use a Bloom filter. Another approach is to store the URLs in Berkeley DB, a key-value based non-relational database engine, which can greatly improve the efficiency of URL deduplication.

    • Bloom filters are also used to filter malicious URLs: all malicious URLs are inserted into a Bloom filter, and every URL the user visits is tested against it; if it is flagged as malicious, the user is warned. In this case we can also maintain a whitelist of URLs that are frequently misjudged: a URL flagged by the filter is matched against the whitelist and released if present. This whitelist cannot be too big, and it does not need to be, because the Bloom filter's error probability is very small.
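The filter-plus-whitelist check described above can be sketched as follows, reusing the article's prime-seed hash family; the class name, array size, and whitelist contents are assumptions for illustration:

```java
import java.util.BitSet;
import java.util.Set;

public class MaliciousUrlFilter {
    private static final int LEN = 1 << 20;     // bit-array size (assumed)
    private final BitSet bits = new BitSet(LEN);
    private final int[] seeds = {3, 5, 7, 11, 17, 31};
    private final Set<String> whitelist;        // known false positives, kept small

    MaliciousUrlFilter(Set<String> whitelist) {
        this.whitelist = whitelist;
    }

    private int hash(String s, int seed) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = h * seed + s.charAt(i);
        return Math.floorMod(h, LEN);
    }

    void addMalicious(String url) {
        for (int seed : seeds) bits.set(hash(url, seed));
    }

    // A URL is blocked only if the filter flags it AND it is not whitelisted.
    boolean shouldBlock(String url) {
        if (whitelist.contains(url)) return false;
        for (int seed : seeds) {
            if (!bits.get(hash(url, seed))) return false;  // definitely not malicious
        }
        return true;
    }
}
```

The whitelist lookup runs first and is an exact set membership test, so a handful of known false positives can be tolerated without weakening the filter.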

5. A simple Java implementation of the Bloom filter

package a;

import java.util.BitSet;

/*
 * Open question: how large should DEFAULT_LEN be?
 * The hash values (and therefore the results) depend on DEFAULT_LEN.
 */
public class BloomFilterTest {
    // 2^30 bits
    static int DEFAULT_LEN = 1 << 30;
    // use primes as hash seeds
    static int[] seeds = {3, 5, 7, 11, 17, 31};
    static BitSet bitSet = new BitSet(DEFAULT_LEN);
    static MyHash[] myHashes = new MyHash[seeds.length];

    public static void main(String[] args) {
        String str = "[EMAIL PROTECTED]";  // the original test string was redacted by the scraper
        // create the hash functions once
        for (int i = 0; i < seeds.length; i++) {
            myHashes[i] = new MyHash(DEFAULT_LEN, seeds[i]);
        }
        bitSet.clear();
        // insert: set the k mapped bits for str
        for (int i = 0; i < myHashes.length; i++) {
            bitSet.set(myHashes[i].myHash(str), true);
        }
        // query: true only if all k mapped bits are set
        System.out.println(containsStr(str));
    }

    private static boolean containsStr(String str) {
        if (str == null) return false;
        for (int i = 0; i < seeds.length; i++) {
            if (!bitSet.get(myHashes[i].myHash(str))) return false;
        }
        return true;
    }
}

class MyHash {
    int len;
    int seed;

    public MyHash(int len, int seed) {
        this.len = len;
        this.seed = seed;
    }

    public int myHash(String str) {
        int result = 0;
        for (int i = 0; i < str.length(); i++) {
            result = result * seed + str.charAt(i);
        }
        // mask into [0, len); len is a power of two, so this also discards
        // the sign bit when result overflows to a negative value
        // (the original masked with str.length() - 1 by mistake)
        return (len - 1) & result;
    }
}
