Bloom Filter: A Mass Data Processing Algorithm

Last Update:2015-04-07 Source: Internet

Author: User


Algorithm Introduction

The Bloom filter takes its name from its inventor, Bloom; the Chinese name is simply a transliteration of it. Put simply, a Bloom filter checks whether an element exists in a collection, which makes it possible to filter data. You may think this is trivial: to decide whether an element is in a set, traverse the set and compare the elements one by one. That works, but with a massive amount of data the cost in space and time becomes prohibitive, and a better approach is needed. The Bloom filter is one such approach. Read on to see how it works.

Bloom Filter

Consider first the traditional way of looking up an element. Suppose a collection of URL strings is stored in memory, and we are given a URL and asked whether it is in the collection. We must load the whole array into memory and compare entry by entry. Even if each URL averages only a few bytes, a massive data set can exhaust memory; this is the space limitation. Moreover, traversing the entries one by one is a brute-force search whose running time grows linearly with the size of the collection, so queries become very slow once the data is large. The Bloom filter offers an effective answer to both the time and the space problem. Take space first. Instead of storing the characters of each element, we use 1 bit per element, that is, 1/8 of a byte, whether the URL is 10 characters long or 100. This means we want the bits that represent different elements to collide as rarely as possible. Because an element is stored as a single bit, we apply a hash function to the data to obtain its position and set the bit at that position to 1 (it is 0 by default). Put plainly, a Bloom filter consists of a very long bit array and several random hash functions. The bit array can be pictured as a long row of cells, each holding 0 or 1.
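As a rough illustration of the one-bit-per-element idea, the sketch below marks a single URL in a `java.util.BitSet`. The class name `BitArraySketch`, the 8192-bit size, the placeholder URL, and the use of `String.hashCode()` as the hash function are all assumptions made for this example; they are not the hash functions used later in this article.

```java
import java.util.BitSet;

final class BitArraySketch {
    // Hypothetical size: 8192 bits, i.e. 1 KB of bit storage
    static final int BIT_COUNT = 8192;

    // Map a string to a position in [0, BIT_COUNT).
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int position(String url) {
        return Math.floorMod(url.hashCode(), BIT_COUNT);
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(BIT_COUNT);          // all bits start at 0
        String url = "http://example.com/a";          // placeholder URL
        bits.set(position(url));                      // mark the element's position as 1
        System.out.println(bits.get(position(url)));  // true
        System.out.println(bits.cardinality());       // 1
    }
}
```

However long the URL is, only one bit changes, which is where the memory saving comes from.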

The array can be very long, yet each unit occupies only 1 bit, so 1 KB of space already holds 1024 * 8 = 8192 bits. This saves a great deal of memory. Why did I say "several random hash functions" rather than one? Because hash collisions occur, and even a good hash function cannot guarantee their absence, so several hash functions are used. The condition for judging that an element exists then becomes: the element is considered present in the collection only if the bits at the positions given by all the hash functions are set. This greatly improves the accuracy of the judgment. With several hash functions, each element is mapped to several positions in the bit array.
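The rule "present only if every hashed position is set" can be sketched as follows. This is a minimal sketch, not the implementation given later in the article: the class name `TinyBloomFilter`, the method names, and the three seeds are hypothetical, and a seeded polynomial hash stands in for truly independent hash functions.

```java
import java.util.BitSet;

final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    // Assumed seeds, one per hash function
    private final int[] seeds = {31, 131, 1313};

    TinyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Seeded polynomial string hash reduced to a position in [0, size)
    private int position(String s, int seed) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * seed + s.charAt(i);
        }
        return (int) Math.floorMod(h, (long) size);
    }

    // Set the bit at every hashed position
    void add(String s) {
        for (int seed : seeds) {
            bits.set(position(s, seed));
        }
    }

    // true means "possibly present"; false means "definitely absent"
    boolean mightContain(String s) {
        for (int seed : seeds) {
            if (!bits.get(position(s, seed))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(100000);
        filter.add("study");
        System.out.println(filter.mightContain("study"));  // true
    }
}
```

A word that was never added is normally reported as absent, but it can occasionally be reported as present when all of its positions happen to have been set by other words.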

Suppose the program uses 3 random, independent hash functions, so that each element is mapped by 3 different hash functions and 3 positions are marked. Let us estimate the probability of misjudging an element. For an element to be misjudged, all 3 of its positions must already be occupied, that is, each must collide with a position set by some other hash mapping. Assume the bit array is 10,000 bits long, so the probability of any given position being hit by one mapping is 1/10,000. In the most extreme case, the element's 3 positions coincide exactly with the 3 positions of one other element, which has probability 1/10,000 * 1/10,000 * 1/10,000 = 1/10^12. Taking instead the largest collision estimate, in which each position collides with one hash mapping, and adding the cases together gives 1/10,000 + 1/10,000 + 1/10,000 = 0.0003. These figures are only rough estimates, but the conclusion is clear: 3 hash functions already give a low misjudgment rate, and using 4 or 5 hash functions can lower it further. The next question is what to use for the bit array: an int array, a char array, or something else? The answer follows.
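For comparison with the rough estimate above, the sketch below evaluates the commonly quoted textbook approximation of the false-positive rate, (1 - e^(-k*n/m))^k, where m is the number of bits, n the number of inserted elements, and k the number of hash functions. This formula does not come from the original article, and the class name `FalsePositiveEstimate` and the choice of n = 100 inserted elements are assumptions for illustration.

```java
final class FalsePositiveEstimate {
    // m: bits in the array, n: inserted elements, k: hash functions.
    // Textbook approximation, assuming independent, uniform hash functions.
    static double estimate(int m, int n, int k) {
        double fractionSet = 1.0 - Math.exp(-(double) k * n / m);
        return Math.pow(fractionSet, k);
    }

    public static void main(String[] args) {
        // 10,000 bits as in the text; 100 inserted elements is an assumed load
        System.out.println(estimate(10000, 100, 3));
        System.out.println(estimate(10000, 100, 5));
    }
}
```

Under this lightly loaded setting the estimate with 5 hash functions is lower than with 3; with a heavily filled array, adding hash functions can have the opposite effect, because every insertion sets more bits.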

BitSet

BitSet is a data type in Java; I am not sure whether C or C++ has an equivalent class. Why choose it rather than the int or char array mentioned above? An int array is clearly unsuitable: an int has 32 bits and occupies 4 bytes, so using one int to store a single 0 or 1 saves no space. The next candidate is a character array, char[]. In C a char occupies 1 byte, but in Java, because of its character encoding, a char occupies 2 bytes, so a char array only halves the space used by an int array and still does not represent an element with a single bit. I later found that Java has a built-in BitSet class dedicated to bit storage and bit operations. It is used much like an array, and its indices also start from 0. Readers unfamiliar with it can look up the documentation. An int array can in fact achieve the same thing, but it requires a conversion that treats each int as 32 bits. I wrote about this earlier in an article on storing big data with the bitmap method.
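The conversion that treats each int as 32 bits can be sketched like this. It is a minimal sketch with a hypothetical class name, `IntBitmap`, and it performs no bounds checking; `BitSet` remains the more convenient choice in Java.

```java
final class IntBitmap {
    private final int[] words;

    IntBitmap(int bitCount) {
        // One int holds 32 bits, so round the word count up
        this.words = new int[(bitCount + 31) >>> 5];
    }

    // index >>> 5 selects the int, index & 31 selects the bit inside it
    void set(int index) {
        words[index >>> 5] |= 1 << (index & 31);
    }

    boolean get(int index) {
        return (words[index >>> 5] & (1 << (index & 31))) != 0;
    }

    int wordCount() {
        return words.length;
    }

    public static void main(String[] args) {
        IntBitmap bitmap = new IntBitmap(100000);
        bitmap.set(70);
        System.out.println(bitmap.get(70));      // true
        System.out.println(bitmap.get(71));      // false
        System.out.println(bitmap.wordCount());  // 3125
    }
}
```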

Implementation of the Algorithm

The algorithm is actually very simple. Here I simulate it with a small data set.

Input data, input.txt:

mike study day get last exam think fish he

The test data used for the query operation, testInput.txt:

play mike study day get axis last exam think fish he

These are just some words I put together casually.

The tool class of the algorithm, BloomFilterTool.java:

package BloomFilter;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/**
 * Bloom filter algorithm tool class
 *
 * @author lyq
 */
public class BloomFilterTool {
    // Length of the bit array, set to 100,000 bits
    public static final int BIT_ARRAY_LENGTH = 100000;

    // Path of the original data file
    private String filePath;
    // Path of the test data file
    private String testFilePath;
    // Bit array; each unit is stored in 1 bit
    private BitSet bitStore;
    // Original data
    private ArrayList<String> totalDatas;
    // Test query data
    private ArrayList<String> queryDatas;

    public BloomFilterTool(String filePath, String testFilePath) {
        this.filePath = filePath;
        this.testFilePath = testFilePath;

        this.totalDatas = readDataFile(this.filePath);
        this.queryDatas = readDataFile(this.testFilePath);
    }

    /**
     * Read space-separated words from a file
     */
    public ArrayList<String> readDataFile(String path) {
        File file = new File(path);
        ArrayList<String> dataArray = new ArrayList<String>();

        try {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String str;
            String[] tempArray;
            while ((str = in.readLine()) != null) {
                tempArray = str.split(" ");
                for (String word : tempArray) {
                    dataArray.add(word);
                }
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return dataArray;
    }

    /**
     * Get the query data
     */
    public ArrayList<String> getQueryDatas() {
        return this.queryDatas;
    }

    /**
     * Store the original data in the bit array
     */
    private void bitStoreData() {
        long hashcode = 0;
        bitStore = new BitSet(BIT_ARRAY_LENGTH);

        for (String word : totalDatas) {
            // Hash each word 3 times to reduce the probability of collisions
            hashcode = bkdrHash(word);
            hashcode %= BIT_ARRAY_LENGTH;
            bitStore.set((int) hashcode, true);

            hashcode = sdbmHash(word);
            hashcode %= BIT_ARRAY_LENGTH;
            bitStore.set((int) hashcode, true);

            hashcode = djbHash(word);
            hashcode %= BIT_ARRAY_LENGTH;
            bitStore.set((int) hashcode, true);
        }
    }

    /**
     * Query with the Bloom filter: judge whether each query word
     * exists in the original data
     */
    public Map<String, Boolean> queryDatasByBF() {
        boolean isExist;
        long hashcode;
        int pos1;
        int pos2;
        int pos3;
        // Map from query word to whether it exists
        Map<String, Boolean> word2exist = new HashMap<String, Boolean>();

        hashcode = 0;
        isExist = false;
        bitStoreData();
        for (String word : queryDatas) {
            isExist = false;

            hashcode = bkdrHash(word);
            pos1 = (int) (hashcode % BIT_ARRAY_LENGTH);

            hashcode = sdbmHash(word);
            pos2 = (int) (hashcode % BIT_ARRAY_LENGTH);

            hashcode = djbHash(word);
            pos3 = (int) (hashcode % BIT_ARRAY_LENGTH);

            // The word is judged present only if all 3 hash positions are set
            if (bitStore.get(pos1) && bitStore.get(pos2) && bitStore.get(pos3)) {
                isExist = true;
            }

            // Save the result in the map
            word2exist.put(word, isExist);
        }

        return word2exist;
    }

    /**
     * Query with plain traversal
     */
    public Map<String, Boolean> queryDatasByNF() {
        boolean isExist = false;
        // Map from query word to whether it exists
        Map<String, Boolean> word2exist = new HashMap<String, Boolean>();

        // Search by traversing the original data
        for (String qWord : queryDatas) {
            isExist = false;
            for (String word : totalDatas) {
                if (qWord.equals(word)) {
                    isExist = true;
                    break;
                }
            }

            word2exist.put(qWord, isExist);
        }

        return word2exist;
    }

    /**
     * BKDR string hash algorithm
     */
    private long bkdrHash(String str) {
        int seed = 31; /* 31 131 1313 13131 131313 etc.. */
        long hash = 0;
        int i = 0;

        for (i = 0; i < str.length(); i++) {
            hash = (hash * seed) + (str.charAt(i));
        }

        // Overflow can make the value negative, so take the absolute value
        hash = Math.abs(hash);
        return hash;
    }

    /**
     * SDBM string hash algorithm
     */
    private long sdbmHash(String str) {
        long hash = 0;
        int i = 0;

        for (i = 0; i < str.length(); i++) {
            hash = (str.charAt(i)) + (hash << 6) + (hash << 16) - hash;
        }

        hash = Math.abs(hash);
        return hash;
    }

    /**
     * DJB string hash algorithm
     */
    private long djbHash(String str) {
        long hash = 5381;
        int i = 0;

        for (i = 0; i < str.length(); i++) {
            hash = ((hash << 5) + hash) + (str.charAt(i));
        }

        hash = Math.abs(hash);
        return hash;
    }
}

The scenario test class, Client.java:

package BloomFilter;

import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.Map;

/**
 * Bloom filter test class
 *
 * @author lyq
 */
public class Client {
    public static void main(String[] args) {
        String filePath = "C:\\Users\\lyq\\Desktop\\icon\\input.txt";
        String testFilePath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";
        // Total number of query words
        int totalCount;
        // Number of correct results
        int rightCount;
        long startTime = 0;
        long endTime = 0;
        // Bloom filter query results
        Map<String, Boolean> bfMap;
        // Plain traversal query results
        Map<String, Boolean> nfMap;
        // Query data
        ArrayList<String> queryDatas;

        BloomFilterTool tool = new BloomFilterTool(filePath, testFilePath);

        // Query the words with the Bloom filter
        startTime = System.currentTimeMillis();
        bfMap = tool.queryDatasByBF();
        endTime = System.currentTimeMillis();
        System.out.println("BloomFilter algorithm took " + (endTime - startTime) + "ms");

        // Query the words with plain traversal
        startTime = System.currentTimeMillis();
        nfMap = tool.queryDatasByNF();
        endTime = System.currentTimeMillis();
        System.out.println("Normal traversal query took " + (endTime - startTime) + "ms");

        boolean isExist;
        boolean isExist2;

        rightCount = 0;
        queryDatas = tool.getQueryDatas();
        totalCount = queryDatas.size();
        for (String qWord : queryDatas) {
            // Use the traversal result as the reference result
            isExist = nfMap.get(qWord);
            isExist2 = bfMap.get(qWord);

            if (isExist == isExist2) {
                rightCount++;
            } else {
                System.out.println("Misjudged word: " + qWord);
            }
        }
        System.out.println(MessageFormat.format(
                "Bloom filter correct count: {0}, total queries: {1}, accuracy: {2}",
                rightCount, totalCount, 1.0 * rightCount / totalCount));
    }
}

In the test class I compare the running time of the Bloom filter with that of plain traversal search. When the amount of data is small there is practically no gap, and the Bloom filter may even take longer, as in the following test result:

BloomFilter algorithm took 2ms
Normal traversal query took 0ms
Bloom filter correct count: 11, total queries: 11, accuracy: 1

But when I tested with real data, using an ordinary document as the original data and multiplying the number of query words, the same program produced the following result:

BloomFilter algorithm took 16ms
Normal traversal query took 47ms
Bloom filter correct count: 2,743, total queries: 2,743, accuracy: 1

This is still far from simulating a massive-data scenario, but the result is easy to understand. Plain brute-force search depends on the total amount of original data and has time complexity O(n), whereas a Bloom filter query takes constant time: a few hash computations suffice, so its time complexity is O(1).

Algorithm Summary

I ran into a few small problems while implementing the algorithm. The first concerned the hash functions. I picked 3 string hash functions more or less at random and later found that they often overflow; an overflowed value becomes negative, and passing a negative index to BitSet throws an error. In C this could be solved with unsigned int, but Java has no such type, so I simply take the absolute value of the hash. One characteristic of the Bloom filter is that it may produce false positives but never false negatives. A false positive means that an element not in the set is judged to be present; hash collisions can cause this. A false negative means that an element that is in the set is judged to be absent, which cannot happen: if the element was added, the bits at its hash positions were set, and a later lookup will find them set. The algorithm has many applications; a typical one is filtering spam email addresses.
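One caveat about taking the absolute value: `Math.abs(Long.MIN_VALUE)` is still negative, so a hash that overflows to exactly that value would slip through. An alternative, shown here only as a sketch and not as the method used in this article, is to reduce the hash with `Math.floorMod`, which always returns a non-negative result for a positive length. The class name `SafeIndex` and the method `index` are hypothetical.

```java
final class SafeIndex {
    // Reduce any long hash, negative or not, to an index in [0, length)
    static int index(long hash, int length) {
        return (int) Math.floorMod(hash, (long) length);
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(Long.MIN_VALUE) < 0);         // true
        System.out.println(index(Long.MIN_VALUE, 100000) >= 0);   // true
        System.out.println(index(-7L, 5));                        // 3
    }
}
```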

References:

Bloom filter, Baidu Encyclopedia

http://blog.csdn.net/hguisu/article/details/7866173

My data mining algorithms: https://github.com/linyiqun/DataMiningAlgorithm

My algorithm library: https://github.com/linyiqun/lyq-algorithms-lib
