Bloom filter for massive data processing algorithms

Source: Internet
Author: User
Tags: BitSet
Algorithm Introduction

The Bloom filter is named after its earliest author, Bloom. Put simply, a Bloom filter is used to test whether an element exists in a set, and to filter data accordingly. You may think this is trivial: to decide whether an element is in a collection, just traverse the set and compare element by element until you get the result. That works, of course, but when you are dealing with massive data volumes, the cost in both space and time becomes terrible. Clearly a better solution is needed, and the Bloom filter is a good one. Let's look at how to implement it.

Bloom Filter

Let's first talk about the traditional method of element retrieval. Suppose we store a bunch of URL strings in memory and are then given a specific URL to check against that collection: we must load the entire array into memory and compare the URLs one by one. Even if each URL averages only a few dozen bytes, once the data grows to a massive volume it is enough to exhaust the whole memory; that is the space limitation. Furthermore, successive traversal is itself a brute-force search, so lookup time grows linearly with the size of the set; once the data volume increases, the query time overhead also becomes terrible.

The Bloom filter offers a neat solution to both the time and the space problem. Take space first: instead of storing the original characters, we use single bits, so one mark costs only 1/8 of a byte, and whether your URL is 10 characters long or 100, it is represented by the same few bit positions. Because storage is bit-based, we hash each element to obtain a position and then set the bit at that position to 1 (the default is 0); we want the positions produced for different elements to collide as rarely as possible. To put it plainly, a Bloom filter consists of one long bit array plus several random hash functions. You can picture it as a long bit array (the figure from the original post is omitted here).
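The bit-marking idea above can be sketched in a few lines of Java. This is a minimal illustration, not the article's full implementation: the class name and the use of `String.hashCode()` as the single hash are my own assumptions for brevity.

```java
import java.util.BitSet;

// Minimal sketch: mark one bit per element instead of storing the element itself.
public class BitMarkSketch {
    static final int M = 8192;                // 1 KB of memory = 8192 bit positions
    static final BitSet bits = new BitSet(M);

    // Map a string to a bit position in [0, M); hash choice is illustrative only.
    static int position(String s) {
        return Math.floorMod(s.hashCode(), M);
    }

    static void add(String s) {
        bits.set(position(s));                // set the element's bit to 1
    }

    static boolean mightContain(String s) {
        return bits.get(position(s));         // bit still 0 means definitely absent
    }

    public static void main(String[] args) {
        add("http://example.com/a");
        System.out.println(mightContain("http://example.com/a"));
    }
}
```

With a single hash function, any collision immediately becomes a false positive, which is exactly the problem the next section addresses.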

As you can imagine, this array is very long: each mark occupies a single bit position, so 1 KB of space already represents 1024 * 8 = 8192 bits. Memory usage is therefore greatly reduced. Now a question: why do we use several random hash functions rather than just one? Because of hash collisions: even the best hash function cannot guarantee the absence of conflicts, so multiple hash functions are required. The condition for judging that an element exists then becomes: only when the bits at the positions produced by all of the hash functions are set do we say the element is in the set. This greatly improves the accuracy of the judgment. The hash mapping is illustrated in the original figure.
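The "all k positions must be 1" rule can be sketched as follows, using the same three string hashes (BKDR, SDBM, DJB) that the article's tool class uses later. The class name and constants here are illustrative.

```java
import java.util.BitSet;

// Sketch of a 3-hash Bloom filter: an element is judged present only if
// ALL three mapped bits are set; a single clear bit proves absence.
public class KHashSketch {
    static final int M = 100000;              // bit array length
    static final BitSet bits = new BitSet(M);

    static int h1(String s) {                 // BKDR hash
        long h = 0;
        for (char c : s.toCharArray()) h = h * 31 + c;
        return (int) Math.floorMod(h, (long) M);
    }

    static int h2(String s) {                 // SDBM hash
        long h = 0;
        for (char c : s.toCharArray()) h = c + (h << 6) + (h << 16) - h;
        return (int) Math.floorMod(h, (long) M);
    }

    static int h3(String s) {                 // DJB hash
        long h = 5381;
        for (char c : s.toCharArray()) h = ((h << 5) + h) + c;
        return (int) Math.floorMod(h, (long) M);
    }

    static void add(String s) {
        bits.set(h1(s));
        bits.set(h2(s));
        bits.set(h3(s));
    }

    static boolean mightContain(String s) {
        return bits.get(h1(s)) && bits.get(h2(s)) && bits.get(h3(s));
    }

    public static void main(String[] args) {
        add("exam");
        System.out.println(mightContain("exam"));
    }
}
```

A false positive now requires a query word to collide on all three positions at once, which is far less likely than a single collision.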

Assume that our program uses three random, independent hash functions, as shown in the original figure. Each element is mapped by the three hash functions and the three resulting positions are marked. Now let's compute the probability of a false positive for an element that was never inserted: for it to be misjudged, all three of its positions must already be occupied, i.e., each must collide with bits set by other elements. The worst case is that its three mapped positions completely overlap with those of some other element. Suppose the bit array length is 10,000 (1W). The probability of hitting any given position is 1/10,000, so the probability of overlapping on all three positions of one specific element is (1/10,000)^3, i.e., 1 in 10^12; even the loose upper bound, where we only require each of the three positions to be hit by some hash, is at most 1/10,000 + 1/10,000 + 1/10,000 = 0.0003. The conclusion is already obvious: three hash functions guarantee a low false positive rate, let alone four or five. The next question is what data structure to use for the bit array: an int array, or a character (char) array? The answer is neither, as explained below.
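The rough numbers above can be checked with a few lines of Java. This just verifies the arithmetic in the paragraph (m = 10,000 positions, k = 3 hashes); the method names are my own.

```java
// Quick check of the collision estimates: the loose union bound is k/m, and the
// chance of overlapping all k positions of one specific element is (1/m)^k.
public class FalsePositiveEstimate {
    static double unionBound(int k, double m) {
        return k / m;                       // at most k/m per query
    }

    static double fullOverlap(int k, double m) {
        return Math.pow(1.0 / m, k);        // all k positions coincide
    }

    public static void main(String[] args) {
        System.out.println(unionBound(3, 10000));   // about 0.0003
        System.out.println(fullOverlap(3, 10000));  // about 1e-12
    }
}
```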

BitSet

This is a data type in Java; I did not know such a class existed until now. Why choose it rather than the int or char arrays mentioned above? First, int won't work: a single int has 32 bits and occupies 4 bytes, so storing one flag per int obviously saves no space at all. Naturally we then think of the character array char[]. In C a char occupies one byte, but in Java, because of the different encoding, a char occupies two bytes, so using char saves only about half the space of int and still does not represent an element with a single bit. Later I checked: Java has a built-in BitSet for bit-level storage, and it supports many bit-related operations. It is used much like an array, with indices starting at 0; unfamiliar readers can look up the details themselves. In fact an int array can implement similar functionality, but you have to split each int into its 32 bits; I have written a related article about using a bitmap to store big data.
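For readers unfamiliar with it, here is a quick tour of `java.util.BitSet`. The specific indices are arbitrary; the point is that indices start at 0 and each flag costs one bit (packed into longs internally), unlike boolean[] or char[].

```java
import java.util.BitSet;

// Basic java.util.BitSet operations used by the Bloom filter below.
public class BitSetDemo {
    public static void main(String[] args) {
        BitSet bits = new BitSet(64);   // initial capacity hint in bits

        bits.set(3);                    // turn bit 3 on
        bits.set(10, true);             // same, with an explicit value
        bits.clear(3);                  // turn bit 3 back off

        System.out.println(bits.get(10));        // true
        System.out.println(bits.get(3));         // false
        System.out.println(bits.cardinality());  // number of set bits: 1
    }
}
```

Note that `BitSet` grows automatically as needed, but `set` and `get` throw `IndexOutOfBoundsException` for negative indices, which matters for the hash-sign issue discussed in the summary.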

Algorithm Implementation

The algorithm itself is very simple. Here I use a small data set for the simulation.

Input data input.txt:

mike study day get last exam think fish he
Then the query operation is run against the test data testInput.txt:

play mike study day get Axis last exam think fish he
These are just some words I put together casually.

Algorithm tool class BloomFilterTool.java:

    package BloomFilter;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * Bloom filter algorithm tool class
     *
     * @author lyq
     */
    public class BloomFilterTool {
        // length of the bit array, 100,000 bits
        public static final int BIT_ARRAY_LENGTH = 100000;
        // path of the original data file
        private String filePath;
        // path of the test (query) data file
        private String testFilePath;
        // bit array for storage; one bit per position
        private BitSet bitStore;
        // raw data
        private ArrayList<String> totalDatas;
        // query data for the test
        private ArrayList<String> queryDatas;

        public BloomFilterTool(String filePath, String testFilePath) {
            this.filePath = filePath;
            this.testFilePath = testFilePath;
            this.totalDatas = readDataFile(this.filePath);
            this.queryDatas = readDataFile(this.testFilePath);
        }

        /**
         * Read data from a file
         */
        public ArrayList<String> readDataFile(String path) {
            File file = new File(path);
            ArrayList<String> dataArray = new ArrayList<String>();
            try {
                BufferedReader in = new BufferedReader(new FileReader(file));
                String str;
                String[] tempArray;
                while ((str = in.readLine()) != null) {
                    tempArray = str.split(" ");
                    for (String word : tempArray) {
                        dataArray.add(word);
                    }
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return dataArray;
        }

        /**
         * Get all the query data
         *
         * @return
         */
        public ArrayList<String> getQueryDatas() {
            return this.queryDatas;
        }

        /**
         * Store the raw data into the bit array
         */
        private void bitStoreData() {
            long hashcode = 0;
            bitStore = new BitSet(BIT_ARRAY_LENGTH);
            for (String word : totalDatas) {
                // hash each word three times to reduce the false positive rate
                hashcode = BKDRHash(word);
                hashcode %= BIT_ARRAY_LENGTH;
                bitStore.set((int) hashcode, true);

                hashcode = SDBMHash(word);
                hashcode %= BIT_ARRAY_LENGTH;
                bitStore.set((int) hashcode, true);

                hashcode = DJBHash(word);
                hashcode %= BIT_ARRAY_LENGTH;
                bitStore.set((int) hashcode, true);
            }
        }

        /**
         * Bloom filter query: decide whether each query word exists in the original data
         */
        public Map<String, Boolean> queryDatasByBF() {
            boolean isExist;
            long hashcode;
            int pos1;
            int pos2;
            int pos3;
            // maps each query word to whether it is judged to exist
            Map<String, Boolean> word2exist = new HashMap<String, Boolean>();

            bitStoreData();
            for (String word : queryDatas) {
                isExist = false;
                hashcode = BKDRHash(word);
                pos1 = (int) (hashcode % BIT_ARRAY_LENGTH);
                hashcode = SDBMHash(word);
                pos2 = (int) (hashcode % BIT_ARRAY_LENGTH);
                hashcode = DJBHash(word);
                pos3 = (int) (hashcode % BIT_ARRAY_LENGTH);

                // the word is judged present only if all three bits are set
                if (bitStore.get(pos1) && bitStore.get(pos2) && bitStore.get(pos3)) {
                    isExist = true;
                }
                // save the result to the map
                word2exist.put(word, isExist);
            }
            return word2exist;
        }

        /**
         * Ordinary filter: query the data one by one by linear search
         */
        public Map<String, Boolean> queryDatasByNF() {
            boolean isExist = false;
            Map<String, Boolean> word2exist = new HashMap<String, Boolean>();

            for (String qWord : queryDatas) {
                isExist = false;
                for (String word : totalDatas) {
                    if (qWord.equals(word)) {
                        isExist = true;
                        break;
                    }
                }
                word2exist.put(qWord, isExist);
            }
            return word2exist;
        }

        /**
         * BKDR string hash algorithm
         *
         * @param str
         * @return
         */
        private long BKDRHash(String str) {
            int seed = 31; /* 31 131 1313 13131 131313 etc. */
            long hash = 0;
            for (int i = 0; i < str.length(); i++) {
                hash = (hash * seed) + str.charAt(i);
            }
            return Math.abs(hash);
        }

        /**
         * SDBM string hash algorithm
         *
         * @param str
         * @return
         */
        private long SDBMHash(String str) {
            long hash = 0;
            for (int i = 0; i < str.length(); i++) {
                hash = str.charAt(i) + (hash << 6) + (hash << 16) - hash;
            }
            return Math.abs(hash);
        }

        /**
         * DJB string hash algorithm
         *
         * @param str
         * @return
         */
        private long DJBHash(String str) {
            long hash = 5381;
            for (int i = 0; i < str.length(); i++) {
                hash = ((hash << 5) + hash) + str.charAt(i);
            }
            return Math.abs(hash);
        }
    }
Scenario test class Client.java:

    package BloomFilter;

    import java.text.MessageFormat;
    import java.util.ArrayList;
    import java.util.Map;

    /**
     * Bloom filter test class
     *
     * @author lyq
     */
    public class Client {
        public static void main(String[] args) {
            String filePath = "C:\\Users\\lyq\\Desktop\\icon\\input.txt";
            String testFilePath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";
            // total number of query words
            int totalCount;
            // number of correct results
            int rightCount;
            long startTime = 0;
            long endTime = 0;
            // query results of the Bloom filter
            Map<String, Boolean> bfMap;
            // query results of the ordinary (linear-search) filter
            Map<String, Boolean> nfMap;
            // all the query data
            ArrayList<String> queryDatas;

            BloomFilterTool tool = new BloomFilterTool(filePath, testFilePath);

            // query the words with the Bloom filter
            startTime = System.currentTimeMillis();
            bfMap = tool.queryDatasByBF();
            endTime = System.currentTimeMillis();
            System.out.println("BloomFilter algorithm time consumed: "
                    + (endTime - startTime) + "ms");

            // query the words with the ordinary filter
            startTime = System.currentTimeMillis();
            nfMap = tool.queryDatasByNF();
            endTime = System.currentTimeMillis();
            System.out.println("Normal traversal query time consumed: "
                    + (endTime - startTime) + "ms");

            boolean isExist;
            boolean isExist2;
            rightCount = 0;
            queryDatas = tool.getQueryDatas();
            totalCount = queryDatas.size();
            for (String qWord : queryDatas) {
                // take the traversal-query result as the ground-truth result
                isExist = nfMap.get(qWord);
                isExist2 = bfMap.get(qWord);
                if (isExist == isExist2) {
                    rightCount++;
                } else {
                    System.out.println("Incorrectly predicted word: " + qWord);
                }
            }
            System.out.println(MessageFormat.format(
                    "Bloom filter correct count is {0}, total query count is {1}, accuracy is {2}",
                    rightCount, totalCount, 1.0 * rightCount / totalCount));
        }
    }
In the test class I compared the time performance of the Bloom filter against the ordinary traversal search. When the data volume is small there is essentially no difference; the Bloom filter may even take longer, as in this test result:

BloomFilter algorithm took 2 ms; ordinary traversal query took 0 ms. Bloom filter correct count is 11, total query count is 11, accuracy rate 1.
However, when I tested with more realistic data, putting the raw data into a larger document and greatly increasing the number of query words, the same program produced the following:

BloomFilter algorithm took 16 ms; ordinary traversal query took 47 ms. Bloom filter correct count is 2,743, total query count is 2,743, accuracy rate 1.
This still falls far short of a true massive-data scenario, but the result is not hard to understand: ordinary brute-force search depends on the total amount of raw data, with time complexity O(n), while the Bloom filter only needs a few hash computations per query, a constant-time operation with complexity O(1).

Algorithm Summary

I ran into some small problems during implementation. First, with the hash functions: I picked three string hash functions somewhat at random and then found the computed index kept going out of bounds, because an overflowing value becomes negative and BitSet throws an error on a negative index. In C you could use an unsigned int, but Java has no such concept, so I simply took the absolute value of the hash. One characteristic of the Bloom filter is that it can produce false positives but never false negatives. A false positive means judging an element to be in the set when it is not, which hash collisions can cause. A false negative would mean judging an element that does exist to be absent; that is absolutely impossible, because if the element was inserted, every position it maps to was set to 1 by the hashes, and those same positions will be found set when you query it again. The algorithm has a wide range of applications, such as filtering junk email addresses.
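The negative-index issue above is worth a small sketch. Note that `Math.abs` mostly works but has one edge case: `Math.abs(Long.MIN_VALUE)` is still negative, so `Math.floorMod` is the safer idiom. The class and method names here are illustrative, not from the original code.

```java
// Java has no unsigned types, so a long hash can be negative, and hash % m then
// yields a negative index that makes BitSet throw. Math.abs covers most cases,
// but Math.abs(Long.MIN_VALUE) overflows back to a negative value;
// Math.floorMod always returns a result in [0, m).
public class IndexSafety {
    static int unsafeIndex(long hash, int m) {
        return (int) (Math.abs(hash) % m);          // breaks on Long.MIN_VALUE
    }

    static int safeIndex(long hash, int m) {
        return (int) Math.floorMod(hash, (long) m); // always in [0, m)
    }

    public static void main(String[] args) {
        System.out.println(unsafeIndex(Long.MIN_VALUE, 100000)); // negative
        System.out.println(safeIndex(Long.MIN_VALUE, 100000));   // non-negative
    }
}
```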
