Bloom Filter: Introduction and Application in the Hadoop Reduce-Side Join

1. What problem does a Bloom filter solve?
It determines, using only a small amount of memory, whether an element belongs to a set, at the cost of a certain false positive rate.

2. Working principle
1. Initialize a bit array A = {a1, a2, a3, ..., am} with every bit set to 0.
2. Map each element xi of the known set S into A as follows:
2.0 Choose k mutually independent hash functions h1, h2, ..., hk.
2.1 Compute the group of index values h1(xi), h2(xi), ..., hk(xi) using those hash functions.
2.2 Set the bits at those index positions in A to 1 (if a different element has already set an index to 1, it simply stays 1).
3. To query an element x, hash it with the hash functions chosen in step 2.0 to get the index values h1(x), h2(x), ..., hk(x).
If the bits at all of these index positions in A are 1, x is judged to belong to the set S; otherwise it definitely does not belong to S.
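The steps above can be sketched as a minimal Bloom filter in Java (class and method names here are illustrative, not from the original article; the k index values are derived from two base hashes, a common trick, rather than k truly independent hash functions):

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash positions over an m-bit array.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;  // size of the bit array
    private final int k;  // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th index from two base hashes (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, m);
    }

    // Step 2: set the k index positions to 1.
    public void add(String key) {
        for (int i = 0; i < k; i++) bits.set(index(key, i));
    }

    // Step 3: the key may be in the set only if all k positions are 1.
    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(key, i))) return false; // definitely absent
        }
        return true; // probably present (false positives possible)
    }
}
```

A query that finds any 0 bit can safely answer "not in the set"; a query that finds all 1s can only answer "probably in the set".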

An example:

Build a bit array of 5 million bits (the bit-array size and the number of keywords together determine the false positive probability). For each keyword in the set, compute 32 values with 32 hash functions, take each value modulo 5,000,000, and set the corresponding positions in the bit array to 1; these positions form the keyword's signature. In short, each keyword is mapped to 32 positions in the bit array.

To look up a keyword quickly, run it through the same 32 hash functions and check the mapped positions in the bit array: if every corresponding bit is 1, the keyword matches (with some chance of a false positive).


3. Prerequisites
1. The hash functions must be cheap to compute; otherwise the cost outweighs the benefit.
2. Any two of the hash functions must be independent of each other.
That is, no hash function may depend on another; if every element hashed to one index were always forced onto a related index by another function, using multiple hash functions would be pointless.


4. Error rate
Of the two conclusions in step 3 of the working principle, one is absolutely reliable and the other is not 100% reliable: when deciding whether an element belongs to the set, an element outside the set may be mistakenly reported as a member (a false positive). Bloom filters are therefore unsuitable for "zero error" applications. In applications that can tolerate a low error rate, however, a Bloom filter trades very few errors for a great saving in storage space. The exact error rate is tied to the number of hash functions and the size of the bit array, and an optimal trade-off can be derived:
There is a relationship among the number of hash functions k, the bit-array size m, and the number of elements n. It can be proved that for given m and n, the error rate is minimized when k = ln(2) * m/n. For details see: http://blog.csdn.net/jiaomeng/article/details/1495500
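The relationship can be checked numerically; a small sketch (the m and n values are arbitrary, and the false positive formula p = (1 - e^(-kn/m))^k is the standard approximation):

```java
public class BloomMath {
    // Optimal number of hash functions for m bits and n elements:
    // k = ln(2) * m / n
    public static double optimalK(long m, long n) {
        return Math.log(2) * m / n;
    }

    // Expected false positive rate for given m, n, k:
    // p = (1 - e^(-k*n/m))^k
    public static double falsePositiveRate(long m, long n, double k) {
        return Math.pow(1 - Math.exp(-k * n / (double) m), k);
    }

    public static void main(String[] args) {
        long m = 1 << 20;   // about 1 million bits
        long n = 100_000;   // elements stored
        double k = optimalK(m, n);
        // With ~10.5 bits per element, k rounds to 7 and p is under 1%.
        System.out.printf("optimal k = %.2f, p = %.4f%n",
                k, falsePositiveRate(m, n, Math.round(k)));
    }
}
```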


5. Basic features
From the principles and mathematics above, we can derive the following basic characteristics of a Bloom filter, which guide its practical use:
(1) There is a certain error rate, which occurs only on positive judgments ("present"); negative judgments ("absent") are never wrong.
(2) The error rate is controllable: it can be adjusted by changing the bit-array size, the number of hash functions, or by using hash functions with a lower collision rate.
(3) To keep the error rate low, at least half of the bit array should remain empty.
(4) Given m and n, the optimal number of hash functions can be determined, k = ln(2) * (m/n), at which point the error rate is minimal.
(5) Given an allowed error rate ε, the required bit-array size can be determined, m >= log2(e) * n * log2(1/ε) (i.e. m >= n * log2(1/ε) / ln(2)), and from that the number of hash functions k.
(6) The false positive rate cannot be completely eliminated, even with no limit on the bit-array size or the number of hash functions; a zero error rate is unattainable.
(7) Space efficiency is high, but only a "presence" state is stored, not the complete information; other data structures are needed for auxiliary storage.
(8) Element deletion is not supported, because a deletion cannot be performed safely (clearing a shared bit may affect other elements).
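Feature (5) can be turned into a small sizing helper (a sketch; the class name is illustrative):

```java
public class BloomSizing {
    // Minimum bits for n elements at target false positive rate eps:
    // m >= n * log2(1/eps) / ln(2), roughly 1.44 * n * log2(1/eps).
    public static long minBits(long n, double eps) {
        double log2InvEps = Math.log(1.0 / eps) / Math.log(2);
        return (long) Math.ceil(n * log2InvEps / Math.log(2));
    }

    public static void main(String[] args) {
        // One million elements at a 1% error rate need under 10 million bits,
        // i.e. roughly 1.2 MB.
        System.out.println(minBits(1_000_000, 0.01));
    }
}
```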


6. Example application scenarios:
(1) Spell checking, database systems, file systems.
(2) Suppose you are writing a web crawler (spider). Because links between pages are highly interconnected, the spider may end up crawling around in a "ring". To avoid forming a ring, it needs to know which URLs it has already visited; given a URL, how does it quickly tell whether it has been seen before?
(3) Network applications
In peer-to-peer networks, resource lookup can keep a Bloom filter for each network path and, on a hit, select that path for access.
When broadcasting messages, a Bloom filter can detect whether an IP has already received the message.
Loop detection for broadcast packets: carry a Bloom filter in the packet, and have each node add itself to the filter.
Message queue management, using a counting Bloom filter to manage information flow.
(4) Spam e-mail address filtering
Public e-mail providers such as NetEase and QQ constantly need to filter out junk mail from spammers. One approach is to keep a record of e-mail addresses that send spam. Since those senders keep registering new addresses, there are said to be billions of spam addresses worldwide, and storing them all would require a large number of servers. With a hash table, storing 100 million e-mail addresses needs about 1.6 GB of memory (the usual implementation turns each address into an eight-byte information fingerprint and puts the fingerprint into a hash table; since hash-table storage efficiency is generally only about 50%, each address ends up occupying around 16 bytes, so 100 million addresses take roughly 1.6 billion bytes, i.e. 1.6 GB). Storing billions of addresses could therefore require hundreds of gigabytes of RAM. A Bloom filter solves the same problem in only 1/8 to 1/4 of the hash table's size, and it never misses a suspicious address that is on the blacklist. The common remedy for false positives is to maintain a small whitelist of legitimate addresses that must never be misjudged.
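The arithmetic in the paragraph above can be reproduced directly (the per-address byte counts are taken from the text; the class name is illustrative):

```java
public class SpamSizing {
    // ~16 bytes per address: an 8-byte fingerprint stored in a hash
    // table at roughly 50% load factor.
    public static long hashTableBytes(long n) { return n * 16; }

    // A Bloom filter at ~1/8 of that space: about 16 bits per key.
    public static long bloomFilterBytes(long n) { return n * 16 / 8; }

    public static void main(String[] args) {
        long n = 100_000_000L; // 100 million addresses, as in the text
        System.out.println(hashTableBytes(n));   // ~1.6 GB
        System.out.println(bloomFilterBytes(n)); // ~200 MB
    }
}
```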
(5) The role of Bloom filters in HBase
HBase uses Bloom filters to improve the performance of random reads (Get); for sequential reads (Scan), setting a Bloom filter is of little use (since 0.92, if the Bloom filter is set to ROWCOL, scans on a specified qualifier get some optimization, but not the kind that directly filters out files by excluding them from the lookup range).
The cost of Bloom filters in HBase:
The Bloom filter is a column-family (CF) level configuration property. If it is enabled on a table, HBase includes a Bloom filter data structure, called a meta block, in each StoreFile it generates. Meta blocks and data blocks (the real KeyValue data) are both managed by the LRU block cache, so enabling the Bloom filter incurs a certain amount of storage and cache-memory overhead.
How the Bloom filter improves random read (Get) performance:
For a random read within a region, HBase traverses the MemStore and the StoreFiles (in a certain order) and merges the results for the client. With a Bloom filter set, HBase can skip certain StoreFiles entirely while traversing them.
Note: HBase loads Bloom filters lazily. Under heavy write pressure there are continuous compactions producing new StoreFiles, and a new StoreFile's Bloom filter is not loaded into memory immediately; it waits until a read request arrives. This causes two problems: first, if StoreFiles are configured large (say a maximum size of 2 GB), the Bloom filters are correspondingly large; second, the system is under heavy read/write pressure. Together these can often make a single Get request take 3-5 seconds and time out.
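Since the Bloom filter is a column-family property, it is enabled when creating or altering a table. For reference, from the HBase shell this looks roughly like the following (table and family names are placeholders; the valid values are NONE, ROW, and ROWCOL):

```
create 'mytable', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}
```

ROW filters on the row key only; ROWCOL also includes the column qualifier, which is what enables the qualifier-scan optimization mentioned above.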

7. Reduce-side join + Bloom filter application in Hadoop:
In some cases the key set extracted from the small table in a semi-join still does not fit in memory, and a Bloom filter can be used to save space. Store the small table's keys in a Bloom filter and use it to filter the large table in the map phase. Some large-table records whose keys are not actually in the small table may fail to be filtered out (but no record whose key is in the small table will be dropped); that is acceptable, as it only adds a small amount of network IO. The actual table join is then done in the reduce phase.
In practice, the process "trains" a Bloom filter on the small table's data, writes the filter out as a binary sample file, places it in the distributed cache, and then reads it back in the map phase to filter the large table. Hadoop already ships Bloom filter support, so we only need to call the appropriate API. The code follows:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
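The listing is truncated after its imports in the source. Hadoop's built-in class for this is org.apache.hadoop.util.bloom.BloomFilter (which, being Writable, can be serialized into a file for the distributed cache). The train-then-filter flow it is used for can be sketched dependency-free with java.util.BitSet standing in for the Hadoop class (all names below are illustrative):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch of the reduce-side-join pre-filter: train a Bloom filter on the
// small table's keys, then drop large-table records that cannot join.
public class JoinPrefilter {
    static final int M = 1 << 20;  // bit-array size
    static final int K = 5;        // number of hash positions per key

    // Derive the i-th index from two base hashes (double hashing).
    static int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, M);
    }

    // "Training" phase: one pass over the small table's join keys.
    // (In the real job, the resulting filter is serialized to a binary
    // file and shipped through the distributed cache.)
    public static BitSet train(Iterable<String> smallTableKeys) {
        BitSet bits = new BitSet(M);
        for (String key : smallTableKeys)
            for (int i = 0; i < K; i++) bits.set(index(key, i));
        return bits;
    }

    // Map phase: emit only large-table records whose key might join.
    // False positives pass through (a little extra network IO), but no
    // record whose key really is in the small table is ever dropped.
    public static List<String> filter(BitSet bits, List<String> largeTableKeys) {
        List<String> out = new ArrayList<>();
        for (String key : largeTableKeys) {
            boolean maybe = true;
            for (int i = 0; i < K && maybe; i++) maybe = bits.get(index(key, i));
            if (maybe) out.add(key);
        }
        return out;
    }
}
```

The surviving records are then joined against the small table as usual in the reduce phase.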
