Massive data processing methods and analysis (1/3)

Source: Internet
Author: User
Tags: hash memory usage

1. Bloom filter

Applicability: it can be used to implement a data dictionary, to deduplicate data (test whether an item has already been seen), or to compute the intersection of data sets.

Basic principles and key points:
The principle is simple: a bit array plus k independent hash functions. To insert a keyword, set the bits at the k positions given by the hash functions to 1. To look up a keyword, check those k positions; if all of them are 1, the keyword is considered present. Obviously this process does not guarantee that a positive result is 100% correct, because the bits may have been set by other keywords (a false positive). A negative result, on the other hand, is always correct. An inserted keyword also cannot be deleted, because clearing its bits would affect other keywords that share them. A simple improvement is the counting Bloom filter, which supports deletion by replacing the bit array with an array of counters.
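The insert and lookup steps described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name BloomFilter is hypothetical, and the k "independent" hash functions are approximated by salting SHA-256 with the function index, which is an assumption made here for simplicity.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits plus k hash functions (hypothetical sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, key: str):
        # Assumption: k hash functions are simulated by salting SHA-256
        # with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        # Set the bit at each of the k positions.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


bf = BloomFilter(m=1024, k=7)
bf.add("http://example.com/a")
print("http://example.com/a" in bf)  # True: an inserted key is never missed
```

Note that there is no remove method: clearing a bit could also erase evidence of another key that hashes to the same position.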

Another important question is how to choose the size m of the bit array and the number k of hash functions from the number n of input elements. The error rate is minimized when k = (ln 2) * (m/n). To represent any set of n elements with an error rate no greater than E, m must be at least n * lg(1/E). In practice m should be larger, because at least half of the bit array should remain 0: m >= n * lg(1/E) * lg(e), which is about 1.44 * n * lg(1/E). Here lg is the base-2 logarithm and e is the base of the natural logarithm, so lg(e) is about 1.44.

For example, if the error rate is 0.01, then m should be about 1.44 * lg(100), or roughly 9.6, times n. In this case, k is about 7.
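The sizing formulas can be evaluated directly. The sketch below is only an illustration; the function name bloom_parameters and the choice of n = 1,000,000 are assumptions made for the example. It computes m = n * lg(1/E) * lg(e) and k = (ln 2) * (m/n), rounding k to the nearest integer.

```python
import math


def bloom_parameters(n: int, error_rate: float):
    """Return (m, k): bits in the array and number of hash functions."""
    # Bits per element: lg(1/E) * lg(e), where lg is the base-2 logarithm.
    bits_per_element = math.log2(1 / error_rate) * math.log2(math.e)
    m = math.ceil(n * bits_per_element)
    # Optimal number of hash functions: k = (ln 2) * (m / n).
    k = round(math.log(2) * m / n)
    return m, k


m, k = bloom_parameters(n=1_000_000, error_rate=0.01)
print(f"bits per element: {m / 1_000_000:.2f}, hash functions: {k}")
```

For an error rate of 0.01 this gives a little under 9.6 bits per element and 7 hash functions.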

Note that m and n have different units: m is measured in bits, while n is a count of elements (more precisely, of distinct elements). Because a single element is usually many bits long, a Bloom filter usually saves memory compared with storing the elements themselves.

Extension:
The Bloom filter maps the elements of a set into a bit array; whether all k mapped bits (k is the number of hash functions) are 1 indicates whether an element is in the set. The counting Bloom filter (CBF) extends each bit in the bit array to a counter, which makes it possible to delete elements. The spectral Bloom filter (SBF) associates the counters with the number of occurrences of each set element: the SBF uses the minimum value among an element's counters as an estimate of how often the element occurs.
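A rough sketch of the counter-based variants follows. It is a simplified illustration under stated assumptions: the class name CountingBloomFilter is hypothetical, the hash functions are again simulated by salting SHA-256, and estimate_count only shows the "minimum counter" idea attributed to the SBF above, not a full spectral Bloom filter.

```python
import hashlib


class CountingBloomFilter:
    """Counter array instead of a bit array, so keys can be removed."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, key: str):
        # Assumption: k hash functions simulated by salting SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.counters[pos] += 1

    def __contains__(self, key: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(key))

    def remove(self, key: str) -> None:
        # Removing a key that was never added would corrupt the counters,
        # so refuse when the filter reports the key as absent.
        if key not in self:
            raise KeyError(key)
        for pos in self._positions(key):
            self.counters[pos] -= 1

    def estimate_count(self, key: str) -> int:
        # Minimum counter: may overestimate, never underestimates.
        return min(self.counters[pos] for pos in self._positions(key))


cbf = CountingBloomFilter(m=1024, k=4)
for _ in range(3):
    cbf.add("apple")
print(cbf.estimate_count("apple"))  # at least 3
cbf.remove("apple")
print("apple" in cbf)  # still present: two occurrences remain
```

The price of deletion is memory: each position now needs several bits for its counter instead of one.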

Example problem: you are given two files, A and B, each containing 5 billion URLs of 64 bytes each, and a memory limit of 4 GB. Find the URLs common to files A and B. What if there are three or even n files?

For this problem, first work out the memory budget. 4 GB = 2^32 bytes, which is about 4.3 billion bytes, or about 34 billion bits. With n = 5 billion and an error rate of 0.01, about 9.6 bits per element are needed, roughly 48 billion bits in total. Only 34 billion bits are available, which is not far off, so a Bloom filter can still be used at the cost of a higher error rate. In addition, if these URLs correspond one-to-one to IP addresses, you can convert them into IP addresses, which makes the problem much simpler.
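The memory arithmetic for this problem can be checked with a few lines. The error-rate estimate uses the usual approximation (1 - e^(-k*n/m))^k for a Bloom filter's false positive rate, which is an addition here and not stated in the text above; the variable names are illustrative.

```python
import math

available_bits = 4 * 2**30 * 8          # 4 GB expressed in bits
n = 5_000_000_000                       # URLs in one file
target_error = 0.01

# Bits needed for the target error rate: n * lg(1/E) * lg(e).
required_bits = n * math.log2(1 / target_error) * math.log2(math.e)

# What the 4 GB budget actually allows.
bits_per_element = available_bits / n
k = round(math.log(2) * bits_per_element)
# Standard approximation of the false positive rate (assumption, see lead-in).
achievable_error = (1 - math.exp(-k * n / available_bits)) ** k

print(f"available: {available_bits / 1e9:.1f} billion bits")
print(f"required:  {required_bits / 1e9:.1f} billion bits")
print(f"with 4 GB: {bits_per_element:.2f} bits/element, k = {k}, "
      f"error rate about {achievable_error:.3f}")
```

Under these assumptions the 4 GB budget gives a bit under 7 bits per element, and the estimated error rate rises from 1% to a few percent.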

2. Hashing

Applicability: a basic data structure for fast lookup and deletion. It usually requires that the entire data set fit in memory.

Basic principles and key points:
Choice of hash function: there are specific hashing methods for strings, integers, permutations, and so on.
Collision handling: one approach is open hashing, also known as the zipper method (separate chaining); the other is closed hashing, also known as open addressing.
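The two collision strategies can be contrasted with a small sketch. Both classes are hypothetical and deliberately minimal (fixed size, no resizing, no deletion); linear probing is used as one possible form of open addressing, which is a choice made for this example. Python's built-in hash() stands in for the hash function.

```python
class ChainedTable:
    """Open hashing: each bucket holds a list (chain) of entries."""

    def __init__(self, size: int = 8) -> None:
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value) -> None:
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))       # collision: extend the chain

    def get(self, key):
        for existing, value in self.buckets[hash(key) % len(self.buckets)]:
            if existing == key:
                return value
        raise KeyError(key)


class LinearProbingTable:
    """Closed hashing (open addressing): on collision, try the next slot."""

    def __init__(self, size: int = 8) -> None:
        self.slots = [None] * size

    def _probe(self, key):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value) -> None:
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                break                     # empty slot: key cannot be further on
            if self.slots[i][0] == key:
                return self.slots[i][1]
        raise KeyError(key)


chained, probing = ChainedTable(), LinearProbingTable()
for table in (chained, probing):
    table.put(1, "a")
    table.put(9, "b")   # 9 % 8 == 1, so this collides with key 1
print(chained.get(9), probing.get(9))
```

With chaining, both keys end up in the same bucket; with linear probing, the second key moves to the next free slot. Deletion under open addressing needs extra care (for example, tombstone markers), which this sketch omits.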

Extension:
The d in d-left hashing means "multiple". To simplify the problem, first look at 2-left hashing. 2-left hashing divides a hash table into two halves of equal length, T1 and T2, and assigns each its own hash function, h1 for T1 and h2 for T2. When a new key is stored, both hash functions are computed to obtain two addresses, h1[key] and h2[key]. Then check position h1[key] in T1 and position h2[key] in T2 to see which one already holds more keys (that is, has had more collisions), and store the new key in the less loaded position. If both positions are equally loaded, for example both are empty or both hold one key, the new key is stored in the left subtable T1; this is where the name 2-left comes from. When searching for a key, you must compute both hashes and check both positions.
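The 2-left insertion rule can be sketched as follows. This is an illustrative sketch under assumptions: the class name TwoLeftHashTable is hypothetical, each position is modeled as a small list of keys, and h1 and h2 are simulated by salting SHA-256 with different prefixes.

```python
import hashlib


class TwoLeftHashTable:
    """Two equal halves T1 (left) and T2 (right), one hash function each."""

    def __init__(self, buckets_per_half: int = 8) -> None:
        self.t1 = [[] for _ in range(buckets_per_half)]
        self.t2 = [[] for _ in range(buckets_per_half)]

    def _index(self, salt: str, key: str) -> int:
        # Assumption: h1 and h2 are simulated by salting SHA-256.
        digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.t1)

    def _candidates(self, key: str):
        left = self.t1[self._index("h1", key)]
        right = self.t2[self._index("h2", key)]
        return left, right

    def insert(self, key: str) -> None:
        left, right = self._candidates(key)
        if key in left or key in right:
            return  # already stored
        # Store in the less loaded position; ties go to the left table T1.
        target = left if len(left) <= len(right) else right
        target.append(key)

    def __contains__(self, key: str) -> bool:
        # A lookup must check both candidate positions.
        left, right = self._candidates(key)
        return key in left or key in right


table = TwoLeftHashTable()
table.insert("first")
print(sum(len(b) for b in table.t1), sum(len(b) for b in table.t2))  # 1 0
for word in ["alpha", "beta", "gamma", "delta"]:
    table.insert(word)
print("gamma" in table, "omega" in table)  # True False
```

The first key always lands in T1 because both candidate positions are empty and ties go left; later keys are spread between the halves according to load.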
