Hashing for Massive Data Processing: Online Email Address Filtering

Source: Internet
Author: User

(The title says "massive data" rather than "big data"; the term "big data" still feels a little vague.)

I. Requirements

We need to design a solution for filtering spam online by sender address. Our database already holds 1 billion valid email addresses (call this the valid address set S). When a new email arrives, we check whether its address is in the database. If it is, we accept the email; if it is not, we filter it out as spam.

II. Intuitive Method

When I first got this problem, I thought of binary search, which takes O(log n) time. First sort the 1 billion addresses; then, for each incoming email, use binary search to check whether its address is in S. Since log2(1,000,000,000) = 29.9, or about 30, each address needs at most 30 comparisons. That feels fast and should meet the requirement. But think again: binary search needs the data in memory. How much space do 1 billion addresses occupy? Assuming an average address length of 20 characters at 1 byte per character, one address takes 20 B, so 1,000,000,000 x 20 B = 20 GB. Does your machine have that much memory? Of course, the search could be done in segments read from disk, but each segment then costs several I/O operations, and the time required is no longer acceptable for online filtering.
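The two estimates above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the constant names are chosen here for illustration, and the 20-byte average is the assumption made in the text.

```python
import math

NUM_ADDRESSES = 1_000_000_000   # size of S, from the text
AVG_ADDRESS_BYTES = 20          # average address length assumed in the text

# Worst-case number of comparisons for binary search over a sorted array.
max_probes = math.ceil(math.log2(NUM_ADDRESSES))

# Raw storage for the addresses alone (no pointers or other overhead).
total_bytes = NUM_ADDRESSES * AVG_ADDRESS_BYTES

print(max_probes)            # 30
print(total_bytes / 10**9)   # 20.0 (decimal GB)
```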

III. Using Hashing to Solve the Problem

When the data volume is this large, some of our usual fast methods are not merely slower; they cannot be used at all. Binary search is therefore not acceptable here. We now introduce a remarkable technique that combines hashing with a bitmap to decide in constant time whether an email address is in S.

1. Preliminary design of the filter

We allocate 1 GB of memory (a little large, but PCs are powerful these days). One byte has 8 bits, so 1 GB holds about 8 billion bits (strictly 8 x 2^30 = 8,589,934,592 bits, but we use 8 billion here for convenience). Call this bitmap B, where B[i] denotes bit i of the bitmap. We design a hash function that maps an email address to an integer from 1 to 8 billion. First set all 8 billion bits to 0, then hash every address in S: if the hash of an address is the integer k, set bit k to 1, that is, B[k] = 1. If the hash function is well designed, then after all of S has been hashed about 1 billion of the 8 billion bits are 1 (the actual number is somewhat smaller than 1 billion, as analyzed in detail later). When an email arrives, hash its address and call the result p. If B[p] = 0, the address is definitely not in S, and the email is filtered out as spam; if B[p] = 1, the email is accepted.

[Figure: the filter with a single hash function]

Note that B[p] = 1 does not mean the new address is certainly in S, only that it probably is. B[p] = 1 tells us only that S contains some address a with hash(a) = p; it cannot rule out that a spam address also hashes to p. B[p] = 0, on the other hand, means that no address a in S has hash(a) = p, so the new address cannot be in S. Consequently, a valid address is never filtered out as spam, but some spam addresses still pass the filter: about 1/8 of spam (1/8 = 1 billion / 8 billion; the actual value is slightly smaller than 1/8, as analyzed in detail later). This 1/8 is called the false positive rate.
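A scaled-down sketch of this single-hash filter is shown below. It is an illustration under assumptions, not the author's implementation: the class name `SingleHashFilter`, the use of SHA-256 as the hash function, and the sample addresses are all hypothetical, and bit positions are 0-based rather than 1-based.

```python
import hashlib

class SingleHashFilter:
    """Bitmap filter with one hash function (illustrative, scaled down)."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _position(self, address):
        # SHA-256 is an assumption; any well-mixing hash function would do.
        digest = hashlib.sha256(address.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, address):
        p = self._position(address)
        self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, address):
        # False means "definitely not in S"; True means "probably in S".
        p = self._position(address)
        return bool(self.bits[p // 8] & (1 << (p % 8)))


valid = ["alice@example.com", "bob@example.com", "carol@example.com"]
f = SingleHashFilter(num_bits=8 * len(valid))  # 8 bits per address, as in the text
for a in valid:
    f.add(a)

# Every valid address is accepted: there are no false negatives.
print(all(f.might_contain(a) for a in valid))  # True
```

The method is named `might_contain` rather than `contains` to reflect that a set bit only suggests membership.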

Deploying this solution directly is still problematic because the false positive rate is fairly high. Let's calculate the proportion of spam among the emails we accept under this scheme. According to news reports, 80% of email worldwide is spam, so P(spam | accepted) = (1/8 * 80%) / (20% + 1/8 * 80%) = 1/3 (33.33%). In other words, on average one in three accepted emails is spam, which users would not tolerate. The scheme must be optimized, but note that we have already filtered out 7/8 of the spam without losing a single valid email; hashing is starting to show its power.
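The same ratio can be written as a small function, which is convenient for trying other false positive rates. The function name `spam_share` is made up for this sketch; the 80% spam fraction is the figure quoted in the text.

```python
def spam_share(false_positive_rate, spam_fraction=0.8):
    """Fraction of accepted emails that are spam.

    Assumes every legitimate email is accepted and that a spam email is
    accepted with probability false_positive_rate.
    """
    accepted_spam = spam_fraction * false_positive_rate
    accepted_valid = 1.0 - spam_fraction
    return accepted_spam / (accepted_spam + accepted_valid)

print(round(spam_share(1 / 8), 4))  # 0.3333
```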

2. False positive rate analysis

We mentioned earlier that the false positive rate is not exactly 1/8 but slightly smaller. We can now compute its true value using probability.

Let's start with a simpler example. Suppose there are m darts and n targets, and a master thrower (a master being someone who never misses the targets altogether, however he throws) throws the m darts one by one at the n targets. Assume each dart is equally likely to hit each target. What is the probability that a given target has no dart in it? This is not hard to compute. For any target W, P(a given dart hits W) = 1/n, so P(a given dart misses W) = 1 - 1/n. Throwing m darts can be viewed as m independent repeated trials, so P(W is hit by no dart) = (1 - 1/n)^m. Now ask: what is the probability that the target has at least one dart in it? From the previous step, P(W is hit by at least one dart) = 1 - P(W is hit by no dart) = 1 - (1 - 1/n)^m.

Now back to our problem. The 8 billion bits correspond to the n targets in the example, the 1 billion valid email addresses to the m darts, and the hash function to the master thrower. After S has been hashed, the probability that a given bit is 1 (that is, hit at least once) is p = 1 - (1 - 1/8,000,000,000)^1,000,000,000. Evaluating this directly on a calculator is unrealistic: 1 divided by 8 billion is already beyond the calculator's precision, so multiplying a billion times afterwards is meaningless. Computing this value calls for limits, and the great mathematicians have already done the work for us; we only need to apply the formula. Do you still remember this limit?

lim (x -> infinity) (1 - 1/x)^x = 1/e

Applying it, (1 - 1/n)^m = ((1 - 1/n)^n)^(m/n), which is approximately e^(-m/n), so p = 1 - e^(-m/n) = 1 - e^(-1/8) = 0.1175.

We can see that 0.1175 is not far from our preliminary estimate of 1/8 = 0.125, which is why we used 1/8 in the earlier analysis.

P(a bit is 1) is the false positive rate, that is, the probability that a spam email is accepted. Here is why: a spam email is accepted exactly when its address hashes to some position p with B[p] = 1, and clearly P(B[p] = 1) = P(a bit is 1).
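With floating-point functions designed for small arguments, the exact expression 1 - (1 - 1/n)^m can in fact be evaluated and compared with the exponential approximation. The sketch below uses Python's `math.log1p` and `math.expm1`; the variable names are chosen here for illustration.

```python
import math

n = 8_000_000_000   # bits in the bitmap
m = 1_000_000_000   # addresses hashed into it

# log1p and expm1 keep precision when 1/n is tiny:
# (1 - 1/n)^m = exp(m * log(1 - 1/n)), and 1 - exp(x) = -expm1(x).
exact = -math.expm1(m * math.log1p(-1.0 / n))
approx = -math.expm1(-m / n)   # 1 - e^(-m/n)

print(round(exact, 4), round(approx, 4))  # 0.1175 0.1175
```

At this scale the two values agree to many decimal places, which is why the approximation is used in the rest of the analysis.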

3. Optimizing the filter to reduce the false positive rate

As computed above, under the preceding scheme one in every three emails the user receives is spam. We improve the filter by using k hash functions H1, H2, ..., Hk, each mapping into the integers 1 to 8 billion. For each address in S, compute the k hash values p1, p2, ..., pk and set the corresponding bits to 1, that is, B[pi] = 1 (i = 1, 2, ..., k). When a new email arrives, compute the same k hash functions on its address to obtain p1, p2, ..., pk. If all of these bits are 1, the email is accepted; otherwise the address is definitely not in S and the email is filtered out as spam.

[Figure: the filter with k = 2 hash functions]

As with a single hash function, using k hash functions never rejects a valid email, but some spam is still accepted. Let's compute the false positive rate in this case. From the earlier analysis, with one hash function the probability that a given bit is 1 is p = 1 - e^(-m/n).

With k hash functions we are effectively throwing k * m darts, so the probability that a given bit is 1 is p = 1 - e^(-k*m/n). This time, however, the false positive rate is no longer P(a bit is 1): an email is accepted only when all k hashed bits are 1, and P(all k hashed bits are 1) = (1 - e^(-k*m/n))^k. The false positive rate is therefore (1 - e^(-k*m/n))^k. For k = 2 it is (1 - e^(-1/4))^2 = 0.048929094, much smaller than the 0.1175 of a single hash function. Does the false positive rate keep falling as k grows? Let's look at the graph:

[Figure: false positive rate as a function of k]

The false positive rate first decreases and then increases with k, eventually tending to 1. The optimal k lies between 5 and 6; since k must be an integer, the minimum is at k = 6, where the false positive rate is 0.0216. So six hash functions give the lowest false positive rate. In general, the optimal value is k = ln(2) * n/m. Remember this result.

[Figure: derivation of the optimal k]
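One standard way to arrive at k = ln(2) * n/m is to minimize the logarithm of the false positive rate. This is a sketch and not necessarily the calculation the author had in mind; the symbols f (the false positive rate as a function of k) and q (the probability that a given bit is still 0) are introduced only for this derivation.

```latex
\begin{aligned}
f(k) &= \left(1 - e^{-km/n}\right)^{k}, \qquad
q = e^{-km/n} \;\Longrightarrow\; k = -\frac{n}{m}\,\ln q, \\
\ln f &= k \,\ln(1 - q) = -\frac{n}{m}\,\ln q \,\ln(1 - q).
\end{aligned}
```

The product ln q * ln(1 - q) is symmetric under swapping q and 1 - q and is largest at q = 1/2, so ln f is smallest there. Setting e^(-k*m/n) = 1/2 gives k = ln(2) * n/m, and the false positive rate at that point is (1/2)^k. For n/m = 8 this gives k = 5.545, consistent with the optimum lying between 5 and 6.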

Now let's compute the share of spam among accepted emails in the optimal case k = 6: P(spam | accepted) = 80% * 0.0216 / (20% + 80% * 0.0216) = 0.0795, about 1/13. That is, on average about one in every 13 accepted emails is spam. This is roughly what I see in my own mailbox, so it is within what users can tolerate.

If we allocate 2 GB of memory instead, we have 16 billion bits and k = ln(2) * n/m = ln(2) * 16 = 11.09, so we take k = 11. The false positive rate is then p = 0.0004587 and P(spam | accepted) = 80% * 0.0004587 / (20% + 80% * 0.0004587) = 0.00183144 = 1/546. In other words, on average only one in 546 accepted emails is spam, which fully meets the business requirement. The solution is also very fast: each check only computes a few hash functions and looks up the bitmap, so it can run online (S must, of course, be hashed into the bitmap in advance).
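The figures for both memory sizes can be reproduced by evaluating the formula (1 - e^(-k*m/n))^k for each integer k and picking the smallest value. The helper names `false_positive_rate` and `best_k` and the search limit `k_max` are hypothetical, introduced only for this sketch.

```python
import math

def false_positive_rate(k, bits_per_address):
    """(1 - e^(-k*m/n))^k, written in terms of n/m = bits_per_address."""
    return (1.0 - math.exp(-k / bits_per_address)) ** k

def best_k(bits_per_address, k_max=64):
    """Integer k in 1..k_max with the smallest false positive rate."""
    return min(range(1, k_max + 1),
               key=lambda k: false_positive_rate(k, bits_per_address))

for ratio in (8, 16):   # 1 GB and 2 GB of bitmap per 1 billion addresses
    k = best_k(ratio)
    # Gives k = 6 (rate about 0.0216) and k = 11 (rate about 0.00046).
    print(ratio, k, false_positive_rate(k, ratio))
```

Searching over integers directly avoids having to decide how to round ln(2) * n/m.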

A filter built on multiple hash functions in this way is called a Bloom filter.

When a new valid address appears, add it to S, compute its k hashes, and set the corresponding bits to 1, so that email from the new address is not filtered out. Once enough new addresses have accumulated, k must be recalculated and S rehashed.
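The complete scheme, including adding a new valid address later, might look like the following scaled-down sketch. It is not the author's code: the class name `BloomFilter`, the trick of simulating k hash functions by prefixing the address with an index before SHA-256, and the sample addresses are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bitmap plus k hash functions (illustrative)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, address):
        # Assumption: the k hash functions are simulated by prefixing the
        # address with the index i before hashing with SHA-256.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{address}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, address):
        for p in self._positions(address):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, address):
        # Accept only if all k bits are set; any 0 bit proves absence.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(address))


bf = BloomFilter(num_bits=8 * 1000, num_hashes=6)   # n/m = 8, k = 6
for i in range(1000):
    bf.add(f"user{i}@example.com")

bf.add("newcomer@example.com")                   # a newly added valid address
print(bf.might_contain("newcomer@example.com"))  # True
print(bf.might_contain("user42@example.com"))    # True
```

Adding an address only ever sets bits, so existing members stay accepted; what degrades as S grows is the false positive rate, which is why k eventually has to be recalculated.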

In practice, most spam filtering is based on email content. After our filter has run, the remaining email can still be analyzed and filtered by content using natural language processing; by then our filter has already removed the vast majority of spam.

We have now seen the power of hashing for processing massive data. This is just one example; I believe you can apply it elsewhere too.

Finally, I have to say: it turns out that all that mathematical analysis, advanced algebra, and probability theory was useful after all, wasn't it?

References:

[1] Anand Rajaraman, Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2012.

[2] Wikipedia. E (mathematical constant). https://zh.wikipedia.org/wiki/E_(%E6%95%B0%E5%AD%A6%E5%B8%B8%E6%95%B0). 2013.5.27

[3] Jure Leskovec. Stanford CS246: Mining Massive Data Sets. http://www.stanford.edu/class/cs246/. 2013 (a good course, recommended)

[4] Wikipedia. Email spam. https://en.wikipedia.org/wiki/Email_spam. 2013.6.21

PS: Drawing tool: Word 2010

Data processing: Excel 2010

Thank you for your guidance on mathematical proof!

Thanks for reading. Comments are welcome.
