Hash, a powerful tool for massive data processing: online email address filtering

Source: Internet
Author: User
Tags: hash, mail

The title says "massive data" (massive datasets) rather than "big data". "Big data" feels a little hollow to me, so let's look at something practical.

I. Requirements

We need to design a solution that filters spam addresses online. Our database already holds 1 billion legitimate email addresses (call this the legitimate address set S). When a new message arrives, we check whether its email address is in the database: if it is, we accept the message; if not, we filter it out as spam.

II. The intuitive approach

When I first saw this problem, I thought of binary search, which takes O(log n) time. First sort the 1 billion addresses; then, when an email address arrives, use binary search to check whether it is in S. Since log2(1,000,000,000) ≈ 29.9, roughly 30, each address needs only about 30 comparisons, which is very fast and should meet the requirement. But think again: binary search needs the whole data set in memory. That is fine for a small set, so let's work out how large 1 billion addresses are. Taking the average address length as 20 characters at 1 byte per character, one address takes 20 B, and 1,000,000,000 × 20 B = 20 GB, which does not fit in memory. We could of course split the data into segments, but segmentation requires multiple I/O operations, and online filtering cannot tolerate the time that takes.
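The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper `contains` and the sample addresses are made up for illustration, and the 20-byte average length is the article's own assumption.

```python
import math
from bisect import bisect_left

n = 1_000_000_000        # addresses in S
avg_len_bytes = 20       # assumed average address length (from the text)

comparisons = math.ceil(math.log2(n))   # worst-case binary search steps
total_bytes = n * avg_len_bytes         # raw storage for the sorted set


def contains(sorted_addresses, address):
    """Binary search membership test on an in-memory sorted list."""
    i = bisect_left(sorted_addresses, address)
    return i < len(sorted_addresses) and sorted_addresses[i] == address


# Hypothetical toy data; the real set would need about 20 GB of RAM.
sample = sorted(["alice@example.com", "bob@example.com", "carol@example.com"])

print(comparisons, total_bytes / 10**9, "GB")
print(contains(sample, "bob@example.com"))
```

The lookup itself is cheap; what rules this approach out is the size of the sorted list that has to stay in memory.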

III. Solving the problem with hashing

When the data volume is very large, a method that is normally fast may no longer finish within an acceptable time, and then it is not really a solution. Binary search is not feasible here. Let's introduce a magical method that uses a hash and a bitmap to determine in constant time whether an email address is in S.

1. Preliminary design of the filter

We allocate 1 GB of memory (a bit large, but today's PCs can handle it). One byte has 8 bits, so 1 GB holds 8 billion bits (strictly 8 × 2^30 = 8,589,934,592 bits, but we use 8 billion for ease of description). Denote this bitmap by B, where B[i] is the i-th bit of the bitmap, and design a hash function that maps an email address into the integer range 1 to 8 billion.

First set all 8 billion bits to 0. Then hash each email address in S to get an integer k and set the k-th bit to 1, that is, B[k] = 1. If the hash function is well designed, then after hashing all of S about 1 billion of the 8 billion bits should be 1 (the actual number is smaller than 1 billion; this is analyzed in detail later).

When a message arrives, hash its email address to get p. If B[p] = 0, the address is definitely not in S, so the message is filtered out as spam; if B[p] = 1, the message passes the filter and is received.

[Figure: an email address hashed to a bit position in bitmap B]
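The filter described here can be sketched in Python at a much smaller scale. Everything in this block is illustrative: the class name `BitmapFilter`, the generated addresses, and the use of SHA-256 reduced modulo the bitmap size are assumptions, since the article does not specify a hash function. The sketch keeps the article's ratio of 8 bits per address.

```python
import hashlib


class BitmapFilter:
    """Single-hash bitmap filter: one bit per hash slot."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _slot(self, address):
        # Assumption: SHA-256 modulo num_bits stands in for the article's
        # unspecified "well designed" hash function.
        digest = hashlib.sha256(address.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, address):
        k = self._slot(address)
        self.bits[k >> 3] |= 1 << (k & 7)   # B[k] = 1

    def might_contain(self, address):
        p = self._slot(address)
        return bool(self.bits[p >> 3] & (1 << (p & 7)))


# Hypothetical stand-in for S: 1,000 addresses instead of 1 billion.
legal = [f"user{i}@example.com" for i in range(1000)]
f = BitmapFilter(8 * len(legal))   # 8 bits per address, as in the article
for a in legal:
    f.add(a)

print(all(f.might_contain(a) for a in legal))
```

Every address that was added is reported as present. An address that was never added is usually rejected, but may be accepted if it happens to hash to a bit that is already set.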

Note that when B[p] = 1 we are not saying the new email address is definitely in S, only that it may be in S. B[p] = 1 only shows that S contains some email address URL with hash(URL) = p; it does not guarantee that no spam address also hashes to p. B[p] = 0 means that S contains no email address URL with hash(URL) = p, so the new address is definitely not in S. Therefore a message filtered out as spam never comes from a legitimate address, while an address that passes the filter may still be a spam address: about 1/8 of spam gets through (1/8 = 1 billion / 8 billion; the actual value is smaller than 1/8, as analyzed in detail later). This 1/8 is called the false positive rate.
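As a rough check on why the real rate is below 1/8, assume the hash spreads addresses uniformly and independently over the bitmap (an idealization, not something the article states). Then the expected fraction of bits set after inserting n addresses into m bits is 1 - (1 - 1/m)^n, which is approximately 1 - e^(-n/m). The variable names below are chosen for this sketch only.

```python
import math

n = 1_000_000_000   # addresses in S
m = 8_000_000_000   # bits in the bitmap (the article's rounded figure)

naive_rate = n / m                     # 1/8, if no two addresses collide
expected_rate = 1 - math.exp(-n / m)   # approximates 1 - (1 - 1/m)**n

print(naive_rate)               # 0.125
print(round(expected_rate, 4))  # 0.1175
```

Under this assumption roughly 11.75% of the bits end up set, slightly fewer than 12.5%, because some legitimate addresses collide with each other and share a bit.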

Putting this scheme online directly is a problem because the false positive rate is high. Let's calculate the share of spam in the mail we would receive under this scheme. According to news reports, 80% of the world's email is spam, so P(spam | received) = (1/8 × 80%) / (20% + 1/8 × 80%) = 1/3 (33.33%). In other words, on average one in every three received emails is spam, which users cannot tolerate, so the scheme must be optimized. Still, we have filtered out 7/8 of spam without dropping a single legitimate email, which already shows how sharp a weapon hashing is.
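The same calculation written out with exact fractions, as a small sketch. It assumes, as the formula above implicitly does, that every legitimate message comes from an address in S; the variable names are made up for this example.

```python
from fractions import Fraction

spam_share = Fraction(80, 100)     # share of all mail that is spam (article's figure)
false_positive = Fraction(1, 8)    # share of spam that slips through the filter

passed_spam = spam_share * false_positive
passed_legit = 1 - spam_share      # assumes every legitimate sender is in S
spam_among_received = passed_spam / (passed_spam + passed_legit)

print(spam_among_received)         # 1/3
```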
