Bayesian inference and its Internet applications (II): filtering spam

Last time, I introduced the principle of Bayesian inference, and today we'll talk about how to use it for spam filtering.

Author: Ruan Yi Feng

(continued from the previous part)

VII. What is a Bayesian filter?

Spam is a persistent nuisance that plagues every Internet user.

Identifying spam correctly is technically very difficult. The traditional filtering methods are mainly the "keyword method" and the "checksum method". The former flags messages that contain specific words; the latter computes a checksum of the message text and compares it with the checksums of known spam messages. Neither recognizes spam well, and both are easy to evade.

In 2002, Paul Graham proposed using "Bayesian inference" to filter spam. The results, he said, were unbelievably good: out of 1,000 spam messages, the filter caught 995 without a single false positive.

In addition, this kind of filter is self-learning: it keeps adjusting itself according to newly received mail. The more spam it receives, the more accurate it becomes.
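To make the weakness of the two traditional methods concrete, here is a minimal sketch of both. The blocklist, the sample messages, and the function names are hypothetical and chosen only for illustration; SHA-256 stands in for whatever checksum a real filter would use.

```python
import hashlib

# Hypothetical blocklist and known-spam store, for illustration only.
BLOCKED_WORDS = {"viagra", "lottery"}
KNOWN_SPAM_CHECKSUMS = set()


def checksum(text: str) -> str:
    """Return a SHA-256 digest of the exact message text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def keyword_filter(text: str) -> bool:
    """Keyword method: flag a message that contains any blocked word."""
    return any(word in BLOCKED_WORDS for word in text.lower().split())


def checksum_filter(text: str) -> bool:
    """Checksum method: flag a message whose digest matches known spam."""
    return checksum(text) in KNOWN_SPAM_CHECKSUMS


# Register one known spam message.
KNOWN_SPAM_CHECKSUMS.add(checksum("You won the lottery"))

print(keyword_filter("Buy viagra now"))         # caught by the blocklist
print(keyword_filter("Buy v1agra now"))         # a one-character change slips through
print(checksum_filter("You won the lottery"))   # exact copy is caught
print(checksum_filter("You won the lottery!"))  # any edit changes the checksum
```

Both filters match exact strings, so a spammer only needs to alter a character to get past them.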

VIII. Building a historical database

A Bayesian filter is a statistical filter: it works from existing statistics. So we must supply two groups of messages that have been classified in advance, one of normal mail and one of spam.

We use these two groups of emails to "train" the filter. The larger the two groups, the better the training. Paul Graham used 4,000 normal messages and 4,000 spam messages.

The "training" process is simple. First, parse all the emails and extract every word. Then calculate how often each word appears in normal messages and in spam messages. For example, suppose that 200 of the 4,000 spam messages contain the word "sex", a frequency of 5%, whereas only 2 of the 4,000 normal messages contain it, a frequency of 0.05%. (Note: if a word appears in only one group, say only in spam, Paul Graham assumes that it appears with a frequency of 1% in the other group, here normal mail, and vice versa. This avoids a probability of 0. As the number of messages grows, the results adjust automatically.)
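The training step can be sketched roughly as follows. This is only an illustration under stated assumptions, not Paul Graham's actual code: the tokenizer, the function names, and the choice to count each word once per message (matching the "200 of 4,000 messages contain the word" example) are all assumptions. The 1% floor comes from the note above.

```python
import re
from collections import Counter

FLOOR = 0.01  # 1% frequency assumed for a word that is absent from one group


def tokenize(message: str) -> set[str]:
    """Return the distinct lowercase words in a message (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", message.lower()))


def document_counts(messages: list[str]) -> Counter:
    """Count, for each word, how many messages contain it."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def train(spam: list[str], normal: list[str]) -> dict[str, tuple[float, float]]:
    """Map each word to (frequency in spam, frequency in normal mail)."""
    spam_counts = document_counts(spam)
    normal_counts = document_counts(normal)
    table = {}
    for word in spam_counts.keys() | normal_counts.keys():
        # Use the floor instead of 0 when the word never appears in a group.
        p_spam = spam_counts[word] / len(spam) if spam_counts[word] else FLOOR
        p_normal = normal_counts[word] / len(normal) if normal_counts[word] else FLOOR
        table[word] = (p_spam, p_normal)
    return table


# Synthetic data reproducing the example: 200 of 4,000 spam messages and
# 2 of 4,000 normal messages contain the word "sex".
spam = ["sex"] * 200 + ["hello"] * 3800
normal = ["sex"] * 2 + ["hello"] * 3998
table = train(spam, normal)
print(table["sex"])  # (0.05, 0.0005), i.e. 5% and 0.05%
```

Retraining on a larger corpus simply recomputes the table, which is how the frequencies adjust as more mail arrives.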

With this preliminary statistical result, the filter can be put into use.
