Last time, I introduced the principle of Bayesian inference, and today we'll talk about how to use it for spam filtering.
========================================
Bayesian Inference and Its Applications on the Internet
Author: Ruan Yi Feng
(Continued from the previous part)
VII. What is a Bayesian filter?
Spam is a persistent nuisance that plagues every internet user.
Identifying spam correctly is technically difficult. Traditional spam filtering relies mainly on two methods: the "keyword method" and the "checksum method". The former flags messages that contain specific words; the latter computes a checksum of the message text and compares it with the checksums of known spam messages. Neither recognizes spam reliably, and both are easy to evade.
In 2002, Paul Graham proposed using "Bayesian inference" to filter spam. The results, he said, were remarkably good: out of 1,000 spam messages, the filter caught 995, without a single misjudgment.
In addition, this kind of filter is self-learning: it keeps adjusting itself based on newly received mail. The more spam it receives, the more accurate it becomes.
VIII. Building the historical database
The Bayesian filter is a statistical filter: it works from statistics that have already been gathered. So we must first provide two groups of messages that have been classified in advance, one of normal mail and one of spam.
We use these two groups of messages to "train" the filter. The larger the two groups, the better the training. Paul Graham used 4,000 normal messages and 4,000 spam messages.
The "training" process is simple. First, parse all the messages and extract every word. Then, calculate how often each word appears in normal messages and in spam messages. For example, suppose that of 4,000 spam messages, 200 contain the word "sex", so its frequency in spam is 5%, whereas of 4,000 normal messages only 2 contain it, a frequency of 0.05%. (Note: if a word appears only in spam, Paul Graham assumes that it appears in normal mail with a frequency of 1%, and vice versa. This avoids a probability of 0. As the number of messages grows, the result adjusts automatically.)
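The training step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Paul Graham's actual code: the function names (`tokenize`, `document_counts`, `train`), the regular expression used to split words, and the toy messages are all made up for this example. A word's frequency is taken to be the share of messages in a group that contain it at least once, and both groups are assumed to be non-empty.

```python
import re
from collections import Counter

# The 1% stand-in frequency described in the text, used when a word
# has never been seen in one of the two groups.
ASSUMED_FLOOR = 0.01


def tokenize(message):
    """Return the set of distinct lowercase words in one message.

    The splitting rule is an assumption made for this sketch.
    """
    return set(re.findall(r"[a-z0-9']+", message.lower()))


def document_counts(messages):
    """Count, for each word, how many messages contain it at least once."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def train(normal_messages, spam_messages):
    """Map each word to (frequency in normal mail, frequency in spam)."""
    normal_counts = document_counts(normal_messages)
    spam_counts = document_counts(spam_messages)
    table = {}
    for word in set(normal_counts) | set(spam_counts):
        normal_freq = normal_counts[word] / len(normal_messages)
        spam_freq = spam_counts[word] / len(spam_messages)
        # A word seen in only one group gets the 1% stand-in for the
        # other group, so neither frequency is ever exactly 0.
        table[word] = (normal_freq or ASSUMED_FLOOR,
                       spam_freq or ASSUMED_FLOOR)
    return table


# Toy data reproducing the numbers in the text: 200 of 4,000 spam
# messages and 2 of 4,000 normal messages contain the word "sex".
spam = ["sex pills"] * 200 + ["cheap loans"] * 3800
normal = ["sex education class"] * 2 + ["meeting notes"] * 3998

table = train(normal, spam)
print(table["sex"])    # (0.0005, 0.05), i.e. 0.05% and 5%
print(table["pills"])  # (0.01, 0.05): never seen in normal mail
```

Retraining on a larger collection simply recomputes the table, which is how the frequencies adjust as more mail arrives.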
With these preliminary statistics in hand, the filter can be put to use.