Atitit: principles of text classification and spam detection with the Bayesian algorithm, and an applied solution
1.1. What is a Bayesian filter?
1.2. Establishment of a historical database
1.3. Calculation of joint probability
1.4. The final calculation formula
1.5. Threshold for comparison
1.1. What is a Bayesian filter?
Spam is a headache that plagues every internet user.
Correctly identifying spam is technically difficult. Traditional spam filtering relies mainly on the "keyword method" and the "checksum method". The former filters on specific words; the latter computes a checksum of the message text and compares it with the checksums of known spam. Neither performs well, and both are easy to evade.
In 2002, Paul Graham proposed using "Bayesian inference" to filter spam. He reported remarkable results: the filter caught 995 out of 1,000 spam messages without a single false positive.
The filter is also self-learning: it keeps adjusting itself based on newly received mail. The more spam it receives, the more accurate it becomes.
1.2. Establishment of a historical database
A Bayesian filter is a statistical filter: it is built on existing statistics. We therefore have to provide two sets of already classified mail in advance, one of normal mail and one of spam.
We use these two sets to "train" the filter. The larger the two sets, the better the training. Paul Graham used 4,000 normal messages and 4,000 spam messages.
The "training" process is simple. First, parse all messages and extract every word. Then calculate how often each word appears in normal messages and in spam. For example, suppose the word "sex" appears in 200 of the 4,000 spam messages, a frequency of 5%, while only 2 of the 4,000 normal messages contain it, a frequency of 0.05%. (Note: if a word appears only in spam, Paul Graham assumes that it appears in normal mail with a frequency of 1%, and vice versa. This avoids a probability of 0. As the number of messages grows, the results adjust automatically.)
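The training step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Paul Graham's actual code: the names `tokenize` and `train`, the whitespace tokenizer, and the `floor` parameter are all hypothetical. The sketch counts how many messages in each group contain a word and substitutes 1% when a word is missing from one group.

```python
from collections import Counter


def tokenize(text):
    """Return the distinct lowercase words of a message.

    Assumption: splitting on whitespace is good enough for illustration;
    a real filter would also handle punctuation, headers, and so on.
    """
    return set(text.lower().split())


def train(spam_messages, normal_messages, floor=0.01):
    """Return {word: (frequency_in_spam, frequency_in_normal)}.

    A frequency is the fraction of messages in a group that contain the word.
    """
    # Each message contributes a word at most once, because tokenize returns a set.
    spam_counts = Counter(w for m in spam_messages for w in tokenize(m))
    normal_counts = Counter(w for m in normal_messages for w in tokenize(m))

    freqs = {}
    for word in spam_counts.keys() | normal_counts.keys():
        p_spam = spam_counts[word] / len(spam_messages)
        p_normal = normal_counts[word] / len(normal_messages)
        # A word seen in only one group gets the floor (1%) in the other,
        # so that no frequency is ever 0.
        freqs[word] = (p_spam or floor, p_normal or floor)
    return freqs
```

With 200 of 4,000 spam messages and 2 of 4,000 normal messages containing "sex", `train` reports 0.05 and 0.0005 for that word, matching the 5% and 0.05% in the example.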
With this preliminary statistical result, the filter can be put into use.
1.3. Calculation of joint probability
After the step above, can we conclude that a new email is spam?
The answer is no. An email contains many words: some (such as "sex") suggest that it is spam, while others suggest that it is not. How do we know which words should prevail?
Paul Graham's approach is to select the 15 words in the message with the highest P(S|W), the probability that a message containing the word W is spam, and to calculate their joint probability.
Joint probability here means the probability of another event given that several events have all occurred. For example, if W1 and W2 are two different words that both appear in an email, then the probability that this email is spam is the joint probability.
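As a rough sketch of the selection step, the Python below computes P(S|W) for each word with Bayes' rule and keeps the highest-scoring words. Several things here are assumptions for illustration: the function names `spam_probability` and `top_spam_words`, the equal prior of 0.5 for spam and normal mail, and the choice to skip words that were never seen in training.

```python
def spam_probability(freq_in_spam, freq_in_normal, prior_spam=0.5):
    """P(S|W) by Bayes' rule.

    Assumption: spam and normal mail are equally likely a priori
    (prior_spam=0.5) unless the caller says otherwise.
    """
    spam_part = freq_in_spam * prior_spam
    normal_part = freq_in_normal * (1.0 - prior_spam)
    return spam_part / (spam_part + normal_part)


def top_spam_words(words, freqs, n=15):
    """Return up to n (word, P(S|W)) pairs with the highest P(S|W).

    freqs maps word -> (frequency_in_spam, frequency_in_normal).
    Words missing from freqs are skipped, which is a simplification.
    """
    scored = {w: spam_probability(*freqs[w]) for w in set(words) if w in freqs}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:n]
```

For the word "sex" with frequencies 5% and 0.05%, `spam_probability(0.05, 0.0005)` comes out at roughly 0.99 under the equal-prior assumption.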
1.4. The final calculation formula
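Extending the two-word case above to 15 words gives the final probability formula:
P = (P1 * P2 * ... * P15) / (P1 * P2 * ... * P15 + (1 - P1) * (1 - P2) * ... * (1 - P15))
Here P1 to P15 are the P(S|W) values of the 15 selected words. Whether an email is spam is decided by evaluating this formula.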
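A minimal Python sketch of the combining step follows. It uses the product form published in Paul Graham's "A Plan for Spam", P = p1...pn / (p1...pn + (1-p1)...(1-pn)); the function name `combine` is an assumption made for the example, and the code requires Python 3.8 or later for `math.prod`.

```python
import math


def combine(word_probs):
    """Combine per-word spam probabilities into one joint probability.

    Implements P = prod(p) / (prod(p) + prod(1 - p)).
    Assumption: no input is exactly 0 or 1 (the 1% floor used in training
    is meant to guarantee this), otherwise the denominator may be 0.
    """
    spam_side = math.prod(word_probs)
    normal_side = math.prod(1.0 - p for p in word_probs)
    return spam_side / (spam_side + normal_side)
```

Two words at 0.99 and 0.75 combine to about 0.997, while a 0.99 word next to two 0.05 words combines to about 0.22: strongly "normal" words can outweigh a single suspicious one.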
1.5. Threshold for comparison
We also need a threshold to compare the result against. Paul Graham's threshold is 0.9. A probability greater than 0.9 means that the 15 words jointly indicate a more than 90% chance that the email is spam; a probability below 0.9 means that it is treated as normal mail.
With this formula, a normal email will not be classified as spam merely because it contains the word "sex".
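The threshold check can be illustrated with a short Python sketch. The names `SPAM_THRESHOLD` and `is_spam` are hypothetical, the per-word probabilities in the usage note are made-up values, and the joint probability is computed with the product form from Paul Graham's "A Plan for Spam" (Python 3.8 or later for `math.prod`).

```python
import math

SPAM_THRESHOLD = 0.9  # the threshold value quoted in the text


def is_spam(word_probs, threshold=SPAM_THRESHOLD):
    """Return True if the joint spam probability exceeds the threshold.

    word_probs holds the P(S|W) values of the selected words.
    Assumption: no value is exactly 0 or 1.
    """
    spam_side = math.prod(word_probs)
    normal_side = math.prod(1.0 - p for p in word_probs)
    joint = spam_side / (spam_side + normal_side)
    return joint > threshold
```

For instance, `is_spam([0.99, 0.9, 0.8])` is true, whereas `is_spam([0.99, 0.05, 0.05, 0.1])` is false: one word at 0.99 (such as "sex") is outweighed by several words that mostly occur in normal mail.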
Resources
Bayesian Inference and Its Internet Applications (II): Filtering Spam - Nanyi blog
Atitit: the principle of the Bayesian algorithm and the principle of spam classification
Author: nickname: Old Wow's claws (full name: Attilax Akbar Al Rapanui Attilaksachanui)
Chinese name: Etila (Ayron), email: [email protected]
When reprinting, please indicate the source: http://www.cnblogs.com/attilax/