Basic steps of the Bayesian spam filtering algorithm

Source: Internet
Author: User
1. Basic steps of the Bayesian filtering algorithm

1) Collect a large number of spam and non-spam emails, and build a spam set and a non-spam set.
2) Extract the independent strings in the subject and body of each email, such as abc32 and ¥234, as token strings, and count how many times each token string appears, that is, its word frequency. Process every email in both the spam set and the non-spam set this way.
3) Each mail set has its own hash table: hashtable_good corresponds to the non-spam set and hashtable_bad corresponds to the spam set. Each table stores the mapping from token string to word frequency.
4) Calculate the probability that a token string appears in each hash table: P = (word frequency of the token string) / (length of the corresponding hash table)
5) Considering both hashtable_good and hashtable_bad, infer how likely a new email is to be spam when a given token string appears in it. The mathematical expression is:
Event A: the email is spam;
T1, T2, ..., Tn denote the token strings;
P(A | Ti) is the probability that the email is spam given that the token string Ti appears in it.
Let
P1(Ti) = (the probability of Ti in hashtable_good, from step 4)
P2(Ti) = (the probability of Ti in hashtable_bad, from step 4)
P(A | Ti) = P2(Ti) / [P1(Ti) + P2(Ti)];
6) Create a new hash table, hashtable_probability, to store the mapping from each token string Ti to P(A | Ti).
7) The learning process on spam and non-spam is now complete. Based on hashtable_probability, you can estimate the probability that a new email is spam.
When a new email arrives, generate its token strings as in step 2, then look up each token string in hashtable_probability to obtain its value.
Assume that n token strings T1, T2, ..., Tn are obtained from the email, and that their values in hashtable_probability are P1, P2, ..., Pn (that is, Pi = P(A | Ti)).
P(A | T1, T2, ..., Tn) is the probability that the email is spam given that the token strings T1, T2, ..., Tn all appear in it.
From the compound probability formula:
P(A | T1, T2, ..., Tn) = (P1 * P2 * ... * Pn) / [P1 * P2 * ... * Pn + (1 - P1) * (1 - P2) * ... * (1 - Pn)]
When P(A | T1, T2, ..., Tn) exceeds a preset threshold, the email is judged to be spam.
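
The seven steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the whitespace tokenizer, the clamping floor of 0.01, and the neutral value 0.5 for emails with no known tokens are all choices made here for the sketch. The "length of the hash table" is read as the number of entries in the table, which is how the worked example in this article computes it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a token is any run of non-whitespace characters, lowercased.
    return re.findall(r"\S+", text.lower())


def token_probabilities(mails):
    # Steps 2-4: count word frequencies, then divide each count by the
    # number of entries in the table (the "length of the hash table").
    counts = Counter(tok for mail in mails for tok in tokenize(mail))
    return {tok: n / len(counts) for tok, n in counts.items()}


def train(good_mails, bad_mails):
    # Steps 5-6: build hashtable_probability with
    # P(A | Ti) = P2(Ti) / (P1(Ti) + P2(Ti)),
    # where P1 comes from the non-spam table and P2 from the spam table.
    p_good = token_probabilities(good_mails)
    p_bad = token_probabilities(bad_mails)
    table = {}
    for tok in p_good.keys() | p_bad.keys():
        p1 = p_good.get(tok, 0.0)
        p2 = p_bad.get(tok, 0.0)
        table[tok] = p2 / (p1 + p2)
    return table


def spam_probability(table, mail, floor=0.01):
    # Step 7: combine the values of the tokens that the table knows about.
    # Assumption: values are clamped to [floor, 1 - floor] so the
    # denominator of the combination formula can never be zero.
    probs = [min(max(table[tok], floor), 1 - floor)
             for tok in set(tokenize(mail)) if tok in table]
    if not probs:
        return 0.5  # assumption: no known tokens means no evidence either way
    spam = math.prod(probs)
    ham = math.prod(1 - p for p in probs)
    return spam / (spam + ham)


# Hypothetical training data, for illustration only.
table = train(good_mails=["hello friend meeting"],
              bad_mails=["win money friend"])
threshold = 0.9  # assumption: the article does not fix a threshold value
print(spam_probability(table, "win money") > threshold)
```

Clamping is not part of the steps as written; it is added here because a token seen in only one of the two sets gets a value of exactly 0 or 1, and mixing such tokens makes the combination formula divide by zero.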

2. Bayesian filtering algorithm example

For example, take a spam email A containing the three tokens "method wheel merit"
and a non-spam email B containing the two tokens "method law".
Generate hashtable_bad from email A. The records in the hash table are:
method: 1
wheel: 1
merit: 1
Calculated from this table (each value is 1/3, rounded to 0.3):
The probability that "method" appears is 0.3
The probability that "wheel" appears is 0.3
The probability that "merit" appears is 0.3
Generate hashtable_good from email B. The records in this hash table are:
method: 1
law: 1
Calculated from this table:
The probability that "method" appears is 0.5
The probability that "law" appears is 0.5
Considering the two hash tables together, there are four token strings in total: method, wheel, merit, law.
When "method" appears in an email, the probability that the email is spam is:
P = 0.3 / (0.3 + 0.5) = 0.375
When "wheel" appears:
P = 0.3 / (0.3 + 0) = 1
When "merit" appears:
P = 0.3 / (0.3 + 0) = 1
When "law" appears:
P = 0 / (0 + 0.5) = 0
This gives the third hash table, hashtable_probability. Its data is:
method: 0.375
wheel: 1
merit: 1
law: 0
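
The table above can be reproduced with a few lines of Python. This is only a sketch: the token names are English labels for the four tokens of the example, and the input values are the rounded per-table probabilities used in this article (1/3 written as 0.3). With the unrounded 1/3, the value for "method" would come out as 0.4 rather than 0.375.

```python
# Per-table probabilities as rounded in the article.
p_bad = {"method": 0.3, "wheel": 0.3, "merit": 0.3}  # hashtable_bad (email A)
p_good = {"method": 0.5, "law": 0.5}                 # hashtable_good (email B)

hashtable_probability = {}
for token in sorted(p_good.keys() | p_bad.keys()):
    bad = p_bad.get(token, 0.0)    # a token missing from a table counts as 0
    good = p_good.get(token, 0.0)
    # P(spam | token) = P_bad / (P_good + P_bad)
    hashtable_probability[token] = bad / (good + bad)

for token, p in hashtable_probability.items():
    print(f"{token}: {p:.3f}")
```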

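When a new email contains "merit law", we get two token strings: "merit" and "law".
Query the hash table hashtable_probability:
P(spam | merit) = 1
P(spam | law) = 0
The probability that this email is spam is then:
P = (1 * 0) / [1 * 0 + (1 - 1) * (1 - 0)] = 0 / 0
Taken literally this is undefined, because the two tokens give opposite and fully certain evidence. A common remedy is to clamp each token probability away from 0 and 1, for example to 0.99 and 0.01, which here gives P = 0.5. That is below a typical spam threshold such as 0.9, so this email is not judged to be spam.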
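
The combination step for this email can be checked with the short Python sketch below. With the table values 1 and 0 used as-is, both the numerator and the denominator of the formula are zero, so Python raises ZeroDivisionError. The `clamp` helper and its floor of 0.01 are assumptions added for illustration; the steps in this article do not specify how to handle values of exactly 0 or 1.

```python
import math


def combine(probs):
    # P = (P1 * ... * Pn) / (P1 * ... * Pn + (1 - P1) * ... * (1 - Pn))
    spam = math.prod(probs)
    ham = math.prod(1 - p for p in probs)
    return spam / (spam + ham)


def clamp(p, floor=0.01):
    # Assumption: keep every probability inside [floor, 1 - floor].
    return min(max(p, floor), 1 - floor)


raw = [1.0, 0.0]  # P(spam | merit), P(spam | law)

try:
    combine(raw)
except ZeroDivisionError:
    print("raw table values give 0/0")

# After clamping, the two tokens cancel out and the result is about 0.5.
print(round(combine([clamp(p) for p in raw]), 3))
```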
