1. Basic Steps of the Bayesian Filtering Algorithm
1) Collect a large number of spam and non-spam emails to build a spam set and a non-spam set.
2) Extract the independent strings in the subject and body of each email, such as abc32 and ¥234, as tokens, and count the number of times each extracted token appears, i.e., its word frequency. Process every email in both the spam set and the non-spam set in this way.
3) Each mail set corresponds to a hash table: hashtable_good for the non-spam set and hashtable_bad for the spam set. Each table stores the mapping between tokens and their word frequencies.
4) Calculate the probability that a token appears in each hash table: P = (word frequency of the token) / (length of the corresponding hash table).
5) Considering hashtable_good and hashtable_bad together, infer the probability that a new email is spam given that a token appears in it. The mathematical expression is:
Let A be the event that the email is spam, and let T1, T2, ..., TN be the tokens.
P(A|Ti) is the probability that the email is spam given that token Ti appears in it.
Set
P1(Ti) = (the value of Ti in hashtable_good)
P2(Ti) = (the value of Ti in hashtable_bad)
Then
P(A|Ti) = P2(Ti) / [P1(Ti) + P2(Ti)]
(Note that the spam-table value P2 goes in the numerator; the worked example in section 2 confirms this.)
6) Create a new hash table, hashtable_probability, to store the mapping between each token Ti and P(A|Ti).
7) At this point, the learning phase on spam and non-spam mail is complete. Using hashtable_probability, you can estimate how likely a new email is to be spam. When a new email arrives, tokenize it as in step 2, then look up each token in hashtable_probability to obtain its value.
Suppose N tokens T1, T2, ..., TN are obtained from the email, and their corresponding values in hashtable_probability are P1, P2, ..., PN. Let P(A|T1, T2, ..., TN) be the probability that the email is spam given that the tokens T1, T2, ..., TN all appear in it. By the compound probability formula:

P(A|T1, T2, ..., TN) = (P1 * P2 * ... * PN) / [P1 * P2 * ... * PN + (1 - P1) * (1 - P2) * ... * (1 - PN)]

When P(A|T1, T2, ..., TN) exceeds a preset threshold, the email is classified as spam.
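The steps above can be sketched in Python. This is a minimal illustration, not the article's own code: the tokenization rule (splitting on alphanumeric runs), the handling of unseen tokens, and the 0.9 threshold are all assumptions, since the article leaves them unspecified.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 2: extract independent alphanumeric strings as tokens.
    # Splitting on \w+ is an assumption; real filters use richer rules.
    return re.findall(r"\w+", text.lower())

def build_table(emails):
    # Steps 3-4: map each token to its word frequency across the mail set.
    table = Counter()
    for mail in emails:
        table.update(tokenize(mail))
    return table

def train(spam_mails, ham_mails):
    # Steps 5-6: build hashtable_probability from the two frequency tables.
    hashtable_bad = build_table(spam_mails)
    hashtable_good = build_table(ham_mails)
    bad_total = sum(hashtable_bad.values())
    good_total = sum(hashtable_good.values())
    hashtable_probability = {}
    for token in set(hashtable_bad) | set(hashtable_good):
        p1 = hashtable_good.get(token, 0) / good_total  # P1(Ti)
        p2 = hashtable_bad.get(token, 0) / bad_total    # P2(Ti)
        hashtable_probability[token] = p2 / (p1 + p2)
    return hashtable_probability

def spam_score(mail, hashtable_probability):
    # Step 7: combine per-token probabilities with the compound formula.
    probs = [hashtable_probability[t] for t in tokenize(mail)
             if t in hashtable_probability]  # unseen tokens skipped (an assumption)
    num = inv = 1.0
    for p in probs:
        num *= p
        inv *= 1.0 - p
    return num / (num + inv) if num + inv else 0.5

THRESHOLD = 0.9  # cutoff value is an assumption

table = train(["cheap pills now", "cheap loans"], ["meeting notes", "project notes"])
print(spam_score("cheap pills", table) > THRESHOLD)  # True
```

Keeping the three hash tables separate mirrors the article's structure; in practice the two frequency tables can be discarded once hashtable_probability is built.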
2. A Worked Example of the Bayesian Filtering Algorithm
For example, take a spam email A containing the tokens "method", "wheel", and "merit", and a non-spam email B containing the tokens "method" and "law". (In the original Chinese article each token is a single character; the translation rendered them inconsistently as "method/Law", "wheel/Round", "merit/Power", and "law", which is why one token, "method", appears in both emails.)
hashtable_bad is generated from email A. The records in this hash table are:
method: 1 occurrence
wheel: 1 occurrence
merit: 1 occurrence
From this table, the per-token probabilities are calculated (each token appears once out of three; the article rounds 1/3 to 0.3):
The probability of "method" is 0.3
The probability of "wheel" is 0.3
The probability of "merit" is 0.3
hashtable_good is generated from email B. The records in this hash table are:
method: 1 occurrence
law: 1 occurrence
From this table:
The probability of "method" is 0.5
The probability of "law" is 0.5
Considering both hash tables, there are four tokens in total: "method", "wheel", "merit", and "law".
When "method" appears in an email, the probability that the email is spam is:
P = 0.3 / (0.3 + 0.5) = 0.375
When "wheel" appears:
P = 0.3 / (0.3 + 0) = 1
When "merit" appears:
P = 0.3 / (0.3 + 0) = 1
When "law" appears:
P = 0 / (0 + 0.5) = 0
These values form the third hash table, hashtable_probability, whose data is:
method: 0.375
wheel: 1
merit: 1
law: 0
When a new email containing the two tokens "merit" and "law" arrives, we query hashtable_probability:
P(A | merit) = 1
P(A | law) = 0
The probability that this email is spam is then:
P = (1 * 0) / [1 * 0 + (1 - 1) * (1 - 0)] = 0
so this email is judged not to be spam. (Strictly speaking, both numerator and denominator are 0 here, so the expression is the indeterminate form 0/0; the article takes it as 0. Practical filters avoid such extremes by clamping per-token probabilities away from exactly 0 and 1.)
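A short script can reproduce the numbers in this worked example. This is an illustrative sketch: the token names stand in for the original single-character tokens, the rounding of 1/3 to 0.3 follows the article, and the clamping shown at the end is a common practical fix that is not part of the article itself.

```python
bad = {"method": 1, "wheel": 1, "merit": 1}  # hashtable_bad from email A
good = {"method": 1, "law": 1}               # hashtable_good from email B

# Per-table probabilities; the article rounds 1/3 to 0.3, so we do too.
p_bad = {t: 0.3 for t in bad}
p_good = {t: 0.5 for t in good}

def spam_prob(token):
    # P(A|Ti) = P2 / (P1 + P2), with P2 taken from the spam table.
    p1 = p_good.get(token, 0.0)
    p2 = p_bad.get(token, 0.0)
    return p2 / (p1 + p2)

hashtable_probability = {t: spam_prob(t) for t in set(bad) | set(good)}
print(round(hashtable_probability["method"], 3))  # 0.375

def combined(probs, clamp=None):
    # Compound probability formula; optional clamping avoids the 0/0 case.
    if clamp is not None:
        lo, hi = clamp
        probs = [min(max(p, lo), hi) for p in probs]
    num = inv = 1.0
    for p in probs:
        num *= p
        inv *= 1.0 - p
    return num / (num + inv) if num + inv else float("nan")

scores = [hashtable_probability[t] for t in ["merit", "law"]]  # new email's tokens
print(combined(scores))                # nan: the raw formula is 0/0 here
print(combined(scores, (0.01, 0.99)))  # ~0.5: with clamping the evidence cancels
```

With clamping, the contradictory tokens (one at probability 0.99, one at 0.01) cancel to a score near 0.5, so the email is genuinely borderline rather than cleanly non-spam, which is why the article's flat "= 0" should be read with care.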