Introduction to Bayesian Algorithms

Source: Internet
Author: User

I. Introduction to Bayes

Bayesian filtering is a probability-based technique named after the mathematician Thomas Bayes. It is widely used to filter spam. A Bayesian filter is a "self-learning" technology: it adapts to spammers' new tricks while protecting legitimate emails. Among smart mail filtering technologies, Bayesian filtering has been highly successful and is increasingly used in anti-spam products.

II. Basic Steps of the Bayesian Filtering Algorithm


1. Collect a large number of spam and non-spam emails, and create a spam set and a non-spam set.

2. Extract the independent strings in the subject and body of each email, such as ABC32 and ¥234, as TOKEN strings, and count the number of times each extracted TOKEN string appears, that is, its word frequency. Process all the emails in the spam set and in the non-spam set this way.

3. Each mail set corresponds to a hash table: hashtable_good corresponds to the non-spam set and hashtable_bad corresponds to the spam set. Each table stores the mapping from TOKEN string to word frequency.

4. Calculate the probability that a TOKEN string appears in each hash table: P = (word frequency of the TOKEN string) / (length of the corresponding hash table)

5. Considering both hashtable_good and hashtable_bad, infer the probability that a new email is spam when a given TOKEN string appears in it. The mathematical expression is:

Event A: the email is spam.

t1, t2, ..., tn denote the TOKEN strings.

P(A | ti) denotes the probability that the email is spam when the TOKEN string ti appears in it.

Set

P1(ti) = (the value of ti in hashtable_good)

P2(ti) = (the value of ti in hashtable_bad)

P(A | ti) = P2(ti) / [P1(ti) + P2(ti)]

6. Create a new hash table, hashtable_probability, to store the mapping from each TOKEN string ti to P(A | ti).

7. At this point, the learning process on the spam and non-spam sets has ended. Based on the hash table hashtable_probability, you can estimate the probability that a new email is spam.
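
The learning steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the author's implementation: the function names tokenize, build_table and train are hypothetical, tokens are assumed to be whitespace-separated, and "length of the hash table" is read as the number of entries in the table.

```python
from collections import Counter


def tokenize(text):
    # Assumption: TOKEN strings are whitespace-separated; the text does not
    # specify how "independent strings" are delimited.
    return text.split()


def build_table(emails):
    """Steps 2-4: count TOKEN frequencies, then divide by the table length."""
    counts = Counter()
    for email in emails:
        counts.update(tokenize(email))
    # Assumption: "length of the hash table" means its number of entries.
    length = len(counts)
    return {token: n / length for token, n in counts.items()}


def train(spam_emails, good_emails):
    """Steps 5-6: build hashtable_probability, mapping ti to P(A | ti)."""
    hashtable_bad = build_table(spam_emails)
    hashtable_good = build_table(good_emails)
    hashtable_probability = {}
    for token in set(hashtable_bad) | set(hashtable_good):
        p1 = hashtable_good.get(token, 0.0)  # P1(ti)
        p2 = hashtable_bad.get(token, 0.0)   # P2(ti)
        hashtable_probability[token] = p2 / (p1 + p2)
    return hashtable_probability


# Hypothetical one-email training sets, for illustration only.
hashtable_probability = train(["win cash now"], ["win the match"])
```

In this toy run, "win" appears equally often in both sets and gets a value of 0.5, while a token seen only in the spam set gets 1 and a token seen only in the non-spam set gets 0.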

When a new email arrives, follow Step 2 to generate its TOKEN strings, then query hashtable_probability to obtain the value stored for each TOKEN string.

Assume that n TOKEN strings t1, t2, ..., tn are obtained from the email and that their corresponding values in hashtable_probability are P1, P2, ..., Pn. P(A | t1, t2, ..., tn) denotes the probability that the email is spam when the TOKEN strings t1, t2, ..., tn all appear in it.

From the compound probability formula:
P(A | t1, t2, ..., tn) = (P1 * P2 * ... * Pn) / [P1 * P2 * ... * Pn + (1-P1) * (1-P2) * ... * (1-Pn)]

When P(A | t1, t2, ..., tn) exceeds a predetermined threshold, the email is judged to be spam.
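
The compound probability formula can be written as a short Python function. This is a sketch under assumptions: the names combined_spam_probability and is_spam, the threshold of 0.9, and the sample probabilities are all hypothetical, and every input is assumed to lie strictly between 0 and 1.

```python
from math import prod


def combined_spam_probability(probs):
    """Combine per-token values P1..Pn into P(A | t1, ..., tn)."""
    spam = prod(probs)                # P1 * P2 * ... * Pn
    ham = prod(1 - p for p in probs)  # (1-P1) * (1-P2) * ... * (1-Pn)
    return spam / (spam + ham)


def is_spam(probs, threshold=0.9):
    # The threshold value is an assumption; the text only says that one exists.
    return combined_spam_probability(probs) > threshold


# Hypothetical per-token values looked up in hashtable_probability.
example = [0.9, 0.8, 0.4]
result = combined_spam_probability(example)  # 0.288 / (0.288 + 0.012)
verdict = is_spam(example)
```

With these sample values the combined probability is about 0.96, which is above the assumed threshold, so the email would be flagged.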

III. Bayesian Filtering Algorithm Example

For example, suppose spam email A contains the TOKEN strings "Method", "Round", and "Merit", and non-spam email B contains the TOKEN strings "Method" and "Law". Generate hashtable_bad from email A. The records in this hash table are:

Method: 1 time

Round: 1 time

Merit: 1 time

Calculated from this table (each value is 1/3, rounded to 0.3):

The probability of "Method" is 0.3.

The probability of "Round" is 0.3.

The probability of "Merit" is 0.3.

Generate hashtable_good from email B. The records in this hash table are:

Method: 1 time

Law: 1 time

Calculated from this table:

The probability of "Method" is 0.5.

The probability of "Law" is 0.5.

Taking the two hash tables together, there are four TOKEN strings in total: "Method", "Round", "Merit", and "Law".

When "Method" appears in an email, the probability of the email being spam is:

P = 0.3/(0.3 + 0.5) = 0.375

When "Round" appears, the probability of the email being spam is:

P = 0.3/(0.3 + 0) = 1

When "Merit" appears, the probability of the email being spam is:

P = 0.3/(0.3 + 0) = 1

When "Law" appears, the probability of the email being spam is:

P = 0/(0 + 0.5) = 0

The third hash table hashtable_probability is obtained, and its data is:

Method: 0.375

Round: 1

Merit: 1

Law: 0
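
The values in this table can be checked with a few lines of Python. The sketch below simply applies P(A | ti) = P2(ti) / [P1(ti) + P2(ti)] to the rounded values from the two tables above; the final lines are an added observation that the exact value 1/3 would give a slightly different result for "Method".

```python
from fractions import Fraction

# Rounded per-token probabilities taken from the example tables.
hashtable_bad = {"Method": 0.3, "Round": 0.3, "Merit": 0.3}
hashtable_good = {"Method": 0.5, "Law": 0.5}

hashtable_probability = {}
for token in ["Method", "Round", "Merit", "Law"]:
    p1 = hashtable_good.get(token, 0.0)  # P1(ti), 0 if the token is absent
    p2 = hashtable_bad.get(token, 0.0)   # P2(ti), 0 if the token is absent
    hashtable_probability[token] = round(p2 / (p1 + p2), 3)

print(hashtable_probability)

# Without rounding 1/3 to 0.3, "Method" would come out slightly higher.
exact_method = Fraction(1, 3) / (Fraction(1, 3) + Fraction(1, 2))
print(exact_method)  # 2/5
```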

When a new email contains "Merit Law", we obtain two TOKEN strings: "Merit" and "Law".

Querying the hash table hashtable_probability gives:

P(SPAM | Merit) = 1

P(SPAM | Law) = 0

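The probability that this email is spam is then:

P = (0*1)/[0*1 + (1-0) * (1-1)] = 0/0

Because the stored values are exactly 1 and 0, both the numerator and the denominator are 0, so the formula is undefined for this input. The two TOKEN strings give exactly opposite evidence, the result cannot exceed any threshold, and the email is therefore not judged to be spam.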
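
Note that with stored values of exactly 1 and 0, both products in the compound formula are 0, so evaluating it literally divides zero by zero. One common workaround, which is an assumption here and is not described in the original, is to clamp each per-token value into a range such as 0.01 to 0.99 before combining. The sketch below uses hypothetical helper names clamp and combined.

```python
from math import prod


def clamp(p, low=0.01, high=0.99):
    # Assumed bounds; they keep every value strictly between 0 and 1.
    return min(high, max(low, p))


def combined(probs):
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)


raw = [1.0, 0.0]  # P(SPAM | Merit), P(SPAM | Law) from the example

try:
    combined(raw)
    raw_is_defined = True
except ZeroDivisionError:
    # 0 / (0 + 0): the unclamped formula has no value for this input.
    raw_is_defined = False

clamped_result = combined([clamp(p) for p in raw])
print(raw_is_defined, round(clamped_result, 3))
```

With clamping, the two opposite pieces of evidence cancel out and the combined value is about 0.5, which stays below any reasonably high spam threshold, so the email is still not classified as spam.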

IV. Summary

Why does the Bayesian filter work so well? Because Bayesian filters work purely on statistical rules, they are much simpler and computationally cheaper than filters that analyze the syntax or content of emails. More importantly, the TOKEN tables are built from the spam and non-spam messages that each user actually receives, so every user obtains a filter unique to them. This means that spammers cannot guess how your filter is configured, which makes it effective at blocking all types of spam.

However, although Bayesian filters are very effective, they still need to be optimized. For example, a filter can be combined with a whitelist to reduce the false positive rate and with a blacklist to reduce the miss rate. It can also use other technologies, such as source address authentication, to become a more accurate spam filter.

 

 

Source: http://www.5dmail.net/html/2006-5-18/2006518234548.htm
