Filtering spam with a Bayesian filter

What is a Bayesian filter?

Spam is a headache that bothers all Internet users.

It is very difficult to identify spam correctly. Traditional spam filtering methods include the keyword method and the checksum method. The former filters on specific words; the latter computes a checksum of the message text and compares it with those of known spam. Both perform poorly and are easy to evade.

In 2002, Paul Graham proposed using Bayesian inference to filter spam. He reported remarkable results: out of every 1000 spam messages, 995 were filtered out, with no false positives.

In addition, the filter is self-learning: it keeps adjusting itself based on the new emails it receives. The more spam it receives, the higher its accuracy.

Create a historical database

A Bayesian filter is a statistical filter built on existing statistics. We must therefore provide two sets of emails that have been classified in advance: one of normal mail and one of spam.

We use these two sets of emails to "train" the filter. The larger the two sets, the better the training. Paul Graham used 4000 normal emails and 4000 spam emails.

The training process is simple. First, parse all the emails and extract every word. Then, calculate how frequently each word occurs in normal mail and in spam. For example, if 200 of the 4000 spam emails contain the word "sex", its frequency in spam is 5%. If only 2 of the 4000 normal emails contain the word, its frequency in normal mail is 0.05%. ([Note] If a word appears only in spam, Paul Graham assumes that its frequency in normal mail is 1%, and vice versa. This avoids a probability of 0. The result adjusts automatically as the number of emails grows.)
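The training step can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not Paul Graham's actual code: the function names `tokenize`, `word_frequencies` and `train`, the whitespace tokenizer, and the toy emails are all made up for the example. Only the 1% floor comes from the note above.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, and keep
    # each word once per email (we count emails, not occurrences).
    return set(text.lower().split())

def word_frequencies(mails):
    # Fraction of emails in which each word appears at least once.
    counts = Counter()
    for mail in mails:
        counts.update(tokenize(mail))
    return {word: n / len(mails) for word, n in counts.items()}

def train(spam_mails, normal_mails, floor=0.01):
    # Returns {word: (frequency in spam, frequency in normal mail)}.
    spam_freq = word_frequencies(spam_mails)
    normal_freq = word_frequencies(normal_mails)
    table = {}
    for word in set(spam_freq) | set(normal_freq):
        # A word seen in only one group gets the 1% floor in the other,
        # so that no frequency is exactly 0.
        table[word] = (spam_freq.get(word, floor),
                       normal_freq.get(word, floor))
    return table

# Toy training data (illustrative only).
spam = ["cheap sex pills", "win money now", "cheap money", "hello friend"]
normal = ["meeting at noon", "hello friend", "lunch money", "project notes"]
table = train(spam, normal)
print(table["sex"])    # frequency in spam, floored frequency in normal mail
```

In this toy data "sex" appears in one of four spam emails and in no normal email, so it gets a spam frequency of 0.25 and the floored normal frequency of 0.01.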

With this preliminary statistical result, the filter can be put into use.

Using the Bayesian filter

Now suppose we receive a new email. Before any statistical analysis, we assume that the probability of it being spam is 50%. ([Note] Studies show that 80% of the emails users receive are spam. Nevertheless, we still assume here that the "prior probability" of spam is 50%.)

We use S to denote spam and H to denote normal mail (healthy). The prior probabilities P(S) and P(H) are therefore both 50%.

We then parse the email and find that it contains the word "sex". How likely is this email to be spam?

We use W to represent the word "sex", so the question becomes how to calculate P(S|W): given that the word (W) is present, what is the probability that the email is spam (S)?

According to the conditional probability formula (Bayes' theorem), we can immediately write:

$$P(S|W) = \frac{P(W|S)\,P(S)}{P(W|S)\,P(S) + P(W|H)\,P(H)}$$

Here P(W|S) and P(W|H) are the probabilities that the word occurs in spam and in normal mail respectively. Both values come from the historical database; for "sex", we assume they are 5% and 0.05%. In addition, P(S) and P(H) are both 50%. The value of P(S|W) can therefore be calculated directly:

$$P(S|W) = \frac{5\% \times 50\%}{5\% \times 50\% + 0.05\% \times 50\%} \approx 99.0\%$$

The probability that this new email is spam is therefore 99%. This shows that the word "sex" has strong inferential power: in one step it raises the 50% "prior probability" to a 99% "posterior probability".
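The same calculation can be written as a short Python function. This is a sketch for illustration; the function name `spam_probability_given_word` and its parameter names are assumptions, and the numbers are the ones used in the text.

```python
def spam_probability_given_word(p_w_spam, p_w_normal, p_spam=0.5):
    # Bayes' theorem:
    #   P(S|W) = P(W|S) P(S) / (P(W|S) P(S) + P(W|H) P(H))
    # with P(H) = 1 - P(S).
    p_normal = 1.0 - p_spam
    numerator = p_w_spam * p_spam
    return numerator / (numerator + p_w_normal * p_normal)

# "sex": 5% of spam emails, 0.05% of normal emails, prior of 50%.
p = spam_probability_given_word(0.05, 0.0005)
print(round(p, 4))  # 0.9901
```

The prior is kept as a parameter so that a value other than 50% can be tried, although the text uses 50% throughout.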

Calculation of joint probability

After completing the steps above, can we conclude that this new email is spam?

The answer is no. An email contains many words; some (such as "sex") suggest that it is spam, while others suggest that it is not. How do we know which words should prevail?

Paul Graham's approach is to pick the 15 words in the email with the highest P(S|W) and calculate their joint probability. ([Note] If a word appears for the first time, so that P(S|W) cannot be calculated, Paul Graham assumes a value of 0.4. Because spam tends to reuse a fixed vocabulary, a word that has never been seen before is more likely to be a normal word.)

The joint probability here means the probability of some other event given that several events have all occurred. For example, if W1 and W2 are two different words and both appear in an email, the probability that this email is spam is the joint probability.

Given that W1 and W2 have both occurred, there are only two possible outcomes: the email is spam (event E1) or it is normal mail (event E2).
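The selection step described above can be sketched as follows. The function name `top_word_probabilities`, the dictionary `word_probs` and the sample values are hypothetical; the ranking by highest P(S|W) and the 0.4 default for unseen words follow the text.

```python
def top_word_probabilities(words, word_probs, n=15, unseen=0.4):
    # Look up P(S|W) for each distinct word in the email;
    # words never seen in training default to 0.4.
    probs = {w: word_probs.get(w, unseen) for w in set(words)}
    # Keep the n words with the highest P(S|W).
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Hypothetical P(S|W) values from the historical database.
word_probs = {"sex": 0.99, "free": 0.9, "meeting": 0.05}
mail_words = ["free", "sex", "meeting", "zebra", "free"]

selected = top_word_probabilities(mail_words, word_probs, n=3)
print(selected)  # [('sex', 0.99), ('free', 0.9), ('zebra', 0.4)]
```

With `n=3` the unseen word "zebra" (0.4) is kept ahead of "meeting" (0.05); with the default `n=15` all four distinct words would be returned.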

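The probabilities associated with W1, W2, and spam in each event are as follows:

Event    W1             W2             Spam
E1       P(S|W1)        P(S|W2)        P(S)
E2       1 - P(S|W1)    1 - P(S|W2)    1 - P(S)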

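If we assume that all events are independent ([Note] strictly speaking, this assumption does not hold, but the error can be ignored here), then P(E1) and P(E2) can be calculated:

$$P(E_1) = P(S|W_1)\,P(S|W_2)\,P(S)$$

$$P(E_2) = (1 - P(S|W_1))\,(1 - P(S|W_2))\,(1 - P(S))$$

Since W1 and W2 have already occurred, the probability that the email is spam is:

$$P = \frac{P(E_1)}{P(E_1) + P(E_2)}$$

That is,

$$P = \frac{P(S|W_1)\,P(S|W_2)\,P(S)}{P(S|W_1)\,P(S|W_2)\,P(S) + (1 - P(S|W_1))\,(1 - P(S|W_2))\,(1 - P(S))}$$

Substituting P(S) = 0.5 gives

$$P = \frac{P(S|W_1)\,P(S|W_2)}{P(S|W_1)\,P(S|W_2) + (1 - P(S|W_1))\,(1 - P(S|W_2))}$$

Writing P(S|W1) as P1 and P(S|W2) as P2, the formula becomes

$$P = \frac{P_1 P_2}{P_1 P_2 + (1 - P_1)(1 - P_2)}$$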

This is the formula for calculating the joint probability.
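As a quick numerical check of the two-word formula P = P1 P2 / (P1 P2 + (1 - P1)(1 - P2)), take P1 = 0.99 and P2 = 0.4. These values are chosen purely for illustration: a strongly spam-like word together with a previously unseen word.

```latex
P = \frac{0.99 \times 0.4}{0.99 \times 0.4 + (1 - 0.99)(1 - 0.4)}
  = \frac{0.396}{0.396 + 0.006}
  \approx 0.985
```

The mildly "normal" second word pulls the result only slightly below the 0.99 of the first word.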

Final calculation formula

Extending the formula above to 15 words gives the final probability formula:

$$P = \frac{P_1 P_2 \cdots P_{15}}{P_1 P_2 \cdots P_{15} + (1 - P_1)(1 - P_2) \cdots (1 - P_{15})}$$

Whether an email is spam is decided with this formula. We also need a threshold for comparison. Paul Graham's threshold is 0.9: a probability greater than 0.9 means that the 15 words jointly judge the email to be spam with more than 90% likelihood; a probability below 0.9 means that it is normal mail.

With this formula, a normal email will not be classified as spam merely because the word "sex" appears in it.
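The final calculation and the threshold comparison can be sketched in Python. The function names and the probability lists below are hypothetical, and `math.prod` requires Python 3.8 or later. The sketch assumes no P(S|W) is exactly 0 or 1, since a list containing both would make the denominator zero.

```python
import math

def combined_spam_probability(probs):
    # P = (P1 * ... * Pn) / (P1 * ... * Pn + (1 - P1) * ... * (1 - Pn))
    spam_product = math.prod(probs)
    normal_product = math.prod(1.0 - p for p in probs)
    return spam_product / (spam_product + normal_product)

def is_spam(probs, threshold=0.9):
    # Classify as spam only when the joint probability exceeds the threshold.
    return combined_spam_probability(probs) > threshold

# Hypothetical P(S|W) values for the selected words of two emails.
spammy = [0.99, 0.9, 0.8, 0.4]
mostly_normal = [0.99, 0.05, 0.1, 0.2, 0.4]

print(is_spam(spammy))         # True
print(is_spam(mostly_normal))  # False
```

In the second list a single strongly spam-like word (0.99) is outweighed by several words that are typical of normal mail, so the joint probability stays far below 0.9.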

Reference:

http://www.ruanyifeng.com/blog/2011/08/bayesian_inference_part_two.html
