Atitit: the principle of the Bayesian algorithm and the principle of spam classification

Source: Internet
Author: User


1.1. Given event B, the probability that event A occurs is P(A∩B) divided by P(B)

1.2. The earliest spam detection method: a contains check on a single keyword, judged with 100% probability

1.3. The series law for components

1.4. Visualizing Bayes' law with a table of spam keywords (series law)

1.5. The final calculation formula

1.6. A threshold for comparison: Paul Graham's threshold of 0.9

1.7. F1 and F2 are continuous variables, so computing a probability for one particular value is not appropriate

1.1. Given that event B has occurred, the probability that event A occurs is P(A∩B) divided by P(B), that is, P(A|B) = P(A∩B) / P(B).

1.2. The earliest spam detection method: a contains check on a single keyword, judged with 100% probability

That approach is clearly inadequate. With a probabilistic algorithm, a single spam word such as "invoice" leads to a spam judgment with a probability of 90%. If a second spam word such as "purchase" also appears, the probability rises to ninety-something percent.
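The difference between the two approaches can be sketched in Python. This is only an illustration: the function names and the lookup table are hypothetical, and the only number taken from the text is the 90% for "invoice".

```python
def contains_rule(text: str, keyword: str = "invoice") -> bool:
    """Old rule: a single keyword, and a hit is treated as 100% spam."""
    return keyword in text.lower()


# Hypothetical per-word spam probabilities; 0.9 for "invoice" follows the text.
WORD_SPAM_PROBABILITY = {"invoice": 0.9}


def word_probabilities(text: str) -> list[float]:
    """Probabilistic rule: each known spam word contributes a probability,
    not a final verdict."""
    words = text.lower().split()
    return [WORD_SPAM_PROBABILITY[w] for w in words if w in WORD_SPAM_PROBABILITY]


message = "please find the invoice attached"
print(contains_rule(message))        # True: the old rule already calls it spam
print(word_probabilities(message))   # [0.9]: evidence to be combined later
```

The old rule returns a yes/no answer from one word, while the probabilistic rule only collects evidence that can be combined with the evidence from other words.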

1.3. The series law for components

When a single component has a reliability of 70%, two such components connected in series have a lower reliability: 70% * 70% = 49%.

The law of parallel components works the other way: the calculation method that follows increases the reliability, by a specific number of percentage points.
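As a minimal sketch of both laws, the code below uses the standard formulas for independent components. The text gives only the series figure; the parallel formula (one minus the probability that every component fails) and the function names are assumptions added for illustration.

```python
import math


def series_reliability(reliabilities):
    """Series: every component must work, so the reliabilities multiply."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel (assumed independent): the system fails only if all fail."""
    return 1 - math.prod(1 - r for r in reliabilities)


print(round(series_reliability([0.7, 0.7]), 2))    # 0.49, as in the text
print(round(parallel_reliability([0.7, 0.7]), 2))  # 0.91
```

Two 70% components in series drop to about 49%, while the same two components in parallel rise to about 91%.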

1.4. Visualizing Bayes' law with a table of spam keywords (series law)

Suppose that when the word "invoice" appears, the probability that the message is spam is 90%.

When the word "buy" appears, the probability that the message is spam is 80%.

This gives the table below:

| Vocabulary | Spam probability | Normal message probability |
|---|---|---|
| Invoice | 90% | 10% |
| Buy | 80% | 20% |
| Buy + Invoice | 90% * 80% = 72% (incorrect, discard this result) | 10% * 20% = 2% |
| Buy + Invoice | 1 - 2% = 98% (spam probability calculated backwards from the normal message probability) | 10% * 20% = 2% |

Table explanation, rule by rule:

1. If only the word "invoice" appears, the spam probability is 90%, so the normal message probability is 1 - 90% = 10%.

2. If only the word "buy" appears, the spam probability is 80%, so the normal message probability is 1 - 80% = 20%.

3. If both "buy" and "invoice" appear, a first attempt gives a spam probability of 90% * 80% = 72%, and a normal message probability of 10% * 20% = 2%. When more than one spam keyword is present, the spam probability should rise, not fall, so the 72% result is wrong and is discarded.

4. Instead, start from the normal message probability of 2% and work backwards: the spam probability is 1 - 2% = 98%.
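The last two rows of the table can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function name `table_row` is made up for this example.

```python
def table_row(spam_probs):
    """Return (naive spam product, normal product, spam via complement)."""
    naive_spam = 1.0
    normal = 1.0
    for p in spam_probs:
        naive_spam *= p      # discarded approach: shrinks as words are added
        normal *= (1 - p)    # normal message probability, as in the table
    return naive_spam, normal, 1 - normal


naive, normal, spam = table_row([0.9, 0.8])
print(f"{naive:.2f} {normal:.2f} {spam:.2f}")  # prints 0.72 0.02 0.98
```

Multiplying the spam probabilities directly makes the result smaller with every added keyword, which is why the table derives the spam probability from the normal message probability instead.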

1.5. The final calculation formula

By extending the above equation to 15 words, the final probability formula is obtained:

P = 1 - (1 - P1) * (1 - P2) * ... * (1 - P15)

Use this formula to calculate the probability that an e-mail is spam.
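The formula generalizes naturally to any number of words. The function below is a hedged sketch of that generalization; the name `combined_spam_probability` and the range check are assumptions, not part of the original description.

```python
import math


def combined_spam_probability(word_probs):
    """P = 1 - (1 - P1) * (1 - P2) * ... * (1 - Pn)."""
    for p in word_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    # math.prod of an empty sequence is 1, so no words gives P = 0.
    return 1 - math.prod(1 - p for p in word_probs)


print(round(combined_spam_probability([0.9, 0.8]), 2))  # 0.98
```

Because every factor (1 - Pi) is at most 1, each additional word with a nonzero probability can only raise the result of this formula, never lower it.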

1.6. A threshold for comparison

At this point we also need a threshold for comparison. Paul Graham's threshold is 0.9. A probability greater than 0.9 indicates that, judged jointly by the 15 words, the e-mail is more than 90% likely to be spam; a probability less than 0.9 indicates normal mail.
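A minimal sketch of the threshold comparison follows. The names `SPAM_THRESHOLD` and `is_spam` are hypothetical, and treating a probability of exactly 0.9 as normal mail is an assumption, since the text does not say which way that case goes.

```python
import math

SPAM_THRESHOLD = 0.9  # Paul Graham's threshold, as cited in the text


def is_spam(word_probs, threshold=SPAM_THRESHOLD):
    """Combine the word probabilities and compare against the threshold."""
    combined = 1 - math.prod(1 - p for p in word_probs)
    # Assumption: a value exactly equal to the threshold counts as normal mail.
    return combined > threshold


print(is_spam([0.9, 0.8]))  # True: about 0.98 is above 0.9
print(is_spam([0.5]))       # False: 0.5 is below 0.9
```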

with this formula, a normal letter, even if the word sex is present,

1.7. F1 and F2 are continuous variables, so computing a probability for one particular value is not appropriate

There is a problem here: F1 and F2 are continuous variables, so it is not appropriate to calculate a probability for one particular value.

One technique is to convert the continuous value into discrete values and calculate the probability of an interval. For example, F1 can be split into three intervals, [0, 0.05], (0.05, 0.2), and [0.2, +∞), and the probability of each interval is then calculated. In our example, F1 equals 0.1 and falls in the second interval, so the probability of the second interval is used in the calculation.
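The interval technique can be sketched as follows. The three intervals come from the text; the sample values, the function names, and the use of relative frequency as the interval probability are assumptions made for illustration.

```python
from collections import Counter


def f1_interval(f1):
    """Map F1 to the index of one of the three intervals named in the text."""
    if f1 < 0:
        raise ValueError("F1 is assumed to be non-negative")
    if f1 <= 0.05:
        return 0  # [0, 0.05]
    if f1 < 0.2:
        return 1  # (0.05, 0.2)
    return 2      # [0.2, +inf)


def interval_probabilities(samples):
    """Estimate each interval's probability as its relative frequency."""
    counts = Counter(f1_interval(x) for x in samples)
    return {i: counts[i] / len(samples) for i in range(3)}


# Hypothetical training values for F1, invented for this example only.
samples = [0.01, 0.03, 0.1, 0.15, 0.18, 0.3, 0.5, 0.04]
probs = interval_probabilities(samples)
print(f1_interval(0.1))         # 1: the second interval
print(probs[f1_interval(0.1)])  # 0.375: 3 of the 8 samples fall there
```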

Resources

Application of the naive Bayesian classifier - Nanyi blog

Author: nickname: Old Wow's claws (full name: AttilaxAkbar Al Rapanui Attilaksachanui)

Kanji name: Etila (Ayron)

When reprinting, please indicate the source: http://www.cnblogs.com/attilax/
