Use Naive Bayes for Spam Classification

Last Update: 2014-10-09 Source: Internet

Author: User

Bayes' theorem describes the relationship between conditional probabilities.

In machine learning, Bayes' theorem can be applied to classification problems. This article is based on my own study and uses spam classification as an example to deepen my understanding of the theory.

Here we explain what the word "naive" means:

1) The features are independent of one another, and the order in which they appear does not matter;

2) Each feature is equally important.

These are relatively strong assumptions.

The process of Naive Bayes classification is as follows:
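A standard way to write this process is sketched below. The notation is an assumption made here: H denotes the class (spam or not spam) and w_1, ..., w_n denote the word features of one message.

```latex
P(H \mid w_1, \dots, w_n)
  = \frac{P(H)\, P(w_1, \dots, w_n \mid H)}{P(w_1, \dots, w_n)}
  \propto P(H) \prod_{i=1}^{n} P(w_i \mid H)
```

The message is assigned to the class H for which the right-hand side is largest. The denominator is the same for every class, so it can be ignored when comparing classes.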

Training gives us the conditional probability of each feature, which is intuitive. Under the independence assumption, the joint probability of all features is the product of their conditional probabilities, as shown in the preceding formula. However, the following problems may occur:

1) If a word never appears in the training messages of a class, its conditional probability is 0, and the overall joint probability becomes 0.

Laplace smoothing is introduced to handle this: assume that each feature appears at least once in the training samples, so that every count starts at 1. When the probability of a feature is computed, the denominator must be increased as well. In general, with M categories, this can be expressed as the following formula:

P(w | H) = (actual number of occurrences + 1) / (total number of occurrences of features + M)
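A minimal sketch of this smoothing, under the assumption that M is taken to be the number of distinct words in the dictionary (a common choice; the code later in this article initializes the denominator to 2.0 instead). The function name `smoothed_probs` and the sample words are hypothetical.

```python
from collections import Counter


def smoothed_probs(tokens, vocabulary):
    """Estimate P(w | H) for every dictionary word with add-one smoothing."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    m = len(vocabulary)  # assumption: M = number of distinct dictionary words
    return {w: (counts[w] + 1) / (total + m) for w in vocabulary}


# Hypothetical words seen in the training messages of one class.
spam_tokens = ["buy", "cheap", "buy", "now"]
vocabulary = ["buy", "cheap", "now", "hello"]

probs = smoothed_probs(spam_tokens, vocabulary)
print(probs["hello"])  # unseen word still gets a non-zero probability
```

With this choice of M the smoothed probabilities still sum to 1 over the dictionary, and a word that was never seen in the class no longer forces the joint probability to 0.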

2) If a sample has many features and each of them has a low probability, then the product that forms the joint probability becomes extremely small and may underflow on a computer. To avoid this, you can take the logarithm of the joint probability:

log(A * B) = log(A) + log(B)

With this identity, the product of probabilities is converted into a sum of logarithms.
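For the Naive Bayes product this gives log P(H) + log P(w1 | H) + ... + log P(wn | H), where w1, ..., wn are the features of one message (notation assumed here). The sketch below uses made-up probabilities to illustrate why the sum is safer than the product in floating-point arithmetic.

```python
import math

# Hypothetical conditional probabilities for a message with 100 words.
word_probs = [1e-5] * 100

# Direct product: the true value (1e-500) is too small for a double.
product = 1.0
for p in word_probs:
    product *= p

# Sum of logarithms: the same quantity on a log scale stays finite.
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Because the logarithm is monotonically increasing, comparing log scores between classes picks the same class as comparing the original products would.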

These are common problems during training. After training, we have one such probability for every feature. When a new email arrives, how do we determine whether it is spam?

Here we discuss how to convert features such as words into numbers that a computer can process easily. An intuitive approach is to build a dictionary (a vector) of words that often appear in known spam. A new email can then be converted into a vector of the same size as the dictionary: each word that appears in the email is marked '1' at its index, and every other position is marked '0'. The next step is to multiply this vector element-wise by the log probabilities obtained from training, sum the results, add the logarithm of the class prior, and choose the class with the larger total.
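The steps above can be sketched as follows. This is only an illustration: the dictionary, the probabilities, and the helper name `to_vector` are hypothetical, not values produced by real training.

```python
import numpy as np

vocabulary = ["buy", "cheap", "hello", "meeting"]  # hypothetical dictionary


def to_vector(words, vocabulary):
    """Mark each dictionary word 1 if it occurs in the message, else 0."""
    present = set(words)
    return np.array([1 if w in present else 0 for w in vocabulary])


# Hypothetical log conditional probabilities, one entry per dictionary word.
log_p_spam = np.log(np.array([0.4, 0.4, 0.1, 0.1]))
log_p_ham = np.log(np.array([0.1, 0.1, 0.4, 0.4]))
log_prior_spam = np.log(0.5)
log_prior_ham = np.log(0.5)

# "pills" is not in the dictionary, so it is simply ignored.
vec = to_vector(["buy", "cheap", "pills"], vocabulary)

# Element-wise product keeps only the log probabilities of words present.
score_spam = np.sum(vec * log_p_spam) + log_prior_spam
score_ham = np.sum(vec * log_p_ham) + log_prior_ham
is_spam = score_spam > score_ham

print(vec, is_spam)
```

Multiplying by the 0/1 vector is just a compact way of selecting the log probabilities of the words that actually occur in the message before summing them.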

The following Python code is from the book Machine Learning in Action.

    import numpy as np


    def loadDataSet():
        """Return six tokenized posts and their class labels."""
        postingList = [
            ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'lick', 'ate', 'my', 'steak', 'who', 'to', 'stop', 'him'],
            ['quit', 'bucket', 'worthless', 'dog', 'food', 'stupid'],
        ]
        classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
        return postingList, classVec


    def createVocabList(dataSet):
        """Build the dictionary: every distinct word in the data set."""
        vocabSet = set()  # create empty set
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # union of the two sets
        return list(vocabSet)


    def setOfWords2Vec(vocabList, inputSet):
        """Convert a list of words into a 0/1 vector over the dictionary."""
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else:
                print("the word: %s is not in my vocabulary!" % word)
        return returnVec


    def trainNB0(trainMatrix, trainCategory):
        """Estimate log P(w | class) for both classes and the prior P(class 1)."""
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = sum(trainCategory) / float(numTrainDocs)
        # Laplace smoothing: counts start at 1, denominators at 2.0
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        # Take logarithms to avoid underflow when probabilities are combined
        p1Vect = np.log(p1Num / p1Denom)
        p0Vect = np.log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive


    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
        """Return 1 if class 1 has the higher log score; vec2Classify is a NumPy array."""
        p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)  # element-wise mult
        p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
        if p1 > p0:
            return 1
        else:
            return 0

For more information, refer to the following blog posts:

Application of the Naive Bayes classifier

Bayesian inference and Internet applications: filtering spam

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
