Use Naive Bayes for Spam Classification

Source: Internet
Author: User

Bayes' theorem describes the relationship between conditional probabilities.

In machine learning, Bayes' theorem can be applied to classification problems. This article is based on my own study and works through a spam-classification example to deepen my understanding of the theory.
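
For reference, Bayes' theorem in its standard form, where H is a hypothesis (for example, "this mail is spam") and w is the observed evidence:

P(H | w) = P(w | H) * P(H) / P(w)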


First, let us explain the meaning of the word "naive" here:

1) Each feature is independent of the others, and its appearance does not depend on the order in which the features appear;

2) Each feature is equally important.

The above are relatively strong assumptions.


The process of Naive Bayes classification is as follows:
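
In outline (my own summary of the standard derivation): for a feature vector w = (w1, w2, ..., wn) and a class c (spam or not spam), Bayes' theorem gives

P(c | w) = P(w | c) * P(c) / P(w)

and the naive independence assumption lets us factor the likelihood into per-feature terms:

P(w | c) = P(w1 | c) * P(w2 | c) * ... * P(wn | c)

Since P(w) is the same for every class, we simply pick the class c that maximizes P(w | c) * P(c).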



In this way, we obtain a conditional probability for each feature, which is intuitive. The joint probability of all the features is simply the product of these conditional probabilities, as shown in the formula above. However, the following problems may occur:

1) If a word never appears in the training data, its conditional probability is 0, which forces the entire joint probability to 0.

Laplace smoothing is introduced to handle this: assume that every feature appears at least once in the input samples, that is, add 1 to each feature's observed count. To keep the probabilities normalized, the denominator must also be increased, by the number of distinct feature values M. The smoothed estimate can then be written as:

P(w | H) = (actual number of occurrences + 1) / (total number of occurrences of features + M)
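
To make the formula concrete, here is a minimal sketch in Python (my own illustration; the function name and the numbers are hypothetical, not part of the book code shown later):

def smoothed_prob(count, total, vocab_size):
    # Laplace smoothing: pretend every word was seen one extra time, and
    # enlarge the denominator by M (the number of distinct features) so
    # that the probabilities still sum to 1.
    return (count + 1.0) / (total + vocab_size)

# A word seen 0 times among 100 spam words, with a 1000-word vocabulary,
# gets probability 1/1100 instead of 0:
print(smoothed_prob(0, 100, 1000))   # ~0.000909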


2) Another problem arises when a sample has many features: if each individual feature has a low probability, the product of all the probabilities becomes so small that it underflows on a computer. To avoid this, take the logarithm of the joint probability, which turns the product into a sum:

log(A * B) = log(A) + log(B)

Using this identity, the joint probability above can be converted into a sum of logarithms:

log(P(w1 | c) * P(w2 | c) * ... * P(wn | c)) = log P(w1 | c) + log P(w2 | c) + ... + log P(wn | c)

These are the common problems during the training process. After training, we have a set of these log probabilities for each class. So when a new email arrives,

how can we determine whether it is spam?

Here we discuss how to convert features such as words into numbers that a computer can process easily. The intuitive approach is to build a dictionary (a vector) of the words that frequently appear in known spam. A new mail can then be converted into a vector of the same size as the dictionary: each word that appears in the mail is marked as 1 at its corresponding index, and every other position is marked as 0. The next step is to multiply this vector element-wise by the log probabilities obtained from training and sum the results to score each class.
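
For example (a toy illustration of my own, not taken from the book): if the dictionary is ['stupid', 'dog', 'my', 'garbage'], then the mail "my dog is stupid" becomes the vector [1, 1, 1, 0] (the word 'is' is not in the dictionary and is ignored), while "buy garbage" becomes [0, 0, 0, 1].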


The following Python code is from the book Machine Learning in Action.

from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])                        # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)   # counts start at 1 (Laplace smoothing)
    p0Denom = 2.0; p1Denom = 2.0                     # denominators start at 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)    # take logs to avoid underflow
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # element-wise mult, then sum of logs
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
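
To exercise the code, here is a minimal test run modeled on the book's own testingNB routine (the test phrases are the book's examples; the variable names are mine):

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = [setOfWords2Vec(myVocabList, doc) for doc in listOPosts]
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))

testEntry = ['love', 'my', 'dalmation']
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))   # expect 0

testEntry = ['stupid', 'garbage']
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))   # expect 1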


For more information, refer to the following blog posts:

  • Application of Naive Bayes classifier

  • Bayesian inference and Internet applications: filtering spam






