Bayes' theorem describes the relationship between conditional probabilities: P(H | D) = P(D | H) * P(H) / P(D).
In machine learning, Bayes' theorem can be applied to classification problems. This article is based on my own learning and uses spam classification as an example to deepen my understanding of the theory.
First, let us explain the meaning of the word "naive":
1) Each feature is conditionally independent of the others, and its appearance does not depend on the order in which the features occur;
2) Each feature is equally important.
The above are relatively strong assumptions.
The process of Naive Bayes classification is as follows: for a sample with features w1, w2, ..., wn and a class H, Bayes' theorem together with the independence assumption gives

P(H | w1, w2, ..., wn) ∝ P(H) * P(w1 | H) * P(w2 | H) * ... * P(wn | H)

In this way, we obtain a conditional probability for each feature, which is intuitive. Under the independence assumption, the joint probability of all the features is simply the product of these conditional probabilities, as the formula above shows. However, the following problems may occur:
1) If a word never appears in the training samples of a class, its conditional probability is 0, and the overall joint probability becomes 0.
Laplace smoothing is introduced here: assume that every feature appears at least once in each class, i.e. add 1 to every count. Accordingly, the denominator must also be enlarged; with M distinct feature values (for word features, the vocabulary size), the smoothed probability is
P(w | H) = (actual number of occurrences + 1) / (total number of feature occurrences + M)
(see the code sketch after this list).
2) Another problem arises when a sample has many features. If each individual feature probability is small, their product, the joint probability, becomes extremely small and may underflow on the computer. To avoid this, take the logarithm of the joint probability and use the identity
log(A * B) = log(A) + log(B)
so the formula above becomes a sum:
log(P(H) * P(w1 | H) * ... * P(wn | H)) = log P(H) + log P(w1 | H) + ... + log P(wn | H)
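Here is a minimal Python sketch of both fixes; smoothed_log_probs and the toy counts are my own names for illustration, not from the book:

import numpy as np

def smoothed_log_probs(word_counts):
    # word_counts: raw occurrence count of each word within one class
    M = len(word_counts)                       # number of distinct features
    # Laplace smoothing: +1 in the numerator, +M in the denominator
    probs = (word_counts + 1.0) / (word_counts.sum() + M)
    # log space: the product of many small probabilities becomes a sum,
    # which does not underflow
    return np.log(probs)

counts = np.array([3, 0, 7, 1])    # toy counts for a 4-word vocabulary
print(smoothed_log_probs(counts))  # the zero-count word now gets log(1/15), not -inf

Note that the book code below uses a simplified variant of this smoothing: it initializes every count to 1 and the denominator to 2.0.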
These are common problems during training. After training, we have all of these log conditional probabilities. So when a new email arrives, how do we determine whether it is spam?
Here we discuss how to convert features like words into numbers that a computer can process easily. The intuitive approach is to build a dictionary (vocabulary) of words that often appear in known spam. A new email can then be converted into a vector of the same size as the dictionary: words that appear are marked as 1 at the corresponding index, and the rest are marked as 0. The next step is to multiply this vector element-wise with the log probabilities obtained during training and sum the results.
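For example, with a toy dictionary of my own:

dictionary: ['cheap', 'meds', 'hello', 'meeting']
email "hello cheap meds"  ->  vector [1, 1, 1, 0]

The spam score of this email is then log P(cheap | spam) + log P(meds | spam) + log P(hello | spam) + log P(spam); compute the analogous score for the non-spam class and pick the class with the larger score.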
The following is the Python code from the book Machine Learning in Action.
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])                        # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)    # change to ones()
    p0Denom = 2.0; p1Denom = 2.0                      # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)    # change to log()
    p0Vect = log(p0Num / p0Denom)    # change to log()
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
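A quick end-to-end run in the same style (the function name testingNB and the test sentence are my own choices, assuming the code above has been loaded):

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        # convert each training document into a 0/1 vector over the vocabulary
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))

testingNB()   # prints: ['love', 'my', 'dalmation'] classified as: 0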
For more information, refer to the following blog posts:
Application of the Naive Bayes classifier
Bayesian inference and Internet applications: filtering spam
Using Naive Bayes for spam classification