Naive Bayes is a supervised learning algorithm. It classifies a test sample by computing, for each class present in the training data, the probability that the sample belongs to that class, and then assigning the sample to the class with the highest probability.
| Aspect | Notes |
| --- | --- |
| Advantages | Still effective with little data; can handle multi-class problems |
| Disadvantages | Sensitive to how the input data is prepared |
| Applicable data types | Nominal values |
Basic concepts
1. Conditional probability
P(A|B) denotes the probability that event A occurs given that event B has already occurred, i.e. the conditional probability of A under B.
The formula is:

    P(A|B) = P(AB) / P(B)
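A tiny numeric check of this formula may help; the counts below are invented purely for illustration:

```python
# Invented counts over 100 trials: B occurred 40 times,
# and A and B occurred together 10 times
n_total = 100
n_b = 40
n_ab = 10

p_b = n_b / n_total        # P(B)  = 0.4
p_ab = n_ab / n_total      # P(AB) = 0.1
p_a_given_b = p_ab / p_b   # P(A|B) = P(AB) / P(B)
print(p_a_given_b)         # 0.25
```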
2. Bayes' formula
When P(A|B) is relatively easy to compute but P(B|A) is hard to compute directly, Bayes' formula can be used.
The formula is:

    P(B|A) = P(A|B) P(B) / P(A)
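As a quick sanity check, the formula can be evaluated numerically. The numbers below (a 1% prior and a test that fires 90% of the time when B holds, 10% otherwise) are illustrative assumptions, not from the original post:

```python
# Assumed illustrative probabilities:
p_b = 0.01             # P(B): prior probability of event B
p_a_given_b = 0.9      # P(A|B): probability of evidence A when B holds
p_a_given_not_b = 0.1  # P(A|~B): probability of A when B does not hold

# Total probability: P(A) = P(A|B)P(B) + P(A|~B)P(~B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes' formula: P(B|A) = P(A|B)P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(round(p_b_given_a, 4))  # 0.0833
```

Even with a fairly accurate test, the low prior keeps P(B|A) small, which is exactly the kind of reweighting the classifier below relies on.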
Algorithm description
1. The core of the algorithm is computing P(Ci|w), where w is a test sample and Ci is one of the classes, i.e. the probability that w belongs to Ci; w is assigned to whichever class Ci yields the largest probability. The formula is:

    P(Ci|w) = P(w|Ci) P(Ci) / P(w)
2. For a given test sample, P(w) is the same fixed value for every class, so it can be ignored; only the numerators need to be compared across classes. (See `classifyNB` in the code.)
3. P(Ci) is the prior probability of class Ci, i.e. the fraction of training samples belonging to Ci, which can be obtained directly. (`pAbusive` in the code.)
4. P(w|Ci): in this article w is a document made up of the words w0, w1, w2, ..., which are assumed to be conditionally independent of one another given the class, so the following holds:

    P(w|Ci) = P(w0|Ci) P(w1|Ci) P(w2|Ci) ...
5. If every incoming test document w were split into w0, w1, w2, ... and P(w0|Ci), P(w1|Ci), P(w2|Ci), ... recomputed each time, efficiency would suffer badly. Instead, P(wj|Ci) should be computed during the training phase for every word wj that occurs and every class Ci, so the values can be looked up directly at test time.
6. P(wj|Ci) is computed by dividing the number of occurrences of wj in class Ci by the total number of words in Ci. (See `trainNB0` in the code.)
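The reason the code stores these probabilities as logarithms becomes clear with a short sketch: multiplying many small per-word probabilities underflows to 0.0 in floating point, while summing their logs stays representable. The probability values below are made up for illustration:

```python
import math

# 300 hypothetical per-word probabilities P(wj|Ci), each small
probs = [0.01] * 300

# The naive product underflows to exactly 0.0 in float arithmetic
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing logarithms keeps the value representable
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # approximately -1381.55, i.e. 300 * log(0.01)
```

Since log is monotonic, comparing the log-probabilities of the classes picks the same winner as comparing the raw probabilities would.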
Algorithm flowchart

(The flowchart image from the original post is not reproduced here.)

Code
```python
# -*- coding: utf-8 -*-
from numpy import array, log, ones

# Load the sample data: tokenized posts and their class labels
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # Whether each item in postingList is abusive language:
    # 0 means not abusive, 1 means abusive
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

# Collect every distinct word in the data set into a vocabulary list
def createVocabList(dataSet):
    vocabSet = set([])  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

# Turn inputSet into a vector of length len(vocabList):
# positions of words present in inputSet are set to 1, the rest stay 0
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec

# Core naive Bayes training function. It produces, for each class, the
# probability of every word in the vocabulary, i.e. P(w|Ci).
# p1Vect/p0Vect are arrays of length numWords: the (log-)probability of
# each word given an abusive/non-abusive document.
# pAbusive is the fraction of abusive documents, i.e. P(Ci).
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)     # number of rows (documents)
    numWords = len(trainMatrix[0])      # number of columns (vocabulary size)
    # trainCategory is 1 for abusive docs and 0 otherwise, so sum() counts
    # the abusive docs; dividing by the total gives their fraction
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # p1Num/p0Num accumulate the word-count vectors of abusive/non-abusive
    # docs; p1Denom/p0Denom accumulate their total word counts
    p0Num = ones(numWords); p1Num = ones(numWords)  # start at 1, not 0,
    p0Denom = 2.0; p1Denom = 2.0                    # to avoid zero probabilities
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)  # take logs to avoid underflow from
    p0Vect = log(p0Num / p0Denom)  # multiplying many tiny numbers
    return p0Vect, p1Vect, pAbusive

# Classify vec2Classify by comparing its score under p1Vec and p0Vec
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # In log space the products in the Bayes numerator become sums.
    # vec2Classify is 1 for present words and 0 otherwise, so the
    # element-wise product sums exactly the log P(wj|Ci) of present words.
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

# Bag-of-words variant of setOfWords2Vec: increments the count for every
# occurrence of a word instead of just setting its position to 1
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Test naive Bayes end to end
def testingNB():
    # Load the data and vectorize each post; trainMat becomes a matrix
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    # train
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    # test 1: should be classified as non-abusive (0)
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    # test 2: should be classified as abusive (1)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))

# Split a string on non-word characters, keeping only tokens longer than 2
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

if __name__ == "__main__":
    testingNB()
```
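To make the difference between the set-of-words and bag-of-words models concrete, here is a small self-contained comparison; the two helpers below re-implement the logic of `setOfWords2Vec` and `bagOfWords2VecMN` so the snippet runs on its own, and the vocabulary and document are invented for illustration:

```python
def set_of_words2vec(vocab_list, input_set):
    # Binary model: a position becomes 1 if the word occurs at all
    vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            vec[vocab_list.index(word)] = 1
    return vec

def bag_of_words2vec(vocab_list, input_set):
    # Count model: a position is incremented for every occurrence
    vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            vec[vocab_list.index(word)] += 1
    return vec

vocab = ['stupid', 'dog', 'my']
doc = ['stupid', 'stupid', 'dog']
print(set_of_words2vec(vocab, doc))  # [1, 1, 0]
print(bag_of_words2vec(vocab, doc))  # [2, 1, 0]
```

The bag-of-words variant preserves how often a word appears, which can matter when a word like "stupid" is repeated within one document.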
Description
This article is a set of reading notes for Chapter 4 ("Classifying with probability theory: naive Bayes") of Machine Learning in Action; the code has been slightly modified and annotated.
References
1. The Algorithm Grocery Store, classification algorithms: naive Bayesian classification
2. Naive Bayes notes, reprinted from http://my.oschina.net/zenglingfan/blog/177517