Probability-based classification: Naive Bayes

Bayesian decision theory

Naive Bayes is built on Bayesian decision theory, so let's take a quick look at Bayesian decision theory before diving into naive Bayes itself.

The core idea of Bayesian decision theory: choose the decision with the highest probability. For example, suppose a graduate is choosing a career direction: the probability of choosing C++ is 0.3, of choosing Java is 0.2, and of choosing machine learning is 0.5. We would then classify this graduate's career direction as machine learning.
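The decision rule is just an argmax over class probabilities. A minimal sketch, using the career-direction numbers from the example above:

```python
# Bayesian decision rule: pick the class with the highest probability.
probs = {'C++': 0.3, 'Java': 0.2, 'machine learning': 0.5}
decision = max(probs, key=probs.get)  # class with the largest probability
print(decision)  # machine learning
```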

Conditional probabilities

What is conditional probability? It is the probability that event A occurs given that another event B is known to have occurred, written P(A|B) and read "the probability of A given B".

Example 1: There are two dice. The first die is thrown and shows a 6; the second die is then thrown. What is the probability that the sum of the two dice is greater than or equal to 10?
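Example 1 can be checked by simple enumeration: since the first die already shows 6, the question reduces to the chance that the second die brings the sum to 10 or more.

```python
# The first die shows 6, so the sum is >= 10 exactly when the
# second die shows 4, 5 or 6.
favorable = [d for d in range(1, 7) if 6 + d >= 10]
prob = len(favorable) / 6
print(favorable, prob)  # [4, 5, 6] 0.5
```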

Example 2: There are three boxes, numbered 1 to 3. Box 1 holds 1 red ball and 4 white balls, box 2 holds 2 red balls and 3 white balls, and box 3 holds 3 red balls. A box is chosen at random and a ball is drawn from it. What is the probability of drawing a red ball?
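Example 2 follows from the law of total probability. Each box is picked with probability 1/3; I am assuming box 3 contains only its 3 red balls, since the text lists no white balls for it.

```python
# Law of total probability: P(red) = sum over boxes of P(box) * P(red | box).
boxes = [(1, 4), (2, 3), (3, 0)]  # (red, white) for boxes 1..3
p_red = sum((1 / 3) * r / (r + w) for r, w in boxes)
print(p_red)  # 8/15, about 0.533
```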

Another effective way to compute conditional probabilities is Bayes' rule. Bayes' rule tells us how to swap the condition and the result in a conditional probability: if P(x|c) is known and P(c|x) is required, we can use the identity P(c|x)·P(x) = P(x|c)·P(c), that is, P(c|x) = P(x|c)·P(c) / P(x).
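Bayes' rule can be illustrated on the box example: having drawn a red ball (the result x), which box (the condition c) did it most likely come from? The priors and likelihoods below assume the box contents from Example 2, with box 3 holding only red balls.

```python
# Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x), applied per box.
prior = 1 / 3                                   # P(c), each box equally likely
likelihood = [1 / 5, 2 / 5, 1.0]                # P(x|c): chance of red per box
evidence = sum(p * prior for p in likelihood)   # P(x) = 8/15
posterior = [p * prior / evidence for p in likelihood]  # P(c|x) per box
print(posterior)  # roughly [0.125, 0.25, 0.625]: box 3 is the best guess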

With Bayesian decision theory and conditional probability covered, let's start writing code and try to implement a naive Bayes classifier in Python.

Naive Bayes classifier

Take an online message board as an example. To keep the community healthy, we want to block insulting speech, so we need to build a fast filter: if a message uses negative or insulting language, the message is flagged as inappropriate. Filtering this kind of content is a basic requirement for many websites. Here we classify messages as insulting or not insulting, labeled 1 and 0 respectively.

```python
# coding: utf-8
from numpy import *

def loadDataSet():
    """Create some experimental posts and their class labels (1 = insulting)."""
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'i', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

def createVocabList(dataSet):
    """Build a list of the unique words appearing across all documents."""
    vocabSet = set()
    for doc in dataSet:
        vocabSet = vocabSet | set(doc)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """Convert a document into a 0/1 vector over the vocabulary."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec
```

The first function, loadDataSet(), creates some experimental samples. The first variable it returns is a collection of documents, posts taken from a dog owners' message board. The second variable is the corresponding list of class labels; there are two classes, insulting (1) and not insulting (0).

The next function, createVocabList(), uses the set data type to build a list of the unique words that appear in all the documents. The "|" operator computes the union of two sets.

The third function, setOfWords2Vec(), takes the vocabulary list and a document as input and outputs a document vector. Each element of the vector is 1 or 0, indicating whether the corresponding vocabulary word appears in the input document.
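The set-of-words idea in one line, on made-up toy data (the vocab and doc below are not from the message-board sample):

```python
# Each vocabulary word maps to 1 if it occurs in the document, else 0.
vocab = ['dog', 'my', 'stupid', 'help']
doc = ['my', 'dog', 'my', 'cat']
vec = [1 if word in doc else 0 for word in vocab]
print(vec)  # [1, 1, 0, 0]
```

Note that 'cat' is not in the vocabulary and is simply ignored, and the repeated 'my' still produces only a 1: this model records presence, not counts.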

Let's take a look at how these functions work in practice.

```python
# coding: utf-8
import bayes

listOPosts, listClasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listOPosts)
print(myVocabList, len(myVocabList))
print(bayes.setOfWords2Vec(myVocabList, listOPosts[0]))
print(bayes.setOfWords2Vec(myVocabList, listOPosts[3]))
```

Calculate probability

```python
def trainNB0(trainMatrix, trainCategory):
    """Compute per-word log-probabilities for each class and the abusive prior."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize counts to 1 and denominators to 2 (Laplace smoothing), so a
    # word never seen in one class does not get probability 0.
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs so the classifier can add instead of multiply (avoids underflow).
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
```
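One detail worth flagging in the training step: if the word counts start at zero, a word that never occurs in one class gets probability 0, and taking its logarithm later yields negative infinity. Standard naive Bayes implementations therefore apply Laplace (add-one) smoothing. A minimal sketch of the effect, on made-up counts:

```python
import numpy as np

counts = np.array([3.0, 0.0, 1.0])   # made-up per-word counts in one class
raw = counts / counts.sum()          # the middle word gets probability 0.0
smoothed = (counts + 1) / (counts.sum() + len(counts))  # add-one smoothing
print(raw[1], smoothed[1])  # 0.0 versus a small positive probability
```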

Everything is ready; all that's missing is the east wind, as the saying goes. Next comes the key piece: the naive Bayes classification function.

```python
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # With log-probabilities, multiplying likelihoods becomes summing logs.
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
```
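Why does the classifier add logarithms instead of multiplying the probabilities directly? With many words, a product of small per-word probabilities underflows 64-bit floats to 0.0, while the sum of their logs stays finite. A quick demonstration:

```python
import numpy as np

p = np.full(1000, 0.01)    # 1000 word probabilities of 0.01 each
print(np.prod(p))          # 0.0: the true value 1e-2000 underflows float64
print(np.sum(np.log(p)))   # about -4605.2, perfectly representable
```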

Let's run testingNB() and check the results of our calculations:

As expected, both test documents are categorized correctly: ['love', 'my', 'dalmation'] is classified as 0 (not insulting) and ['stupid', 'garbage'] as 1 (insulting).

Summary:

Overall, the naive Bayes algorithm is more complex than decision trees and kNN, and it takes noticeably more code. Looking back, I skipped probability theory as a freshman and didn't take it seriously, which is why I'm such a novice today. Ah, I was still young! The conditional probability material involves many more complicated formulas that are still fuzzy to me, so next I'll spend some time on probability. As long as the probability part is solid, I believe reading this code should not be a problem; the classifyNB function is what computes the final probabilities.

Keep going, baiyishaonian!

Machine Learning (Part 4): A classification method based on probability theory: Naive Bayes