Probability theory is the foundation of many machine learning algorithms. The naive Bayes classifier is called "naive" because its entire formalization rests on one very primitive, simple assumption: the problem has many features, and we simply assume that each feature is independent of the others. This is the conditional independence assumption; in real problems the features are usually not completely independent, and then another method, the Bayesian network, is needed. Here we apply the naive Bayes method to the spam-filtering problem.
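Stated in standard notation (a formula the post assumes rather than spells out): given the class $c$, the conditional independence assumption says the joint probability of the features $x_1, \dots, x_n$ factors into a product of per-feature probabilities,

$$P(x_1, x_2, \ldots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c).$$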
Bayesian decision theory as a classification method:
Pros: still works with little data, and can handle problems with multiple classes.
Cons: sensitive to how the input data is prepared; as I understand it, you begin by preparing a sample set that has already been labeled for each class.
Works with: nominal data (nominal values provide only enough information to distinguish one object from another: = or ≠).
The theoretical basis is the Bayes formula we learned in the mathematical statistics course, so it is not re-derived here; we classify by computing and comparing conditional probabilities.
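For reference, the decision rule the code below implements (standard notation, not from the original post): the denominator $P(\mathbf{x})$ is the same for every class, so it can be dropped, and with the independence assumption above the rule becomes

$$\hat{c} = \arg\max_{c} P(c \mid \mathbf{x}) = \arg\max_{c} \frac{P(\mathbf{x} \mid c)\,P(c)}{P(\mathbf{x})} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c).$$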
Python for text categorization:
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])                          # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)     # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word: %s is not in my vocabulary!" % word
    return returnVec
The first function, loadDataSet(), creates the sample data; each post is labeled, 1 for abusive speech and 0 for normal speech.
The next function, createVocabList(), builds a list of the unique words that appear across all of the documents.
The third function, setOfWords2Vec(), builds the document vector: a word that appears in the document gets a 1 in its vocabulary slot, and a word that does not appear gets a 0.
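A quick interactive check of the three helpers (a minimal sketch; the vocabulary order changes between runs because a set imposes no ordering):

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print myVocabList                                 # every unique word, order not fixed
print setOfWords2Vec(myVocabList, listOPosts[0])  # e.g. [0, 1, 0, 0, 1, ...], one 0/1 slot per word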
from numpy import *    # ones(), log(), array() used below come from NumPy

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)    # change to ones()
    p0Denom = 2.0; p1Denom = 2.0                      # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)    # change to log()
    p0Vect = log(p0Num / p0Denom)    # change to log()
    return p0Vect, p1Vect, pAbusive
Two optimizations in this function deserve attention. First, the count vectors are initialized to ones (and the denominators to 2.0), so that a word with zero occurrences in one class cannot make a single conditional probability 0 and thereby force the whole product to 0. Second, the product of many small probabilities becomes an extremely small value, so we take the logarithm, which makes the comparison numerically convenient.
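A minimal sketch (plain Python, nothing beyond the standard library) of the underflow that the log() trick avoids: multiplying a few hundred small per-word probabilities collapses to 0.0 in floating point, while the sum of their logarithms stays perfectly comparable.

from math import log

probs = [0.01] * 200                 # pretend per-word conditional probabilities
product = 1.0
for p in probs:
    product *= p
print product                        # 0.0 -- the product has underflowed
print sum(log(p) for p in probs)     # about -921.03, still fine for comparison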
The classification and testing code:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
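If everything is wired up, running testingNB() should print something close to (the vocabulary order does not affect the labels):

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1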
Application: using naive Bayes to filter spam, with cross-validation.
For data preparation, the email/spam folder holds all of the messages marked as spam, and email/ham holds the normal mail.
def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    # bagOfWords2VecMN, the bag-of-words variant of setOfWords2Vec, is defined
    # in the full bayes.py listing below
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)     # create vocabulary
    trainingSet = range(50); testSet = []    # create test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:    # train the classifier (get probs) with trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        # classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error", docList[docIndex]
    print 'the error rate is: ', float(errorCount) / len(testSet)
    #return vocabList, fullText
textParse() accepts a big string and parses it into a list of token strings.
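A small illustration of textParse() on a made-up sentence (tokens of two characters or fewer are dropped and everything is lower-cased):

print textParse('This book is the best book on Python I have ever laid eyes upon.')
# ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']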
spamTest() randomly picks 10 of the 50 messages as the test set, trains on the remaining 40, and reports the error rate: hold-out cross-validation.
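Because the hold-out split is random, each call to spamTest() can report a different error rate. A hypothetical harness for a steadier estimate, assuming spamTest() is modified to return float(errorCount)/len(testSet) instead of only printing it:

numRuns = 10
totalError = 0.0
for _ in range(numRuns):
    totalError += spamTest()         # assumes spamTest() returns its error rate
print 'average error rate over %d runs: %f' % (numRuns, totalError / numRuns)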
All of the code is collected in a single bayes.py file:
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])                          # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)     # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word: %s is not in my vocabulary!" % word
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)    # change to ones()
    p0Denom = 2.0; p1Denom = 2.0                      # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)    # change to log()
    p0Vect = log(p0Num / p0Denom)    # change to log()
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def textParse(bigString):    # input is a big string, output is a word list
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)     # create vocabulary
    trainingSet = range(50); testSet = []    # create test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:    # train the classifier (get probs) with trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:        # classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print "classification error", docList[docIndex]
    print 'the error rate is: ', float(errorCount) / len(testSet)
    #return vocabList, fullText

if __name__ == "__main__":
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    print myVocabList
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    testingNB()
    spamTest()
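To reproduce the runs, the script expects 25 spam messages in email/spam/1.txt through email/spam/25.txt and 25 normal messages in email/ham/1.txt through email/ham/25.txt next to bayes.py, and it is written for Python 2 (print statements, range() returning a list). Running python bayes.py then prints the vocabulary, the two testingNB() classifications, and one spamTest() error rate.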