This article describes a Python implementation of the naive Bayes algorithm, shared here for reference. The implementation is as follows:
Advantages and disadvantages of the naive Bayes algorithm
Pros: Still effective when little data is available; can handle multi-class problems
Cons: Sensitive to how the input data is prepared
Applicable data type: Nominal data
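Algorithm idea:
For example, suppose we want to determine whether an e-mail message is spam. We know the distribution of the words in this message, and we also need to know how likely each of those words is to appear in spam. The probability that the message is spam can then be obtained with Bayes' theorem.
One assumption of the naive Bayes classifier is that every feature is equally important. It also assumes that the features are independent of one another given the class, which is what makes it "naive".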
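The spam example can be made concrete with a small calculation. The sketch below applies Bayes' theorem to a single word. The probabilities are made-up numbers chosen only for illustration, and `posterior_spam` is a hypothetical helper name that is not part of the code in this article.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' theorem."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Assumed, illustrative numbers: the word appears in 40% of spam,
# in 5% of normal mail, and 20% of all mail is spam.
p = posterior_spam(0.4, 0.05, 0.2)
print(round(p, 3))  # 0.667
```

With these assumed numbers, seeing the word raises the probability of spam from the prior of 0.2 to about 0.667.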
Functions
loaddataset()
Creates a dataset in which each item is a tokenized sentence representing a user comment from a forum; label 1 marks a comment as abusive.
createvocablist(dataset)
Collects all the distinct words that appear in these sentences, which determines the size of the word vectors.
setofwords2vec(vocablist, inputset)
Converts a sentence into a vector over the vocabulary. This uses the Bernoulli model, which only considers whether a word is present.
bagofwords2vecmn(vocablist, inputset)
Another way of turning a sentence into a vector: the multinomial model, which takes into account how many times each word occurs.
trainNB0(trainmatrix, traincatergory)
Calculates P(C[1]), P(w[i] | C[1]) and P(w[i] | C[0]). Two tricks are used here. First, the numerators and denominators are not initialized to 0, so that a single zero probability cannot turn the whole product into 0. Second, the later multiplication is carried out with logarithms, so that limited floating-point precision does not round the result down to 0.
classifynb(vec2classify, p0vec, p1vec, pClass1)
Computes, according to Bayes' formula, a log score for the vector belonging to each of the two classes and returns the class with the larger score.
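The two tricks described for trainNB0 can be looked at in isolation. The following is a minimal sketch using only the standard library. The word counts and the 400-word document are invented for illustration, and `smoothed_probs` is a hypothetical helper name.

```python
import math


def smoothed_probs(counts):
    """Estimate word probabilities with counts starting at 1 and the
    denominator starting at 2, mirroring the initialization in trainNB0."""
    total = 2.0 + sum(counts)
    return [(1.0 + c) / total for c in counts]


# Trick 1: a word never seen in a class would otherwise get probability 0
counts = [3, 0, 1]  # invented counts for three words
raw = [c / float(sum(counts)) for c in counts]
smooth = smoothed_probs(counts)
print(raw)     # the second entry is exactly 0.0
print(smooth)  # every entry is strictly positive

# Trick 2: multiplying many small probabilities underflows to 0.0,
# while summing their logarithms stays finite
probs = [0.01] * 400  # invented per-word probabilities
product = 1.0
for p in probs:
    product *= p
log_sum = sum(math.log(p) for p in probs)
print(product)  # 0.0 because of floating-point underflow
print(log_sum)  # a large negative but finite number
```

Because the logarithm is monotonic, comparing the log scores of two classes gives the same decision as comparing the original products would.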
The code is as follows:
# coding=utf-8
from numpy import array, log, ones
def loaddataset():
    # Six tokenized forum posts; classvec marks each one as abusive (1) or not (0)
    postinglist = [['my', 'dog', 'have', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classvec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postinglist, classvec
# Create a list containing all the words
def createvocablist(dataset):
    vocabset = set()
    for document in dataset:
        vocabset = vocabset | set(document)  # union with the words of this document
    return list(vocabset)
def setofwords2vec(vocablist, inputset):
    # Bernoulli model: 1 if the word occurs in the input, otherwise 0
    retvocablist = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            retvocablist[vocablist.index(word)] = 1
        else:
            print('word', word, 'not in dict')
    return retvocablist
# Another model: count how many times each word occurs
def bagofwords2vecmn(vocablist, inputset):
    returnvec = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            returnvec[vocablist.index(word)] += 1
    return returnvec
def trainNB0(trainmatrix, traincatergory):
    numtraindoc = len(trainmatrix)
    numwords = len(trainmatrix[0])
    pabusive = sum(traincatergory) / float(numtraindoc)
    # Start the counts at 1 and the denominators at 2 so that no single
    # probability in the product is 0
    p0num = ones(numwords)
    p1num = ones(numwords)
    p0denom = 2.0
    p1denom = 2.0
    for i in range(numtraindoc):
        if traincatergory[i] == 1:
            p1num += trainmatrix[i]
            p1denom += sum(trainmatrix[i])
        else:
            p0num += trainmatrix[i]
            p0denom += sum(trainmatrix[i])
    p1vect = log(p1num / p1denom)  # logs for precision; otherwise the product may underflow to zero
    p0vect = log(p0num / p0denom)
    return p0vect, p1vect, pabusive
def classifynb(vec2classify, p0vec, p1vec, pClass1):
    p1 = sum(vec2classify * p1vec) + log(pClass1)  # element-wise mult
    p0 = sum(vec2classify * p0vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
def testingnb():
    listoposts, listclasses = loaddataset()
    myvocablist = createvocablist(listoposts)
    trainmat = []
    for postindoc in listoposts:
        trainmat.append(setofwords2vec(myvocablist, postindoc))
    p0v, p1v, pab = trainNB0(array(trainmat), array(listclasses))
    testentry = ['love', 'my', 'dalmation']
    thisdoc = array(setofwords2vec(myvocablist, testentry))
    print(testentry, 'classified as:', classifynb(thisdoc, p0v, p1v, pab))
    testentry = ['stupid', 'garbage']
    thisdoc = array(setofwords2vec(myvocablist, testentry))
    print(testentry, 'classified as:', classifynb(thisdoc, p0v, p1v, pab))
def main():
    testingnb()

if __name__ == '__main__':
    main()
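To see how the two vector models differ, the following sketch builds both vectors for a sentence that repeats a word. It is an alternative, self-contained formulation rather than the code above: it assumes a precomputed word-to-index dictionary (the hypothetical `word_index`) instead of calling `list.index` for every word, which avoids a linear scan on each lookup. The helper names and the tiny vocabulary are invented for illustration.

```python
def build_index(vocab):
    """Map each vocabulary word to its position (hypothetical helper)."""
    return {word: i for i, word in enumerate(vocab)}


def set_vector(word_index, words):
    """Bernoulli model: 1 if the word is present, else 0."""
    vec = [0] * len(word_index)
    for w in words:
        if w in word_index:
            vec[word_index[w]] = 1
    return vec


def bag_vector(word_index, words):
    """Multinomial model: number of occurrences of each word."""
    vec = [0] * len(word_index)
    for w in words:
        if w in word_index:
            vec[word_index[w]] += 1
    return vec


vocab = ['dog', 'garbage', 'stupid']  # small illustrative vocabulary
word_index = build_index(vocab)
sentence = ['stupid', 'dog', 'stupid', 'cat']  # 'cat' is not in the vocabulary

print(set_vector(word_index, sentence))  # [1, 0, 1]
print(bag_vector(word_index, sentence))  # [1, 0, 2]
```

The repeated word shows up as 2 only in the bag-of-words vector, which is the extra information the multinomial model uses.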
Hopefully this article will help you with Python programming.