Implementing the naive Bayes algorithm in Python

Source: Internet
Author: User

This article shows how to implement the naive Bayes algorithm in Python. It is shared here for your reference. The implementation is as follows:

Advantages and disadvantages of the naive Bayes algorithm

Advantages: It remains effective when little data is available, and it can handle multi-class problems.

Disadvantage: It is sensitive to how the input data is prepared.

Applicable data type: nominal data.

Algorithm idea:

For example, suppose we want to decide whether an email is spam. We know the distribution of the words in this email, so what we still need to know is how often those words appear in spam messages. With that information we can apply Bayes' theorem.
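As a minimal sketch of this idea, the snippet below applies Bayes' theorem to a single word. The probabilities are invented for illustration only, and the function name `posterior_spam` is hypothetical; it is not part of the article's code.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' theorem.

    P(spam | word) = P(word | spam) * P(spam) / P(word), where
    P(word) = P(word | spam) * P(spam) + P(word | ham) * P(ham).
    """
    p_ham = 1.0 - p_spam
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Invented numbers: the word appears in 40% of spam, in 5% of normal
# mail, and 20% of all mail is spam.
p = posterior_spam(0.40, 0.05, 0.20)
print(round(p, 3))  # 0.667
```

With these made-up numbers, seeing the word raises the probability of spam from the prior of 0.20 to about 0.667.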

One assumption of the naive Bayes classifier is that every feature is equally important.

Functions
loadDataSet()

Creates a dataset. Each entry is a tokenized sentence that represents a user comment on a forum, and a label of 1 marks the comment as abusive.

createVocabList(dataSet)

Collects all the distinct words that appear in these sentences, which determines the size of our word vectors.

setOfWords2Vec(vocabList, inputSet)

Converts a sentence into a vector based on the words in it. This is the Bernoulli model, which only considers whether a word is present.

bagOfWords2VecMN(vocabList, inputSet)

This is another model for converting a sentence into a vector: the multinomial model, which takes into account how many times each word occurs.
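The difference between the two models is easiest to see on a sentence that repeats a word. The following is a small sketch, not the article's implementation: the helper `to_vectors`, the toy vocabulary, and the sentence are all hypothetical and chosen only for illustration.

```python
def to_vectors(vocab, words):
    """Return (set-of-words, bag-of-words) vectors for one tokenized sentence."""
    bernoulli = [0] * len(vocab)    # 1 if the word occurs at all
    multinomial = [0] * len(vocab)  # number of times the word occurs
    for w in words:
        if w in vocab:
            i = vocab.index(w)
            bernoulli[i] = 1
            multinomial[i] += 1
    return bernoulli, multinomial


vocab = ['dog', 'stupid', 'love']        # hypothetical toy vocabulary
sentence = ['stupid', 'dog', 'stupid']   # 'stupid' occurs twice
set_vec, bag_vec = to_vectors(vocab, sentence)
print(set_vec)  # [1, 1, 0]
print(bag_vec)  # [1, 2, 0]
```

Only the entry for the repeated word differs: the Bernoulli vector records 1, while the multinomial vector records the count 2.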

trainNB0(trainMatrix, trainCategory)

Calculates P(C[1]), P(w[i] | C[1]) and P(w[i] | C[0]). Two tricks are used here. First, the numerators and denominators are not initialized to 0, so that a single zero probability cannot make the whole product 0. Second, the logarithm is taken afterwards, so the product becomes a sum and the result does not underflow to 0 because of limited floating-point precision.
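Both tricks can be checked with a few lines of plain Python. This is an illustrative sketch with invented counts and probabilities; the "+1" and "+2.0" mirror the initial values used in trainNB0, and all variable names here are hypothetical.

```python
import math

# Trick 1: start counts at 1 (and the denominator at 2), as trainNB0 does.
counts = [3, 0, 1]                    # invented word counts for one class
total = sum(counts)
unsmoothed = [c / float(total) for c in counts]
smoothed = [(c + 1) / (total + 2.0) for c in counts]

zero_product = 1.0
for p in unsmoothed:
    zero_product *= p                 # the single 0.0 wipes out the product
print(zero_product)                   # 0.0
print(min(smoothed) > 0)              # True

# Trick 2: add logarithms instead of multiplying probabilities.
tiny = [1e-5] * 100                   # 100 small (invented) probabilities
underflow_product = 1.0
for p in tiny:
    underflow_product *= p            # underflows to 0.0 in double precision
log_sum = sum(math.log(p) for p in tiny)
print(underflow_product)              # 0.0
print(log_sum)                        # roughly -1151.29
```

The direct product is indistinguishable from zero, while the sum of logarithms is still an ordinary number that can be compared across classes.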

classifyNB(vec2Classify, p0Vec, p1Vec, pClass1)

Uses Bayes' formula to determine which of the two classes the vector is more likely to belong to.
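This comparison can be written compactly in log form. The following is a sketch in this article's notation, where c_0 and c_1 stand for C[0] and C[1], and x_i is an added symbol (not used in the article) for the i-th entry of the vector being classified:

```latex
\[
\mathrm{score}(c_k) = \log P(c_k) + \sum_{i} x_i \,\log P(w_i \mid c_k),
\qquad k \in \{0, 1\}
\]
\[
\hat{c} =
\begin{cases}
1 & \text{if } \mathrm{score}(c_1) > \mathrm{score}(c_0) \\
0 & \text{otherwise}
\end{cases}
\]
```

The evidence term P(w) is the same for both classes, so it can be left out of the comparison.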

The code is as follows:

# coding=utf-8
from numpy import array, log, ones
def loadDataSet():
    # Each inner list is one tokenized forum post
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec

# Build a list of all the unique words in the dataset
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)
   
def setOfWords2Vec(vocabList, inputSet):
    # Bernoulli model: mark each vocabulary word as present (1) or absent (0)
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('word %s not in dict' % word)
    return returnVec

# The other model: multinomial (bag of words), which counts occurrences
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    # Prior probability of class 1 (abusive)
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Start counts at 1 and denominators at 2 so that a single zero
    # probability cannot make the whole product zero
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs for numerical precision; otherwise the product of many
    # small probabilities is very likely to underflow to zero
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Sums of log-probabilities stand in for products of probabilities
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)  # element-wise multiplication
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    # Convert every post into a set-of-words vector
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print('%s classified as: %d' % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print('%s classified as: %d' % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))


def main():
    testingNB()

if __name__ == '__main__':
    main()

I hope this article will help you with your Python programming.
