The general process of naive Bayes
1. Collect data: any source can be used; this article uses RSS feeds.
2. Prepare data: numeric or Boolean data is required.
3. Analyze data: with a large number of features, plotting each one is not very informative; a histogram works better.
4. Train the algorithm: compute the conditional probabilities of the different independent features.
5. Test the algorithm: compute the error rate.
6. Use the algorithm: a common naive Bayes application is document classification. A naive Bayes classifier can be used in any classification scenario, not necessarily text.

1. Prepare the data: making word vectors from text
We will treat each text as a vector of words or terms, that is, convert a sentence into a vector. Consider all the words that appear across all documents, then decide which of them to put into the vocabulary (the word list you want to use), and finally convert each document into a vector over that vocabulary. Now let's get started. Open a text editor, create a new file called bayes.py, and add the following program listing to it.
# coding=utf-8
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'have', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive (insulting language), 0 is not
    return postingList, classVec

# Create a list containing every unique word in all documents
def createVocabList(dataSet):
    vocabSet = set([])  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    retVocabList = [0] * len(vocabList)  # create a vector whose elements are all 0
    for word in inputSet:
        if word in vocabList:
            retVocabList[vocabList.index(word)] = 1
        else:
            print('word', word, 'not in Dict')
    return retVocabList
The first function, loadDataSet(), creates some experimental samples. The first variable the function returns is a collection of documents that have been split into tokens; these documents come from the message board of a Dalmatian lovers' site. The message texts have been cut into sets of terms and punctuation has been removed; the details of text processing are discussed later. The second variable returned by loadDataSet() is a collection of category labels. There are two categories, insulting and non-insulting. The categories of these texts were labeled manually, and the labels are used to train the program to automatically detect insulting messages.

The next function, createVocabList(), creates a list of the non-repeating words that appear in all documents, using Python's set data type. When a list of terms is passed to the set constructor, set returns a collection of unique words. The function first creates an empty set, then adds the set of new words from each document to it. The | operator computes the union of two sets; it is also the bitwise OR operator.

Given a vocabulary, you can use the function setOfWords2Vec(). The function's input parameters are a vocabulary and a document; the output is a document vector whose elements are 1 or 0, indicating whether each word in the vocabulary appears in the input document. The function first creates a vector of the same length as the vocabulary with all elements set to 0. It then iterates over all the words in the document, and if a word from the vocabulary appears, it sets the corresponding value in the output vector to 1.
Now let's look at these functions in action.

2. Analyze the data
import bayes
listOPosts, listClasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listOPosts)
print(myVocabList)
['dog', 'buying', 'garbage', 'flea', 'problems', 'not', 'is', 'cute', 'posting', 'have', 'quit', 'how', 'worthless', 'stop', 'mr', 'dalmation', 'maybe', 'licks', 'I', 'ate', 'park', 'my', 'him', 'help', 'love', 'food', 'please', 'steak', 'stupid', 'so', 'take', 'to']
Check the vocabulary above and you will find that there are no duplicate words. The vocabulary is currently not sorted; if needed, it can be sorted later.
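Note that Python's set does not guarantee any ordering, so the vocabulary order you see may differ between runs. If a deterministic order ever matters, a sorted copy is easy to obtain (a trivial sketch using a small hypothetical sample list):

```python
vocab = ['dog', 'buying', 'garbage', 'flea', 'problems']  # hypothetical sample
print(sorted(vocab))  # alphabetical copy; the original list order is unchanged
```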
Next, let's look at how the function setOfWords2Vec() works:
import bayes
print(myVocabList)
print(bayes.setOfWords2Vec(myVocabList, listOPosts[0]))
print(bayes.setOfWords2Vec(myVocabList, listOPosts[3]))
['dog', 'buying', 'garbage', 'flea', 'problems', 'not', 'is', 'cute', 'posting', 'have', 'quit', 'how', 'worthless', 'stop', 'mr', 'dalmation', 'maybe', 'licks', 'I', 'ate', 'park', 'my', 'him', 'help', 'love', 'food', 'please', 'steak', 'stupid', 'so', 'take', 'to']
[1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
The function takes a vocabulary (all the words you want to check) as input and constructs a feature for each of them. Once a document is given (a message from the Dalmatian lovers' site), the document is converted into a word vector. Next, check that the function works correctly. Which word is the element with index 0 in myVocabList? It should be the word dog. The word appears in the first document; now check whether it appears in the fourth document (it does not).

3. Train the algorithm: computing probabilities from word vectors
The naive Bayes classifier training function:

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = zeros(numWords)  # numerators; later changed to ones() to keep any single probability from being 0
    p1Num = zeros(numWords)
    p0Denom = 0.0  # denominators
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # later wrapped in log() for numerical precision, otherwise the product can underflow to zero
    p0Vect = p0Num / p0Denom
    return p0Vect, p1Vect, pAbusive
Next, test it:

import bayes
import imp
imp.reload(bayes)
listOPosts, listClasses = bayes.loadDataSet()
# Build the list myVocabList that contains all the words
myVocabList = bayes.createVocabList(listOPosts)
trainMat = []
# The for loop fills trainMat with word vectors
for postinDoc in listOPosts:
    trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))

The probability of an insulting document, and the probability vectors for the two categories, are given below:

p0V, p1V, pAb = bayes.trainNB0(trainMat, listClasses)
print('p0v:', p0V)
print('p1v:', p1V)
print('pab:', pAb)
p0v: [0.04166667 0.         0.         0.04166667 0.04166667 0.
 0.04166667 0.04166667 0.         0.04166667 0.         0.04166667
 0.         0.04166667 0.04166667 0.04166667 0.         0.04166667
 0.04166667 0.04166667 0.         0.125      0.08333333 0.04166667
 0.04166667 0.         0.04166667 0.04166667 0.         0.04166667
 0.         0.04166667]
p1v: [0.10526316 0.05263158 0.05263158 0.         0.         0.05263158
 0.         0.         0.05263158 0.         0.05263158 0.
 0.10526316 0.05263158 0.         0.         0.05263158 0.
 0.         0.         0.05263158 0.         0.05263158 0.
 0.         0.05263158 0.         0.         0.15789474 0.
 0.05263158 0.05263158]
pab: 0.5

pab is the probability that any given document is an insulting document.
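These values are plain frequency ratios, so they can be re-derived by hand as a sanity check (the word counts below are read off the six posts returned by loadDataSet()):

```python
# Class 1 (abusive) posts contain 8 + 5 + 6 = 19 words in total;
# class 0 (non-abusive) posts contain 7 + 8 + 9 = 24 words.
print(round(1 / 19, 8))  # a word seen once in class 1  -> 0.05263158
print(round(3 / 19, 8))  # 'stupid' occurs 3x in class 1 -> 0.15789474
print(round(1 / 24, 8))  # a word seen once in class 0  -> 0.04166667
print(round(3 / 24, 8))  # 'my' occurs 3x in class 0    -> 0.125
```

These ratios match the entries in p0v and p1v above.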
First, we find that the probability that a document belongs to the insulting class (pab) is 0.5, which is correct. Next, look at the conditional probabilities of the vocabulary words given each document category and check whether they are correct. The second word in the vocabulary is buying, which appears once in category 1 and never in category 0; the corresponding conditional probabilities are 0.05263158 and 0.0, respectively. The calculation is correct. Now find the maximum value among all the probabilities: it is 0.15789474, at index 28 of the p1v array. Index 28 of myVocabList is the word stupid. This means that stupid is the word most indicative of category 1 (the insulting document class). Before using this function to classify, we still need to fix some defects in it.

The naive Bayes classification function:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)  # element-wise multiplication
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
Next, test it:

def main():
    testingNB()

if __name__ == '__main__':
    main()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1
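The second result can be traced by hand: classifyNB() sums the log-probabilities of the words that occur, plus the log of the class prior, and picks the class with the larger total. Using smoothed estimates of the form (count + 1) / (total + 2), with the word counts taken from the six training posts, the arithmetic for ['stupid', 'garbage'] looks roughly like this (a sketch, not output from the program):

```python
import math

# 'stupid' occurs 3x and 'garbage' 1x among the 19 class-1 words;
# neither occurs among the 24 class-0 words. With the +1/+2 smoothing:
p1 = math.log(4 / 21) + math.log(2 / 21) + math.log(0.5)  # class 1 score
p0 = math.log(1 / 26) + math.log(1 / 26) + math.log(0.5)  # class 0 score
print(1 if p1 > p0 else 0)  # 1 -> classified as abusive
```

Because both words are strongly associated with the abusive posts, p1 comfortably exceeds p0.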
Try making some changes to the text and see what the classifier outputs. This example is very simple, but it shows how a naive Bayes classifier works. Next, we will make some changes to the code so that the classifier works better.