The general process of naive Bayes


1. Collect data: any data source can be used; this article uses RSS feeds.

2. Prepare data: numeric or Boolean data is required.

3. Analyze data: with a large number of features, plotting individual features is of little use; histograms work better.

4. Train the algorithm: compute the conditional probabilities of the different independent features (see the formula after this list).

5. Test the algorithm: compute the error rate.

6. Use the algorithm: a common naive Bayes application is document classification, but a naive Bayes classifier can be used in any classification scenario, not just text.
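Step 4 rests on Bayes' rule. Writing w for a document's word vector and c_i for a class, and making the "naive" assumption that the words are conditionally independent given the class, the standard formulation (stated here for reference; it is not spelled out in the original text) is:

$$ p(c_i \mid w) = \frac{p(w \mid c_i)\,p(c_i)}{p(w)}, \qquad p(w \mid c_i) = \prod_j p(w_j \mid c_i) $$

A document is assigned to the class with the larger posterior; p(w) is the same for both classes, so it can be ignored when comparing.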

1. Prepare the data: converting text to word vectors

We will treat text as a vector of words or tokens, that is, convert each sentence into a vector. Consider all the words appearing in all of the documents, decide which of them to put into the vocabulary, and then convert each document into a vector over that vocabulary. Now let's get started: open a text editor, create a new file called bayes.py, and add the following code to it.

# coding=utf-8
from numpy import *


def loadDataSet():
    postingList = [['my', 'dog', 'have', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'Park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['Mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 means abusive (insulting), 0 means not
    return postingList, classVec


# Create a list of all unique words
def createVocabList(dataSet):
    vocabSet = set([])  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    retVocabList = [0] * len(vocabList)  # create a vector of all zeros
    for word in inputSet:
        if word in vocabList:
            retVocabList[vocabList.index(word)] = 1
        else:
            print('word', word, 'not in dict')
    return retVocabList

 

The first function, loadDataSet(), creates some experimental samples. The first variable it returns is a collection of tokenized documents taken from the message boards of a Dalmatian lovers' site. The texts have been split into sets of tokens and punctuation has been removed; the details of text processing are discussed later. The second variable returned by loadDataSet() is a collection of category labels. There are two categories, insulting and non-insulting. The categories of these texts are labeled manually, and the label information is used to train a program that automatically detects insulting messages.

The next function, createVocabList(), creates a list of the non-repeating words appearing in all documents, using Python's set data type. Passing a list of tokens to the set constructor returns a set of unique words. The function first creates an empty set, then adds the set of new words from each document to it. The | operator computes the union of two sets; it is also Python's bitwise OR operator.

Once you have the vocabulary, you can use the function setOfWords2Vec(). Its input parameters are a vocabulary and a document; the output is a document vector whose elements are 1 or 0, indicating whether each vocabulary word appears in the input document. The function first creates a vector of the same length as the vocabulary and sets its elements to 0. It then iterates over all the words in the document, and if a word from the vocabulary appears, it sets the corresponding value in the output vector to 1.

Now let's look at how these functions perform.

2. Analyze the data

import bayes

listOPosts, listClasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listOPosts)
print(myVocabList)

['dog', 'buying', 'garbage', 'flea', 'problems', 'not', 'is', 'cute', 'posting', 'have', 'quit', 'how', 'worthless', 'stop', 'Mr', 'dalmation', 'maybe', 'licks', 'I', 'ate', 'Park', 'my', 'him', 'help', 'love', 'food', 'please', 'steak', 'stupid', 'so', 'take', 'to']

Check the vocabulary above and you will see that it contains no duplicate words. The vocabulary is currently unsorted; if needed, it can be sorted, as sketched below.
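The order of a Python set is not guaranteed, so the list above may come out differently between runs. If a stable ordering is wanted (an optional tweak, not part of the original code), one way is:

myVocabList = sorted(bayes.createVocabList(listOPosts))  # alphabetical, reproducible order

Note that sorting changes which index each word lands at, so the index numbers quoted later in this article assume the unsorted list.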

Here's a look at how the function setOfWords2Vec() works:

import bayes

print(myVocabList)
print(bayes.setOfWords2Vec(myVocabList, listOPosts[0]))
print(bayes.setOfWords2Vec(myVocabList, listOPosts[3]))

['dog', 'buying', 'garbage', 'flea', 'problems', 'not', 'is', 'cute', 'posting', 'have', 'quit', 'how', 'worthless', 'stop', 'Mr', 'dalmation', 'maybe', 'licks', 'I', 'ate', 'Park', 'my', 'him', 'help', 'love', 'food', 'please', 'steak', 'stupid', 'so', 'take', 'to']
[1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
The function takes the vocabulary (all the words you want to check) as input and constructs a feature for each word. Given a document (a post from the Dalmatian message board), it converts the document into a word vector. Next, check that the function works. Which word is the element with index 0 in myVocabList? It should be 'dog'. That word appears in the first document; now check whether it appears in the fourth document (it does not). The quick session below confirms this.
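The following check is a reconstruction (not in the original text) and assumes the vocabulary order printed above:

print(myVocabList[0])                                        # 'dog'
print(bayes.setOfWords2Vec(myVocabList, listOPosts[0])[0])   # 1: 'dog' appears in the first post
print(bayes.setOfWords2Vec(myVocabList, listOPosts[3])[0])   # 0: 'dog' is absent from the fourth post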
3. Training algorithm: calculating probabilities from word vectors
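Concretely, the training step estimates probabilities from frequency counts. Stated here for clarity (these are the standard maximum-likelihood estimates that the code below computes):

$$ p(c_1) = \frac{\text{number of abusive documents}}{\text{total number of documents}}, \qquad p(w_j \mid c_i) = \frac{\text{count of word } j \text{ in class-}i\text{ documents}}{\text{total word count in class-}i\text{ documents}} $$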

The naive Bayes classifier training function:

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = zeros(numWords)  # per-word numerators for class 0
    p1Num = zeros(numWords)  # per-word numerators for class 1
    p0Denom = 0.0            # denominators: total word count in each class
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # raw ratios; replaced by logs later to avoid
    p0Vect = p0Num / p0Denom  # zero products and floating-point underflow
    return p0Vect, p1Vect, pAbusive


Next, test it:

import bayes
import imp
imp.reload(bayes)

listOPosts, listClasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listOPosts)  # build the list containing all words
trainMat = []
for postinDoc in listOPosts:
    trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))  # populate the list with word vectors
# Print the probability of an insulting document and the two per-class probability vectors.
p0V, p1V, pAb = bayes.trainNB0(trainMat, listClasses)
print("p0V:", p0V)
print("p1V:", p1V)
print("pAb:", pAb)

p0V: [0.04166667 0.         0.         0.04166667 0.04166667 0.
 0.04166667 0.04166667 0.         0.04166667 0.         0.04166667
 0.         0.04166667 0.04166667 0.04166667 0.         0.04166667
 0.04166667 0.04166667 0.         0.125      0.08333333 0.04166667
 0.04166667 0.         0.04166667 0.04166667 0.         0.04166667
 0.         0.04166667]
p1V: [0.10526316 0.05263158 0.05263158 0.         0.         0.05263158
 0.         0.         0.05263158 0.         0.05263158 0.
 0.10526316 0.05263158 0.         0.         0.05263158 0.
 0.         0.         0.05263158 0.         0.05263158 0.
 0.         0.05263158 0.         0.         0.15789474 0.
 0.05263158 0.05263158]
pAb: 0.5

pAb is the probability that any given document is an insulting document.

First, we see that the probability that a document belongs to the insulting class (pAb) is 0.5, which is correct. Next, look at the conditional probabilities of the vocabulary words given each document category and check whether they are right. The second word in the vocabulary is 'buying', which appears once in category 1 and never in category 0; the corresponding conditional probabilities are 0.0 (category 0) and 0.05263158 (category 1). The calculation is correct. Now find the maximum value among all the probabilities: it is 0.15789474, at index 28 of the p1V array. The word at index 28 of myVocabList is 'stupid', which means 'stupid' is the word most indicative of category 1 (the insulting class). Before using this function to classify, we still need to fix some defects in trainNB0; the sketch below shows the usual fix.
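The defects are twofold: if any word has a zero count in a class, the product of probabilities collapses to zero, and multiplying many small probabilities underflows floating point. The standard remedy, which the original text alludes to without showing, is to initialize every count to 1, every denominator to 2, and return log probabilities; a sketch of the adjusted trainNB0 under those assumptions:

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords)  # start counts at 1 so no conditional probability is 0
    p1Num = ones(numWords)
    p0Denom = 2.0           # start denominators at 2 to match
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)  # logs turn tiny products into manageable sums
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

The classification function below assumes this adjusted version, since it adds log probabilities rather than multiplying raw ones. The naive Bayes classification function: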

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # element-wise multiplication selects the log probabilities of the words present
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

Next, run the test:

def main():
    testingNB()


if __name__ == '__main__':
    main()

['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1

Try making some changes to the text and see what the classifier outputs. This example is very simple, but it shows how a naive Bayes classifier works. Next, we'll make some changes to the code to make the classifier work better; one such change is sketched below.
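A common improvement (the usual next step, not shown in the original text) is to move from the set-of-words model, which only records whether a word appears, to a bag-of-words model that counts every occurrence:

def bagOfWords2VecMN(vocabList, inputSet):
    # like setOfWords2Vec, but counts occurrences instead of recording presence
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

Repeated words then contribute more weight to the class probabilities, which usually helps on longer documents.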
