4 Classification method based on probability theory: Naive Bayes


4.5 Using Python for text categorization

4.5.1 Preparing data: Building word vectors from text

# coding: utf-8
from numpy import *

# Prepare the data: construct word vectors from text
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]  # documents already split into tokens
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = insulting post, 0 = normal speech
    return postingList, classVec

# Create a vocabulary list
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with each document's word set
    return list(vocabSet)

# Convert a document into a vector over the vocabulary (set-of-words model)
def setOfWords2Vec(vocabList, inputSet):  # input: vocabulary list, one document
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word: %s is not in my vocabulary!" % word
    return returnVec
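A quick check of these functions in the Python interpreter might look like the following (a minimal sketch; note that set() ordering is arbitrary, so the vocabulary order, and therefore the vector indices, can differ from run to run):

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print myVocabList                                 # all unique words across the six posts
print setOfWords2Vec(myVocabList, listOPosts[0])  # 0/1 vector for the first post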

4.5.2 Training algorithm: calculating probabilities from word vectors

# Training algorithm: compute the probability of each word under each class
def trainNB0(trainMatrix, trainCategory):  # input: document matrix, class-label vector
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability P(class = 1)
    p0Num = zeros(numWords); p1Num = zeros(numWords)     # numerators: arrays
    p0Denom = 0.0; p1Denom = 0.0                         # denominators: floats
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:           # class 1
            p1Num += trainMatrix[i]         # numerator
            p1Denom += sum(trainMatrix[i])  # denominator
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # conditional probabilities P(w | class = 1)
    p0Vect = p0Num / p0Denom  # conditional probabilities P(w | class = 0)
    return p0Vect, p1Vect, pAbusive
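Continuing the interpreter sketch above, each post is first converted to a word vector and the resulting matrix is handed to trainNB0(). The values below follow from the toy data: three of the six posts are abusive, and 'stupid' appears in all three of them (3 occurrences out of 19 words in the abusive posts):

trainMat = []
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
print pAb                               # prior P(abusive) = 3/6 = 0.5
print p1V[myVocabList.index('stupid')]  # 3/19, about 0.158 -- the largest P(w|abusive)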

4.5.3 Test algorithm: Modifying the classifier for real-world conditions

Laplace smoothing

The classifier multiplies conditional probabilities together: P(w0|1)P(w1|1)P(w2|1)... If any one of them is 0, the final product is also 0. To reduce this effect, initialize all word-occurrence counts to 1 and the denominators to 2 (Laplace smoothing).

Open bayes.py and change the initialization of the numerators and denominators in trainNB0() to:

p0Num = ones(numWords); p1Num = ones(numWords)  # counts start at 1
p0Denom = 2.0; p1Denom = 2.0                    # denominators start at 2
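For instance, if a word never occurs in the abusive training posts, which contain 19 words in total, the unsmoothed estimate is 0/19 = 0 and zeroes out the entire product, whereas the smoothed estimate is (0 + 1)/(19 + 2) = 1/21: small, but no longer fatal to the classification.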

Another problem is underflow: multiplying many small probabilities together can round down to 0 in floating point. One solution is to work with the natural logarithm of the product; since ln(a*b) = ln(a) + ln(b) and ln is monotonically increasing, taking logarithms avoids the underflow without changing which class wins the comparison.
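A minimal standalone illustration of the underflow problem (not from the book; it just multiplies 100 small probabilities):

from math import log
probs = [1e-5] * 100
product = 1.0
logSum = 0.0
for p in probs:
    product *= p     # 1e-500 is below the smallest positive float
    logSum += log(p)
print product        # 0.0 -- underflow
print logSum         # about -1151.3 -- still perfectly usable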

Change the two lines that compute p1Vect and p0Vect, just before the return in trainNB0(), to:

p1Vect = log(p1Num / p1Denom)
p0Vect = log(p0Num / p0Denom)

Then add the following code to bayes.py:

# Test algorithm: modify the classifier for real-world conditions
# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):  # first argument: the vector to classify
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)       # element-wise multiply, then sum the log probabilities
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():  # convenience function: wraps all of the operations
    listOPosts, listClasses = loadDataSet()    # load the data
    myVocabList = createVocabList(listOPosts)  # build the vocabulary
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb)
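Running testingNB() should print something close to the following, since 'stupid' and 'garbage' appear only in the abusive posts while 'love', 'my' and 'dalmation' appear only in the normal ones:

['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1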

4.5.4 Preparing data: The bag-of-words document model

So far we have treated each word's presence or absence as a feature; this approach is known as the set-of-words model.

If instead each word can appear multiple times, so that the feature records how often a word occurs, the approach is known as the bag-of-words model.

To accommodate the bag-of-words model, setOfWords2Vec() needs only a slight modification: each time a word is encountered, the function increments the corresponding count in the word vector instead of just setting the entry to 1.

# Convert a document into a vector over the vocabulary: bag-of-words model
def bagOfWords2Vec(vocabList, inputSet):  # input: vocabulary list, one document
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # count occurrences instead of setting to 1
    return returnVec
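A quick comparison of the two models on a document with a repeated word (a sketch, reusing myVocabList from the earlier interpreter session): the set-of-words vector caps 'my' at 1, while the bag-of-words vector records 2.

doc = ['my', 'dog', 'ate', 'my', 'steak']
vecSet = setOfWords2Vec(myVocabList, doc)
vecBag = bagOfWords2Vec(myVocabList, doc)
print vecSet[myVocabList.index('my')]  # 1 -- presence only
print vecBag[myVocabList.index('my')]  # 2 -- occurrence count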

Now that the classifier has been built, we will next use it to filter spam e-mail.
