4.5 Using Python for text categorization
4.5.1 Preparing data: Building word vectors from text
# coding: utf-8
from numpy import *

# Prepare data: construct word vectors from text
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]  # collection of tokenized documents
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive speech, 0 = normal speech
    return postingList, classVec

# Create a vocabulary list
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

# Convert a group of words into a vector of numbers over the vocabulary
def setOfWords2Vec(vocabList, inputSet):  # input: vocabulary list, a document
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word: %s is not in my vocabulary!" % word
    return returnVec
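To check these functions, here is a short interactive session (a sketch, assuming the code above is saved as bayes.py; the order of words in the vocabulary will vary because it is built from a set):

>>> import bayes
>>> listOPosts, listClasses = bayes.loadDataSet()
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> len(myVocabList)    # 32 distinct words appear in the six sample posts
32
>>> vec = bayes.setOfWords2Vec(myVocabList, listOPosts[0])
>>> sum(vec)            # the first post contains 7 distinct vocabulary words
7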
4.5.2 Training algorithm: calculating probabilities from word vectors
# Training algorithm: compute the probability of each word under each category
def trainNB0(trainMatrix, trainCategory):  # input: document matrix, vector of document class labels
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability of class 1
    p0Num = zeros(numWords); p1Num = zeros(numWords)  # numerators: arrays
    p0Denom = 0.0; p1Denom = 0.0                      # denominators: floats
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:            # category is 1
            p1Num += trainMatrix[i]          # numerator
            p1Denom += sum(trainMatrix[i])   # denominator
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # conditional probabilities P(w|1)
    p0Vect = p0Num / p0Denom  # conditional probabilities P(w|0)
    return p0Vect, p1Vect, pAbusive
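Continuing the session, we can train on the sample posts. Three of the six documents are labeled abusive, so the prior probability pAbusive comes out to 0.5:

>>> trainMat = []
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
...
>>> p0V, p1V, pAb = bayes.trainNB0(trainMat, listClasses)
>>> pAb
0.5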
4.5.3 Test algorithm: Modify the classifier according to real conditions
Laplace smoothing
When computing the product of conditional probabilities P(w0|1)P(w1|1)P(w2|1)..., if any one of them is 0, the final product is also 0. To reduce this effect, all word occurrence counts can be initialized to 1 and the denominators initialized to 2.
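In other words, with this initialization the estimate for each word effectively becomes

P(w_i | c) = (count of w_i in class c + 1) / (total word count in class c + 2)

so the numerator is at least 1 and no single factor can zero out the whole product.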
Open bayes.py and change the 4th and 5th lines of trainNB0() to:

p0Num = ones(numWords); p1Num = ones(numWords)
p0Denom = 2.0; p1Denom = 2.0
Another problem is underflow: multiplying many very small probabilities can round the product down to 0. One solution is to take the natural logarithm; since ln(a*b) = ln(a) + ln(b) and the logarithm is monotonically increasing, comparing the log probabilities loses nothing.
Also change the two lines just before the return in trainNB0() to:

p1Vect = log(p1Num / p1Denom)
p0Vect = log(p0Num / p0Denom)
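Putting both modifications together, trainNB0() now reads as follows (a sketch of the revised function; the loop body is unchanged):

# Training algorithm with smoothed counts and log probabilities
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability
    p0Num = ones(numWords); p1Num = ones(numWords)  # counts initialized to 1
    p0Denom = 2.0; p1Denom = 2.0                    # denominators initialized to 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)  # log conditional probabilities avoid underflow
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive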
Add the following code to bayes.py:
# Test algorithm: modify the classifier according to real conditions
# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):  # first input: the vector to classify
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)       # element-wise multiply, then sum of logs
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():  # convenience function: wraps all the operations
    listOPosts, listClasses = loadDataSet()    # load the data
    myVocabList = createVocabList(listOPosts)  # build the vocabulary
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb)
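Running the convenience function should print something like the following (output for the sample data above):

>>> reload(bayes)
>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1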
4.5.4 Preparing data: The bag-of-words document model
So far, we have used the presence or absence of each word as a feature; this is known as the set-of-words model.
If instead a word can count each time it appears, we have what is known as the bag-of-words model.
To accommodate the bag-of-words model, setOfWords2Vec() needs a slight modification: whenever a word is encountered, the corresponding value in the word vector is incremented, rather than just set to 1, as shown below.
# Convert a group of words into a vector over the vocabulary: bag-of-words model
def bagOfWords2Vec(vocabList, inputSet):  # input: vocabulary list, a document
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
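To see the difference between the two models, compare the vectors for a document in which a word repeats (a minimal sketch with a hypothetical three-word vocabulary):

>>> vocab = ['stupid', 'garbage', 'dog']
>>> doc = ['stupid', 'garbage', 'stupid']
>>> bayes.setOfWords2Vec(vocab, doc)
[1, 1, 0]
>>> bayes.bagOfWords2Vec(vocab, doc)
[2, 1, 0]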
Now that the classifier has been built, we will use it to filter spam e-mail.