4.5 Using Python for text categorization
4.5.1 Preparing data: Building word vectors from text
# coding: utf-8
from numpy import *

# Prepare data: construct word vectors from text
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]  # collection of tokenized documents
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive speech, 0 = normal speech
    return postingList, classVec

# Create a vocabulary list
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

# Convert a group of words into a vector of numbers over the vocabulary
def setOfWords2Vec(vocabList, inputSet):  # input: vocabulary list, a document
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word: %s is not in my vocabulary!" % word
    return returnVec
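To check these functions, here is a short interactive session (a sketch, assuming the code above is saved as bayes.py; the order of words in the vocabulary will vary because it is built from a set):

>>> import bayes
>>> listOPosts, listClasses = bayes.loadDataSet()
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> len(myVocabList)    # 32 distinct words appear in the six sample posts
32
>>> vec = bayes.setOfWords2Vec(myVocabList, listOPosts[0])
>>> sum(vec)            # the first post contains 7 distinct vocabulary words
7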
4.5.2 Training algorithm: calculating probabilities from word vectors
# Training algorithm: compute the probability of each word under each category
def trainNB0(trainMatrix, trainCategory):  # input: document matrix, vector of document class labels
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability of class 1
    p0Num = zeros(numWords); p1Num = zeros(numWords)  # numerators: arrays
    p0Denom = 0.0; p1Denom = 0.0                      # denominators: floats
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:            # category is 1
            p1Num += trainMatrix[i]          # numerator
            p1Denom += sum(trainMatrix[i])   # denominator
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom  # conditional probabilities P(w|1)
    p0Vect = p0Num / p0Denom  # conditional probabilities P(w|0)
    return p0Vect, p1Vect, pAbusive
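Continuing the session, we can train on the sample posts. Three of the six documents are labeled abusive, so the prior probability pAbusive comes out to 0.5:

>>> trainMat = []
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
...
>>> p0V, p1V, pAb = bayes.trainNB0(trainMat, listClasses)
>>> pAb
0.5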
4.5.3 Test algorithm: Modify the classifier according to real conditions
Laplace smoothing
When computing the product of conditional probabilities P(w0|1)P(w1|1)P(w2|1)..., if any one of them is 0, the final product is also 0. To reduce this effect, all word occurrence counts can be initialized to 1 and the denominators initialized to 2.
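In other words, with this initialization the estimate for each word effectively becomes

P(w_i | c) = (count of w_i in class c + 1) / (total word count in class c + 2)

so the numerator is at least 1 and no single factor can zero out the whole product.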
Open bayes.py and change the 4th and 5th lines of trainNB0() to:

p0Num = ones(numWords); p1Num = ones(numWords)
p0Denom = 2.0; p1Denom = 2.0
Another problem is underflow: multiplying many very small probabilities can round the product down to 0. One solution is to take the natural logarithm; since ln(a*b) = ln(a) + ln(b) and the logarithm is monotonically increasing, comparing the log probabilities loses nothing.
Also change the two lines just before the return in trainNB0() to:

p1Vect = log(p1Num / p1Denom)
p0Vect = log(p0Num / p0Denom)
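Putting both modifications together, trainNB0() now reads as follows (a sketch of the revised function; the loop body is unchanged):

# Training algorithm with smoothed counts and log probabilities
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior probability
    p0Num = ones(numWords); p1Num = ones(numWords)  # counts initialized to 1
    p0Denom = 2.0; p1Denom = 2.0                    # denominators initialized to 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)  # log conditional probabilities avoid underflow
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive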
Add the following code to bayes.py:
# Test algorithm: modify the classifier according to real conditions
# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):  # first input: the vector to classify
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)       # element-wise multiply, then sum of logs
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():  # convenience function: wraps all the operations
    listOPosts, listClasses = loadDataSet()    # load the data
    myVocabList = createVocabList(listOPosts)  # build the vocabulary
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb)
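Running the convenience function should print something like the following (output for the sample data above):

>>> reload(bayes)
>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1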
4.5.4 Preparing data: The bag-of-words document model
So far, we have used the presence or absence of each word as a feature; this is known as the set-of-words model.
If instead a word can count each time it appears, we have what is known as the bag-of-words model.
To accommodate the bag-of-words model, setOfWords2Vec() needs a slight modification: whenever a word is encountered, the corresponding value in the word vector is incremented, rather than just set to 1, as shown below.
# Convert a group of words into a vector over the vocabulary: bag-of-words model
def bagOfWords2Vec(vocabList, inputSet):  # input: vocabulary list, a document
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
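To see the difference between the two models, compare the vectors for a document in which a word repeats (a minimal sketch with a hypothetical three-word vocabulary):

>>> vocab = ['stupid', 'garbage', 'dog']
>>> doc = ['stupid', 'garbage', 'stupid']
>>> bayes.setOfWords2Vec(vocab, doc)
[1, 1, 0]
>>> bayes.bagOfWords2Vec(vocab, doc)
[2, 1, 0]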
Now that the classifier has been built, we will use it to filter spam e-mail.