If I can't write my thesis, I'll run away from it for a while.
The hard part is getting started, taking the first step.
Come on.
========================================================================================
I won't repeat the principles of naive Bayes here; there is plenty of material on the Internet already.
First, create the dataset used in the document-classification example:
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # class label for each document
    return postingList, classVec
The above function creates a small dataset of six documents, each with its own class label (this example has only two classes, 0 and 1).
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:                    # loop over every document in the dataset and extract its words
        vocabSet = vocabSet | set(document)     # set union collects every distinct word
    return list(vocabSet)
This function converts the document set into a vocabulary that contains every word appearing in the document set.
Bayesian document classification uses this vocabulary to turn each document into a feature vector whose entries are 0 or 1, indicating whether the corresponding word is absent or present.
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)       # vector of zeros, one entry per vocabulary word
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec
This function first creates a vector with the same length as the vocabulary; the output marks whether each word of the document appears in the vocabulary, and so converts the document into a word vector.
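As a quick illustration of the conversion, here is a small sketch assuming the three functions above sit in the same file (the exact positions of the 1s depend on the vocabulary order, which set() does not fix):

listPost, listClass = loadDataSet()
myVoc = createVocabList(listPost)
print(setOfWords2Vec(myVoc, listPost[0]))
# a list with len(myVoc) entries: 1 where a word of the first posting appears in the vocabulary, 0 elsewhere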
The naive Bayes classifier training function:
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)                        # number of training documents
    numWords = len(trainMatrix[0])                         # vocabulary length, taken from the first row
    pAbusive = sum(trainCategory) / float(numTrainDocs)    # class probability; only classes 0 and 1 appear here
    p0Num = zeros(numWords)
    p1Num = zeros(numWords)        # pXNum: a vector as long as the vocabulary, counting how often each word appears
    p0Denom = 0.0
    p1Denom = 0.0                  # pXDenom: total number of words in class X
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vec = p1Num / p1Denom        # frequency of each vocabulary word within class 1
    p0Vec = p0Num / p0Denom        # frequency of each vocabulary word within class 0
    return p0Vec, p1Vec, pAbusive
To make the parameters clear, recall the earlier functions: postingList is the collection of documents, one document per row, so the number of rows is the number of documents; classVec holds one value per document, giving each document's class. createVocabList merges these documents into a vocabulary with no repeated words, and setOfWords2Vec maps one document's words onto that vocabulary, turning the document into a vector. The first argument of trainNB0 is therefore the N*M matrix of those vectors, where N is the number of documents and M is the length of the vocabulary; trainCategory is the vector of class labels, one per document.
Code for testing:
from numpy import *
import bayes

listPost, listClass = bayes.loadDataSet()
myVoc = bayes.createVocabList(listPost)
trainMat = []
for postInDoc in listPost:
    trainMat.append(bayes.setOfWords2Vec(myVoc, postInDoc))
p0V, p1V, pAb = bayes.trainNB0(trainMat, listClass)
print(myVoc)
print(p0V)
print(p1V)
print(pAb)
Here is what the naive Bayes training function outputs:
the probability of each vocabulary word appearing within each class (the conditional probabilities p0V and p1V),
and the probability of each class itself (the prior probability, pAb).
In this example pAb comes out to 0.5, which means classes 0 and 1 are equally likely.
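As a quick sanity check of that prior, using only the labels from loadDataSet above:

classVec = [0, 1, 0, 1, 0, 1]                      # the class labels from loadDataSet()
pAbusive = sum(classVec) / float(len(classVec))    # 3 abusive documents out of 6 -> 0.5
print(pAbusive)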
Two changes are needed in practice:
1. Initialization
When classifying a document, naive Bayes multiplies several probabilities together to obtain the probability that the document belongs to a class; that is, the per-class probability of every word in the document is multiplied to get the probability of the whole document under that class. If any single word probability is 0, the whole product is also 0. The book therefore initializes every word count to 1 and each denominator to 2, so each word's estimate becomes (count of the word in the class + 1) / (total words in the class + 2).
p0Num = ones(numWords)
p1Num = ones(numWords)      # word counts initialized to 1 instead of 0
p0Denom = 2.0
p1Denom = 2.0               # denominators initialized to 2
2. Underflow
Because many small numbers are multiplied together, the product easily underflows and ends up rounded to 0.
The solution is to take the logarithm of the product:
ln(a*b) = ln(a) + ln(b)
The specific code is:
p1Vec = log(p1Num / p1Denom)
p0Vec = log(p0Num / p0Denom)
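To see the underflow concretely, here is a small sketch with made-up numbers (not from the book):

from numpy import *
probs = array([1e-5] * 100)     # 100 small per-word "probabilities"
print(probs.prod())             # the product, 1e-500, underflows to 0.0 in double precision
print(log(probs).sum())         # about -1151.3, which is perfectly representable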
The last step is to put the pieces together and classify:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)          # log-probability score for class 1
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)    # log-probability score for class 0
    if p1 > p0:
        return 1
    else:
        return 0
For the vector to be classified, the log-probability of each class is computed (the sum of the log word probabilities of the words that are present, plus the log of the class prior), and the class with the larger value is returned.
def testingNB():
    listPost, listClass = loadDataSet()
    myVoc = createVocabList(listPost)
    trainMat = []
    for postInDoc in listPost:
        trainMat.append(setOfWords2Vec(myVoc, postInDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClass)
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVoc, testEntry))
    print(testEntry, 'classified as ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVoc, testEntry))
    print(testEntry, 'classified as ', classifyNB(thisDoc, p0V, p1V, pAb))
This consolidates the above steps and runs two test cases.
Test results:
Use naive Bayes to filter spam
def textParse(bigString):
    import re
    # split on runs of non-word characters (\W+ avoids the empty-match splits that \W* causes on newer Python)
    listOfTokens = re.split(r'\W+', bigString)
    return [token.lower() for token in listOfTokens if len(token) > 2]
Here I made a mistake: in the regular expression I typed a lowercase w instead of the uppercase \W, so the result was wrong. How could I make such a silly mistake.

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)                          # positive examples (spam)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)                          # negative examples (ham)
    vocabulary = createVocabList(docList)
    trainingSet = list(range(50)); testSet = []
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        # random.uniform(a, b) generates a random floating-point number within [a, b)
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    # randomly pick 10 documents as the test set; the rest form the training set
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabulary, docList[docIndex]))
        trainClasses.append(classList[docIndex])     # collect the selected training documents
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabulary, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1                          # count an error when the prediction disagrees with the label
    print('the error rate is: ', float(errorCount) / len(testSet))
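To see why the lowercase w mentioned above gives the wrong result, here is a quick comparison on a made-up string (using + rather than * so the behaviour is the same on all recent Python versions):

import re

s = 'This book is the best book on Python!'
print(re.split(r'\w+', s))   # lowercase w splits on the words themselves, keeping only spaces and punctuation
print(re.split(r'\W+', s))   # uppercase W splits on the separators, keeping the words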
One change from the book: the original line
del(trainingSet[randIndex])
raised
TypeError: 'range' object doesn't support item deletion
so when initializing trainingSet I changed it to a list.
def calcMostFreq(vocabulary, fullText):
    import operator
    freqDict = {}
    for token in vocabulary:
        freqDict[token] = fullText.count(token)     # count how often each vocabulary word occurs
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]                          # the 30 most frequent words
def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)                          # the two RSS feeds serve as positive and negative examples
    vocabulary = createVocabList(docList)            # build the vocabulary
    top30Words = calcMostFreq(vocabulary, fullText)  # get the 30 most frequent words
    for pairW in top30Words:
        if pairW[0] in vocabulary:
            vocabulary.remove(pairW[0])              # remove the top-30 words from the vocabulary
    trainingSet = list(range(2 * minLen)); testSet = []
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])                  # random split into training and test sets; the test set has 20 documents
    trainMat = []; trainClass = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabulary, docList[docIndex]))
        trainClass.append(classList[docIndex])       # convert the training documents into count (bag-of-words) features
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClass))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabulary, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount) / len(testSet))
    return vocabulary, p0V, p1V
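localWords calls bagOfWords2VecMN, which is not listed in this post; a minimal sketch of it, consistent with the bag-of-words (count) features mentioned in the comment above, would be:

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1   # count occurrences instead of only marking presence
    return returnVec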
Here, again, I had to make the same modification:
trainingSet = list(range(2*minLen))
I don't know whether other readers have run into this problem; this treatment works correctly for me.
Code for testing:
import feedparser
import bayes

ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
vocabulary, pSF, pNY = bayes.localWords(ny, sf)
The result varies from run to run because the test and training sets are sampled randomly; averaging the error rate over several runs gives a steadier estimate.
Displaying the most characteristic words:
def getTopWords(ny, sf):
    import operator
    vocabulary, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabulary[i], p0V[i]))
        if p1V[i] > -6.0:
            topNY.append((vocabulary[i], p1V[i]))
    # sort by the second element of each pair, i.e. the log probability
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY")
    for item in sortedNY:
        print(item[0])
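To run it with the two feeds from the test code above (assuming the functions are collected in bayes.py as before):

bayes.getTopWords(ny, sf)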
=========================================================================================
Download and install feedparser
Download address: click to open the link
Installation: first change into the unpacked folder on the command line,
then run the command python setup.py install
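Alternatively, if pip is available, feedparser can be installed with a single command:

pip install feedparser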