Python Learning: Machine Learning in Action, ch04 — Naive Bayes (Python)


When I can't make progress on my thesis, I escape into this instead.

Starting is the hardest part; just take the first step.

Come on.

========================================================================================

I won't repeat the theory behind Bayes' rule here; there is plenty of material on it online.


First, create a small dataset for the document-classification example:

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

The function above creates a small dataset of six documents, each with its own class label (this example has only two classes, 0 and 1).

def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        # extract the words of each document in the dataset;
        # the set union accumulates them without duplicates
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

This function converts the document collection into a vocabulary list containing every word that appears in the collection.

Bayesian document classification uses the vocabulary to convert each document into a feature vector, where the values 1 and 0 mark whether each word is present or absent.

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # a vector whose elements are all 0
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my vocabulary!" % word)
    return returnVec
The function first creates a vector as long as the vocabulary; the output marks whether each word of the document appears in the vocabulary, thereby converting the document into a word vector.
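As a quick illustration, here is a minimal, self-contained sketch of the two helpers above applied to a made-up two-document corpus (the functions are reproduced inline so the snippet runs on its own):

```python
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the word sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # mark the word as present
    return returnVec

docs = [['dog', 'park', 'stupid'], ['my', 'dog', 'is', 'cute']]
vocab = createVocabList(docs)      # 6 distinct words, in arbitrary set order
vec = setOfWords2Vec(vocab, ['stupid', 'dog'])
print(vocab)
print(vec)  # 1 at the positions of 'stupid' and 'dog', 0 elsewhere
```

Because the vocabulary comes from a set, its word order varies between runs; only the positions of the 1s are meaningful.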

Naive Bayesian classifier training function:

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    # number of documents in the training set
    numWords = len(trainMatrix[0])
    # vocabulary length, taken from the first row
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # probability of class 1; this example has only classes 0 and 1
    p0Num = zeros(numWords)
    p1Num = zeros(numWords)
    # pXNum is a vector as long as the vocabulary, counting word occurrences
    p0Denom = 0.0
    p1Denom = 0.0
    # pXDenom is the total number of words in class X
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])

    p1Vec = p1Num / p1Denom
    p0Vec = p0Num / p0Denom
    # frequency of each vocabulary word within each class
    return p0Vec, p1Vec, pAbusive

First, be clear about the parameters. Combining the earlier functions: postingList is the document collection, one document per row, with as many rows as documents. classVec has one value per document, giving its class. createVocabList merges the documents into a vocabulary with no duplicate words, and setOfWords2Vec maps one document's words onto that vocabulary, turning the document into a vector. The first parameter of trainNB0 is the n*m matrix of those vectors (n is the number of documents, m the vocabulary length); trainCategory is the vector of per-document classes.
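To make the shapes concrete, here is a sketch running trainNB0 (reproduced inline in its unsmoothed form, so the snippet runs on its own) on a made-up 4x3 matrix: four documents, a vocabulary of three words:

```python
from numpy import zeros

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # P(class = 1)
    p0Num = zeros(numWords); p1Num = zeros(numWords)     # per-word counts
    p0Denom = 0.0; p1Denom = 0.0                         # total words per class
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]; p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]; p0Denom += sum(trainMatrix[i])
    return p0Num / p0Denom, p1Num / p1Denom, pAbusive

trainMat = [[1, 0, 1],   # doc 0, class 0
            [0, 1, 1],   # doc 1, class 1
            [1, 1, 0],   # doc 2, class 0
            [0, 0, 1]]   # doc 3, class 1
labels = [0, 1, 0, 1]
p0V, p1V, pAb = trainNB0(trainMat, labels)
print(pAb)   # 2 of the 4 documents are class 1, so 0.5
print(p0V)   # word frequencies within class 0: counts [2,1,1] over 4 words
print(p1V)   # word frequencies within class 1: counts [0,1,2] over 3 words
```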


Code for testing:

from numpy import *
import bayes

listPost, listClass = bayes.loadDataSet()
myVoc = bayes.createVocabList(listPost)

trainMat = []
for postInDoc in listPost:
    trainMat.append(bayes.setOfWords2Vec(myVoc, postInDoc))

p0V, p1V, pAb = bayes.trainNB0(trainMat, listClass)
print(myVoc)
print(p0V)
print(p1V)
print(pAb)

The naive Bayes training function outputs:

The probability of each vocabulary word appearing within each class (the conditional probabilities p0V and p1V)

The probability of each class (the prior probability pAb)

In this example pAb comes out to 0.5, meaning classes 0 and 1 are equally likely.


Modifications needed in practice:

1. Initialization issues

When classifying a document with Bayes, several probabilities must be multiplied to obtain the probability that the document belongs to a class.

That is, the conditional probabilities of all the words in the document are multiplied together within each class to get the probability of the whole document given that class.

But if any one of those probabilities is 0, the whole product becomes 0. So the book initializes every word's occurrence count to 1 and the denominators to 2:

    p0Num = ones(numWords)
    p1Num = ones(numWords)
    # the word counts are initialized to 1
    p0Denom = 2.0
    p1Denom = 2.0
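A tiny numeric sketch (with made-up counts) shows why a single zero count is fatal, and how the smoothed initialization fixes it:

```python
from numpy import zeros, ones

counts = zeros(3)
counts += [2, 3, 0]        # the third word never appears in this class
denom = 5.0

probs = counts / denom     # raw frequency estimates: [0.4, 0.6, 0.0]
print(probs[0] * probs[2]) # any document containing word 3 scores 0

# counts start at 1 and the denominator at 2, as in the modified code
smoothed = (ones(3) + [2, 3, 0]) / (2.0 + denom)
print(smoothed[0] * smoothed[2])  # now strictly positive
```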

2. Underflow

Because many small numbers are multiplied together, underflow occurs easily, and the result is eventually rounded to 0.

The solution is to take the logarithm of the product:

ln (a*b) =ln (a) +ln (b)

The specific code is:

    p1Vec = log(p1Num / p1Denom)
    p0Vec = log(p0Num / p0Denom)
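The underflow is easy to reproduce: multiplying one hundred probabilities of 10^-5 (a made-up but plausible per-word value) underflows a double to 0.0, while the sum of the logs stays perfectly representable:

```python
from math import log

p = 1e-5          # a small per-word probability
product = 1.0
logsum = 0.0
for _ in range(100):   # a 100-word document
    product *= p
    logsum += log(p)

print(product)    # 1e-500 cannot be represented: underflows to 0.0
print(logsum)     # about -1151.3, no problem at all
```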


The last step is to put the pieces above together and classify:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

For a vector to be classified, the (log) probability of each class is computed, and the class with the larger value is the result.

def testingNB():
    listPost, listClass = loadDataSet()
    myVoc = createVocabList(listPost)
    trainMat = []
    for postInDoc in listPost:
        trainMat.append(setOfWords2Vec(myVoc, postInDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClass)

    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVoc, testEntry))
    print(testEntry, 'classified as', classifyNB(thisDoc, p0V, p1V, pAb))

    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVoc, testEntry))
    print(testEntry, 'classified as', classifyNB(thisDoc, p0V, p1V, pAb))

This consolidates the steps above and runs two test cases.

Test results:



Use naive Bayes to filter spam

def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)  # split on runs of non-word characters
    return [token.lower() for token in listOfTokens if len(token) > 2]
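The tokenizer is sensitive to the case of the `w` in the regular expression. A quick check on a made-up sentence shows the difference between the correct uppercase `\W` (split on non-word characters) and the lowercase `\w` (split on the words themselves):

```python
import re

sample = "This book is the best book on Python I have ever laid eyes upon."

# correct: split on runs of NON-word characters (uppercase W)
good = [t.lower() for t in re.split(r'\W+', sample) if len(t) > 2]
print(good)

# the lowercase-w version splits on the words themselves, leaving only
# whitespace and punctuation fragments, all of which the length filter drops
bad = [t.lower() for t in re.split(r'\w+', sample) if len(t) > 2]
print(bad)   # []
```

Note also that the book's original `\W*` pattern, which can match the empty string, behaves badly on recent Python versions; `\W+` is the usual fix.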
Here I made a mistake: in the regular expression I typed a lowercase w instead of \W, so the result was wrong. How could I make such a silly mistake?

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)  # positive examples (spam)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)  # negative examples (ham)
    vocabulary = createVocabList(docList)
    trainingSet = list(range(50))
    testSet = []
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        # the random module generates random numbers;
        # random.uniform(a, b) produces a random float in [a, b)
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    # randomly pick 10 documents as the test set; the rest form the training set
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabulary, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    # convert each selected training document into a word vector
    p0V, p1V, pSpam = trainNB0(trainMat, trainClasses)
    errorCount = 0
    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabulary, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1  # count a classification that disagrees with the true class
    print('the error rate is:', float(errorCount) / len(testSet))

I changed one place relative to the book: the original code raised

    del(trainingSet[randIndex]) TypeError: 'range' object doesn't support item deletion

so at initialization I changed trainingSet to a list.
def calcMostFreq(vocabulary, fullText):
    import operator
    freqDict = {}
    for token in vocabulary:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]  # the 30 most frequent words

def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
        # the two RSS feeds serve as the positive and negative examples
    vocabulary = createVocabList(docList)            # build the vocabulary
    top30Words = calcMostFreq(vocabulary, fullText)  # get the 30 most frequent words
    for pairW in top30Words:
        if pairW[0] in vocabulary:
            vocabulary.remove(pairW[0])              # remove the top-30 words
    trainingSet = list(range(2 * minLen)); testSet = []
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    # randomly choose training and test sets; the test set holds 20 documents
    trainMat = []; trainClass = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabulary, docList[docIndex]))
        trainClass.append(classList[docIndex])
    # convert each training document into frequency (bag-of-words) features
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClass))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabulary, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount) / len(testSet))
    return vocabulary, p0V, p1V
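localWords calls bagOfWords2VecMN, which never appears in this post. In the bag-of-words model each position holds a count rather than a 0/1 presence flag; here is a sketch of what the book's version looks like (reconstructed from memory, so treat the exact form as an assumption):

```python
def bagOfWords2VecMN(vocabList, inputSet):
    # bag-of-words model: count occurrences instead of flagging presence
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

vocab = ['dog', 'stupid', 'park']
print(bagOfWords2VecMN(vocab, ['dog', 'stupid', 'dog']))  # [2, 1, 0]
```

Compared with setOfWords2Vec, the only change is `+= 1` in place of `= 1`, so repeated words contribute their frequency.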


Here too I made the same change:

    trainingSet = list(range(2 * minLen))

I don't know whether other readers have run into this problem, but this fix works.

Code for testing:

import feedparser
import bayes

ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
vocabulary, pSF, pNY = bayes.localWords(ny, sf)

Because the test and training sets are sampled randomly, the result changes from run to run.

Displaying the most characteristic words:

def getTopWords(ny, sf):
    import operator
    vocabulary, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabulary[i], p0V[i]))
        if p1V[i] > -6.0:
            topNY.append((vocabulary[i], p1V[i]))
    # sort by the second element of each pair (the log probability)
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY")
    for item in sortedNY:
        print(item[0])


=========================================================================================

Downloading and installing feedparser

Download address: click to open the link

Installation: first cd into the unpacked folder,

     then run python setup.py install
