Using Naive Bayes to Reveal Regional Tendencies from Personal Ads (Python)

Background: Advertisers often want specific demographic information about a person so that they can better target their advertising.

We will take personal ads from two cities in the United States and analyze the text people post there to see whether the two cities differ. If they do differ, which words does each city use most often? And from those words, can we get a better sense of what people in different cities care about?

1. Collect Data: Import RSS Feeds

We will download the text with Python using the feedparser package. Browse the documentation at http://code.google.com/p/feedparser/, download Feedparser, unpack the package, and change the current directory to the folder containing the extracted files. Then run from the command line:

python setup.py install

Create a bayes.py file and add the following code:

from numpy import *

# Create a list of every unique word that appears in the documents
def createVocabList(dataSet):
    vocabSet = set([])                        # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union of the two sets
    return list(vocabSet)

# Bag-of-words model: convert a document into a vector of word counts
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)          # vector whose elements are all 0
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Naive Bayes classifier training function
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize counts to 1 and denominators to 2 when computing
    # p(w0|1), p(w1|1), ... so that a single zero probability cannot
    # make the whole product zero
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logarithms: most of the factors are very small, so multiplying
    # many of them would underflow and round to 0; sums of logs avoid this
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

# Text parsing function
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
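The training and classification logic above can be checked on a tiny hand-built dataset. The sketch below reimplements the same steps (Laplace smoothing, log probabilities, comparing log scores) in plain Python 3 without NumPy; the names train_nb, classify_nb, and the toy vocabulary are illustrative, not from the original bayes.py:

```python
from math import log

def train_nb(train_matrix, labels):
    """Train naive Bayes with Laplace smoothing; return log-probability vectors."""
    num_docs, num_words = len(train_matrix), len(train_matrix[0])
    p_class1 = sum(labels) / float(num_docs)
    p0_num = [1.0] * num_words; p1_num = [1.0] * num_words  # counts start at 1
    p0_denom = 2.0; p1_denom = 2.0                          # denominators start at 2
    for vec, label in zip(train_matrix, labels):
        if label == 1:
            p1_num = [a + b for a, b in zip(p1_num, vec)]
            p1_denom += sum(vec)
        else:
            p0_num = [a + b for a, b in zip(p0_num, vec)]
            p0_denom += sum(vec)
    # log avoids underflow when many small probabilities are multiplied
    p0_log = [log(n / p0_denom) for n in p0_num]
    p1_log = [log(n / p1_denom) for n in p1_num]
    return p0_log, p1_log, p_class1

def classify_nb(vec, p0_log, p1_log, p_class1):
    """Compare log posterior scores of the two classes."""
    p1 = sum(v * p for v, p in zip(vec, p1_log)) + log(p_class1)
    p0 = sum(v * p for v, p in zip(vec, p0_log)) + log(1.0 - p_class1)
    return 1 if p1 > p0 else 0

# Toy vocabulary: ['hike', 'beach', 'subway', 'pizza']
train = [[2, 1, 0, 0],   # class 0 ("SF-like" post)
         [1, 2, 0, 0],   # class 0
         [0, 0, 2, 1],   # class 1 ("NY-like" post)
         [0, 0, 1, 2]]   # class 1
labels = [0, 0, 1, 1]
p0, p1, pc1 = train_nb(train, labels)
print(classify_nb([1, 1, 0, 0], p0, p1, pc1))  # → 0
print(classify_nb([0, 0, 1, 1], p0, p1, pc1))  # → 1
```

A post using "SF-like" words scores higher under class 0 and vice versa, which is exactly the comparison classifyNB() performs.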

Add the following code:

# RSS feed classifier and high-frequency word removal functions
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    # Remove the 30 most frequent words; most of them are stop words
    top30Words = calcMostFreq(vocabList, fullText)
    for pairW in top30Words:
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])
    trainingSet = range(2 * minLen); testSet = []
    # Randomly hold out 20 documents for testing
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is:', float(errorCount) / len(testSet)
    return vocabList, p0V, p1V
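The high-frequency-word removal step in localWords() amounts to dropping the most common tokens from the vocabulary before training. A minimal Python 3 sketch of that idea (remove_top_words and the toy token lists are hypothetical, not part of bayes.py):

```python
from collections import Counter

def remove_top_words(vocab, full_text, n=30):
    """Drop the n most frequent tokens from the vocabulary (crude stop-word removal)."""
    top = [word for word, count in Counter(full_text).most_common(n)]
    return [word for word in vocab if word not in top]

# 'the' dominates the token stream, so it is removed from the vocabulary
full_text = ['the', 'the', 'the', 'hike', 'the', 'beach', 'hike']
vocab = ['the', 'hike', 'beach']
print(remove_top_words(vocab, full_text, n=1))  # → ['hike', 'beach']
```

Removing these words matters because the most frequent tokens ("the", "and", ...) appear in both cities' ads and would otherwise dominate the probability estimates without carrying any class information.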

The function localWords() takes two RSS feeds as parameters. The feeds are parsed outside the function because RSS feeds change over time; re-parsing them later will produce new data.

>>> reload(bayes)
<module 'bayes' from 'bayes.pyc'>
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.2
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.3
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.55

To get an accurate estimate of the error rate, the experiment above should be repeated several times and the results averaged.
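That averaging can be sketched as follows. Since localWords() depends on live feeds, the example below uses a random stand-in function in place of one classifier run; average_error and one_run are illustrative names, not from bayes.py:

```python
import random

def average_error(run_once, trials=10):
    """Average the error rate over several random hold-out splits."""
    return sum(run_once() for _ in range(trials)) / trials

random.seed(0)
# Stand-in for one run of localWords(): an error rate that varies by split
one_run = lambda: random.uniform(0.2, 0.4)
print(round(average_error(one_run, trials=50), 3))
```

With the real classifier, run_once would parse the feeds once and call localWords(), so each trial draws a fresh random train/test split.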

2. Analyze Data: Display Region-Related Words

The vectors pSF and pNY can be sorted first and then printed in order. Add the following code to the file:

# Most characteristic words display function
def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0: topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0: topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "sf**sf**sf**sf**sf**sf**sf**sf**sf**sf**sf**sf**sf**sf**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "ny**ny**ny**ny**ny**ny**ny**ny**ny**ny**ny**ny**ny**ny**"
    for item in sortedNY:
        print item[0]

The function getTopWords() takes two RSS feeds as input, trains and tests the naive Bayes classifier, and returns the probability vectors it used. It then creates two lists to store (word, probability) tuples. Unlike earlier examples that return only the top X highest-ranked words, this function returns every word whose conditional probability exceeds a threshold, sorted by that probability.
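The threshold-and-sort step can be illustrated with a few hypothetical (word, log-probability) pairs like those getTopWords() builds:

```python
# Hypothetical (word, log-probability) pairs, as built inside getTopWords()
p_sf = [('hike', -4.2), ('rare', -8.7), ('beach', -5.1), ('coffee', -5.9)]

# Keep words above the -6.0 threshold, most probable first
top_sf = sorted((p for p in p_sf if p[1] > -6.0),
                key=lambda pair: pair[1], reverse=True)
print([word for word, logp in top_sf])  # → ['hike', 'beach', 'coffee']
```

Note the values are log probabilities (from trainNB0), so a fixed cutoff such as -6.0 selects words whose conditional probability exceeds e^-6 ≈ 0.0025.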

Save the bayes.py file and enter the following at the Python prompt:

>>> bayes.getTopWords(ny, sf)
