Background: Advertisers often want specific demographic information about a person so that they can better target their advertising.
We will choose two cities in the United States, analyze the personal ads people there publish, and compare whether the two cities differ in the language of those ads. If they do differ, which words does each city use most often? And from people's words, can we gain a better understanding of what people in different cities care about?
1. Collect Data: Import RSS Feeds
We will download the text with Python using the Feedparser package. Browse the related documentation at http://code.google.com/p/feedparser/, download the package and unpack it, switch the current directory to the folder containing the extracted files, and then run the following from the command line:

python setup.py install
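To confirm the installation and to see the feed fields used later in this post, you can parse one of the Craigslist feeds in an interactive session. This is just a quick sanity check; the output depends on whatever posts the feed currently contains:

>>> import feedparser
>>> d = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> len(d['entries'])           # number of posts fetched
>>> d['entries'][0]['summary']  # text of the first post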
Create a bayes.py file and add the following code:
from numpy import *

# Create a list of the unique words appearing across all documents
def createVocabList(dataSet):
    vocabSet = set([])                       # start with an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with each document's words
    return list(vocabSet)

# Bag-of-words model: count how often each vocabulary word occurs in a document
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)         # vector of zeros, one slot per vocabulary word
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Naive Bayes classifier training function
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize counts to ones (and denominators to 2.0) so that no
    # conditional probability p(wi|c) is 0, which would zero out the whole product
    p0Num = ones(numWords); p1Num = ones(numWords)
    p0Denom = 2.0; p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logs: multiplying many small probabilities underflows to 0,
    # so we work with log-probabilities instead
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

# Text parsing: split on non-word characters, keep lowercased tokens longer than two characters
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W*', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
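Before moving on to RSS data, it can help to smoke-test these functions on a tiny dataset. The documents and labels below are made up purely for illustration:

postingList = [['my', 'dog', 'has', 'flea', 'problems'],
               ['stop', 'posting', 'stupid', 'garbage'],
               ['my', 'dalmation', 'is', 'so', 'cute']]
classList = [0, 1, 0]  # 1 = abusive, 0 = normal (invented labels)
vocabList = createVocabList(postingList)
trainMat = [bagOfWords2VecMN(vocabList, doc) for doc in postingList]
p0V, p1V, pAb = trainNB0(array(trainMat), array(classList))
testVec = array(bagOfWords2VecMN(vocabList, ['stupid', 'garbage']))
print classifyNB(testVec, p0V, p1V, pAb)  # should print 1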
Add the following code:
# Return the 30 most frequent words in the text
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

# RSS feed classifier with high-frequency word removal
def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    # Remove the 30 most frequent words: these are mostly stop words
    # that carry little region-specific information
    top30Words = calcMostFreq(vocabList, fullText)
    for pairW in top30Words:
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])
    # Randomly hold out 20 documents as the test set
    trainingSet = range(2 * minLen); testSet = []
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is: ', float(errorCount) / len(testSet)
    return vocabList, p0V, p1V
The function localWords() takes two RSS feeds as parameters. The feeds are parsed outside the function because RSS feeds change over time: re-parsing them later would produce new data.
>>> reload(bayes)
<module 'bayes' from 'bayes.pyc'>
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.2
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.3
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.55
To obtain an accurate estimate of the error rate, this experiment should be repeated several times and the results averaged, as in the sketch below.
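A minimal averaging harness might look like this. averageErrorRate() is a hypothetical helper, and it assumes localWords() has been modified to also return its error rate as a fourth value:

def averageErrorRate(ny, sf, numTrials=10):
    # Repeat the random hold-out test and average the error rates
    total = 0.0
    for _ in range(numTrials):
        vocabList, p0V, p1V, errorRate = localWords(ny, sf)  # assumes modified return
        total += errorRate
    return total / numTrials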
2. Analyze Data: Display Region-Related Words
The vectors pSF and pNY can be sorted and then printed in order. Add the following code to the file:
# Display the most characteristic words for each region
def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []
    # Keep every word whose conditional log-probability exceeds the threshold
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0:
            topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]
The function getTopWords() takes two RSS feeds as input, trains and tests the naive Bayes classifier, and receives the probability vectors it computed. It then creates two lists to store (word, probability) tuples. Unlike calcMostFreq(), which returns only a fixed number of top-ranked words, this function returns all words whose conditional log-probability exceeds a threshold, sorted by that probability.
Save the bayes.py file and call the new function at the Python prompt, as shown below.
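A session along these lines, assuming ny and sf are the feeds parsed earlier; localWords() first prints the error rate, then the SF and NY word lists follow in descending probability. The words themselves vary with the current feed contents, so the output is elided here:

>>> reload(bayes)
>>> bayes.getTopWords(ny, sf)
the error rate is: ...
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
...
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
...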