4.7 Example: using a naive Bayes classifier to discover regional tendencies from personal ads
Two applications of naive Bayes were described earlier: (1) filtering malicious posts from online message boards; (2) filtering spam email.
4.7.1 Collecting data: Importing RSS Feeds
Universal Feed Parser (the feedparser module) is the most commonly used RSS library in Python. At the Python prompt, enter:
Next we build a function, similar to the earlier spamTest(), to automate the testing process.
# RSS feed classifier and frequent-word removal functions
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:            # count how many times each vocabulary word appears in the text
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.iteritems(), key = operator.itemgetter(1), reverse = True)
    return sortedFreq[:30]             # return the 30 most frequent words, in sorted order

def localWords(feed1, feed0):
    import feedparser
    docList = []; classList = []; fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):            # visit one entry from each RSS feed per pass
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    top30Words = calcMostFreq(vocabList, fullText)   # remove the most frequently occurring words
    for pairW in top30Words:
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])
    trainingSet = range(2 * minLen); testSet = []
    for i in range(20):                # randomly pick 20 documents for the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is: ', float(errorCount) / len(testSet)
    return vocabList, p0V, p1V
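The counting step in calcMostFreq can be sketched on its own in a self-contained way (Python 3 here; calc_most_freq and the sample tokens are illustrative stand-ins, not the book's code):

```python
from collections import Counter

def calc_most_freq(vocab_list, full_text, n=30):
    """Return the n most frequent (word, count) pairs in full_text,
    counting only words that appear in vocab_list."""
    vocab = set(vocab_list)
    counts = Counter(tok for tok in full_text if tok in vocab)
    return counts.most_common(n)

tokens = ["the", "dog", "the", "cat", "the", "dog"]
top2 = calc_most_freq(["the", "dog", "cat"], tokens, n=2)
# top2 == [('the', 3), ('dog', 2)]
```

Using a single Counter pass avoids the repeated fullText.count(token) scans of the original, which cost O(|vocabulary| x |text|).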
4.7.2 Analyzing the data: displaying region-related words
# display the most characteristic region-related words
def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []; topSF = []                 # create lists for storing (word, probability) tuples
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0:
            topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key = lambda pair: pair[1], reverse = True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key = lambda pair: pair[1], reverse = True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]
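The -6.0 cutoff operates on the log-probability vectors returned by trainNB0: a word is kept only if its conditional log-probability exceeds -6.0, i.e. its probability exceeds e**-6, roughly 0.25%. The filter-and-sort step can be sketched as follows (top_words and the sample values are illustrative, not from the book):

```python
import math

def top_words(vocab, log_probs, threshold=-6.0):
    # Keep words whose conditional log-probability exceeds the threshold
    # (probability > e**threshold, about 0.0025 for -6.0),
    # ordered from most to least probable.
    pairs = [(w, p) for w, p in zip(vocab, log_probs) if p > threshold]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in pairs]

vocab = ["apartment", "car", "hello"]
log_p = [-5.1, -3.2, -7.8]
selected = top_words(vocab, log_p)      # ['car', 'apartment']
min_prob = round(math.exp(-6.0), 4)     # 0.0025
```

Raising the threshold (e.g. to -5.0) shrinks the lists to only the most probable words for each region.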