Python 2.7
IDE Pycharm 5.0.3
A warm-up for data analysis; either way, it also counts as a look at the natural language processing side of things.
Preface
This article uses data cleaning, regular expressions, dictionaries, lists, and so on. Without that background, it might be a bit of a struggle.
What is the N-gram model?
In natural language processing there is a model called the N-gram, which represents n consecutive words in a text. In natural language analysis, it is easy to break a sentence into chunks of text by using N-grams, or to look for common phrases. -- excerpted from Web Scraping with Python [Ryan Mitchell]
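To make the definition concrete, here is a minimal sketch of slicing a word list into 2-grams; the function and the sample sentence are my own toy example, written with print() so it runs on both Python 2 and 3.

```python
def ngrams(words, n):
    # slide a window of size n across the word list,
    # collecting every run of n consecutive words
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = "the quick brown fox jumps".split()
print(ngrams(words, 2))
# [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox'], ['fox', 'jumps']]
```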
In short, the goal is to find the core keywords. How do we decide which words are core? In general, the words repeated most often -- the ones mentioned most frequently -- are the most important ones, i.e. the core words. Before the example below, a quick supplement on two helpers.
Both appear in the example later; let's pull them out here and try them on their own first.
1. string.punctuation gets all punctuation marks; use it together with strip()
import string
list_ = ['A,', 'b!', 'cj!/n']
item = []
for i in list_:
    i = i.strip(string.punctuation)  # strip leading/trailing punctuation
    item.append(i)
print item
['A', 'b', 'cj!/n']
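One thing worth stressing: strip() only removes characters from the two ends of the string, never from the middle. That is why 'cj!/n' above comes through unchanged -- its last character is the letter n, so the stripping stops immediately. A quick check with my own toy inputs:

```python
import string

print('b!'.strip(string.punctuation))     # 'b'      -- trailing '!' removed
print('cj!/n'.strip(string.punctuation))  # 'cj!/n'  -- both ends are letters, nothing stripped
print('!!a!!'.strip(string.punctuation))  # 'a'      -- punctuation removed from both ends
```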
2. operator.itemgetter()
The itemgetter function provided by the operator module fetches an item of an object by ordinal position (that is, the index of the data to be fetched within the object).
Example:
import operator
dict_ = {'name1': '2',
         'name2': '1'}
print sorted(dict_.items(), key=operator.itemgetter(0), reverse=True)
# dict_.items() returns the key-value pairs
[('name2', '1'), ('name1', '2')]
Of course, you can also write it like this:
dict_ = {'name1': '2',
         'name2': '1'}
print sorted(dict_.iteritems(), key=lambda x: x[1], reverse=True)
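Sorting by value instead of by key is the variant we actually need later, for frequency counts. A minimal sketch (using items(), which works on both Python 2 and 3):

```python
import operator

dict_ = {'name1': '2', 'name2': '1'}
# itemgetter(1) sorts by the value (index 1 of each pair) instead of the key
print(sorted(dict_.items(), key=operator.itemgetter(1), reverse=True))
# [('name1', '2'), ('name2', '1')]
```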
2-gram
That is, pairs of adjacent words. Explanations for this last example are in the comments.
import urllib2
import re
import string
import operator

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation marks such as [1]
    input = re.sub(' +', " ", input)           # collapse runs of spaces into a single space
    input = bytes(input)                       # normalize the content, dropping escape characters
    #input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')                   # split on spaces, returning a list
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation marks
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the one-letter words "a" and "i"
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}                                # build a dictionary of counts
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:            # word-frequency counting
            output[ngramTemp] = 0              # the classic dictionary idiom
        output[ngramTemp] += 1
    return output

# Method 1: read the content straight from the page
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Method 2: read a local file -- faster for testing, no network needed
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNgrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending
print(sortedNgrams)
[('of the', 213), ('in the', ...), ('to the', ...), ('by the', ...), ('the constitution', ...), ... and a long tail
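The same counting idiom can be tried offline on a tiny in-memory string, without urllib2. The helper name and sample text below are my own, for illustration only:

```python
import re
import string

def get_ngrams(text, n):
    # same idea as getNgrams above: clean, split, then count a sliding window
    text = re.sub('\n+', ' ', text).lower()
    words = [w.strip(string.punctuation) for w in text.split(' ')]
    words = [w for w in words if len(w) > 1 or w in ('a', 'i')]
    output = {}
    for i in range(len(words) - n + 1):
        gram = ' '.join(words[i:i + n])
        output[gram] = output.get(gram, 0) + 1  # dictionary frequency count
    return output

counts = get_ngrams("the people and the government and the law", 2)
print(counts['and the'])  # 2
```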
This example counts pairs of adjacent words and sorts them by frequency. But that is not what we want: what use is a phrase like "of the" appearing more than 200 times? So we have to weed out these connectives and prepositions.
Going deeper
# -*- coding: utf-8 -*-
import urllib2
import re
import string
import operator

# filter out common words
def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i",
                   "that", "for", "you", "he", "with", "on", "do", "say", "this",
                   "they", "is", "an", "at", "but", "we", "his", "from", "that",
                   "not", "by", "she", "or", "as", "what", "go", "their", "can",
                   "who", "get", "if", "would", "her", "all", "my", "make",
                   "about", "know", "will", "as", "up", "one", "time", "has",
                   "been", "there", "year", "so", "think", "when", "which",
                   "them", "some", "me", "people", "take", "out", "into",
                   "just", "see", "him", "your", "come", "could", "now",
                   "than", "like", "other", "how", "then", "its", "our",
                   "two", "more", "these", "want", "way", "look", "first",
                   "also", "new", "because", "day", "more", "use", "no",
                   "man", "find", "here", "thing", "give", "many", "well"]
    if ngram in commonWords:
        return True
    else:
        return False

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation marks such as [1]
    input = re.sub(' +', " ", input)           # collapse runs of spaces into a single space
    input = bytes(input)                       # normalize the content, dropping escape characters
    #input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')                   # split on spaces, returning a list
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation marks
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the one-letter words "a" and "i"
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}                                # build a dictionary of counts
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        if isCommon(ngramTemp.split()[0]) or isCommon(ngramTemp.split()[1]):
            pass                               # skip pairs containing a common word
        else:
            if ngramTemp not in output:        # word-frequency counting
                output[ngramTemp] = 0          # the classic dictionary idiom
            output[ngramTemp] += 1
    return output

# find the first sentence containing a core word
def getFirstSentenceContaining(ngram, content):
    #print(ngram)
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

# Method 1: read the content straight from the page
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Method 2: read a local file -- faster for testing, no network needed
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNgrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending
print(sortedNgrams)
for top3 in range(3):
    print "###" + getFirstSentenceContaining(sortedNgrams[top3][0], content.lower()) + "###"
[('united states', 10), ('general government', 4), ('executive department', 4), ('legislative body', 3), ('same causes', 3), ('called upon', 3), ('chief magistrate', 3), ('whole country', 3), ('government should', 3), ... and a long tail
### the constitution of the united states is the instrument containing this grant of power to the several departments composing the government###
### the general government has seized upon none of the reserved rights of the states###
### a one is afforded by the executive department constituted by the constitution###
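The sentence-lookup step can also be tried on its own, offline. The helper name and sample text below are my own toy stand-ins, mirroring the naive split-on-period approach of getFirstSentenceContaining:

```python
def get_first_sentence_containing(ngram, content):
    # split naively on '.' and return the first piece containing the n-gram
    for sentence in content.split('.'):
        if ngram in sentence:
            return sentence
    return ''

text = "taxes rose sharply. the united states grew. trade slowed."
print(get_first_sentence_containing('united states', text))
# prints ' the united states grew' (note the leading space left by the split)
```

One caveat of this approach: splitting on '.' also breaks on abbreviations like "Mr.", so for serious use a real sentence tokenizer would be better.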
From the example above you can see what we did: remove the filler words, sort out the core words, and then pull out the sentences containing those core words. Here I only grabbed the first three sentences; for an article of two or three hundred sentences, summing it up in three or four sentences strikes me as rather magical. BUT the method above only works for texts with a very clear purpose -- speeches, meeting minutes, and the like. For novels it fails miserably: I tried it on several English novels, and the "summaries" were simply incomprehensible.
Finally
The material comes from Chapter 8 of Web Scraping with Python, but the code there targets Python 3.x and some of the examples would not run as-is, so I tidied things up and modified a few snippets until I reproduced the book's results. Thanks:
Web Scraping with Python [Ryan Mitchell] [People's Posts and Telecommunications Press]
The Python strip() function explained
Python's sorted function and operator.itemgetter