Using the N-gram model to summarize data (Python)


Python 2.7
IDE: PyCharm 5.0.3

Data analysis is hot these days, and anyway I have worked my way around to the natural language processing part of it.

A few words up front
The background knowledge used in this article includes data cleaning, regular expressions, dictionaries, and lists; without some familiarity with those, it may be a bit heavy going.

What is the N-gram model?

In natural language processing there is a model called the N-gram, which represents a sequence of n consecutive words (or characters) in a text. In natural language analysis, N-grams make it easy to break a sentence into chunks of text and to look for common phrases. Excerpted from Web Scraping with Python [Ryan Mitchell]
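To make the definition concrete, here is a minimal sketch (my own illustration, not from the book) of slicing a sentence into 2-grams with the sliding-window slice words[i:i+n], the same trick the full program below relies on:

words = "the quick brown fox jumps over the lazy dog".split(' ')
n = 2
print [" ".join(words[i:i+n]) for i in range(len(words) - n + 1)]
# ['the quick', 'quick brown', 'brown fox', 'fox jumps', 'jumps over', 'over the', 'the lazy', 'lazy dog']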

In short, the task is to find the core keywords. And how do you decide what counts as a core keyword? In general, the word repeated most often, the one mentioned most frequently, is the most important word. Before the full example, two quick supplementary pieces.

Both appear in the example later on, so let's take them out here and try each on its own first.

1. string.punctuation gets all the punctuation marks and is used together with strip()

import string

words = ['a,', 'b!', 'cj!\n']
items = []
for i in words:
    i = i.strip(string.punctuation)  # strip punctuation from both ends only
    items.append(i)
print items
# ['a', 'b', 'cj!\n']  -- the trailing '\n' is not punctuation, so the '!' in front of it survives

2. operator.itemgetter()
The itemgetter function provided by the operator module fetches an item from an object by position, i.e. you give it the ordinal number of the piece of data you want.

An example:

import operator

dict_ = {'name1': '2',
         'name2': '1'}

print sorted(dict_.items(), key=operator.itemgetter(0), reverse=True)
# dict_.items() returns the key-value pairs
# [('name2', '1'), ('name1', '2')]

Of course, you can also get there directly with a lambda:

dict_ = {'name1': '2',
         'name2': '1'}
print sorted(dict_.iteritems(), key=lambda x: x[1], reverse=True)
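For the record, operator.itemgetter(1) does the same job as the lambda here, sorting by value instead of by key; this is exactly the form the n-gram programs below use:

print sorted(dict_.iteritems(), key=operator.itemgetter(1), reverse=True)
# [('name1', '2'), ('name2', '1')]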
2-gram

Moving on to pairs of adjacent words: here is the first full example, with comments explaining each step.

import urllib2
import re
import string
import operator

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # match newlines, replace them with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation marks such as [1]
    input = re.sub(' +', " ", input)           # replace runs of spaces with a single space
    input = bytes(input)                       # convert the content to eliminate escape characters
    #input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')                   # split on spaces, returns a list
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation gets all punctuation marks
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the single-letter words "i" and "a"
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}                                # build a dictionary for the counts
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:            # word-frequency counting
            output[ngramTemp] = 0              # the typical dictionary idiom
        output[ngramTemp] += 1
    return output

# Method 1: read the content straight from the page
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Method 2: read a local file instead -- saves time when there is no network
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending order
print sortedNGrams
[('of the', 213), ('in the', ...), ('to the', ...), ('by the', ...), ('the constitution', ...), ... and a long tail of pairs]

What this example does is count the frequency of every pair of adjacent words and sort the pairs. But that is not what we want: what earthly use is 'of the' appearing more than 200 times? So we have to weed out those connectives and prepositions. Going deeper:

# -*- coding: utf-8 -*-
import urllib2
import re
import string
import operator

# weed out the most common English words
def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i",
                   "that", "for", "you", "he", "with", "on", "do", "say", "this",
                   "they", "is", "an", "at", "but", "we", "his", "from", "that",
                   "not", "by", "she", "or", "as", "what", "go", "their", "can",
                   "who", "get", "if", "would", "her", "all", "my", "make", "about",
                   "know", "will", "as", "up", "one", "time", "has", "been",
                   "there", "year", "so", "think", "when", "which", "them", "some",
                   "me", "people", "take", "out", "into", "just", "see", "him",
                   "your", "come", "could", "now", "than", "like", "other", "how",
                   "then", "its", "our", "two", "more", "these", "want", "way",
                   "look", "first", "also", "new", "because", "day", "more", "use",
                   "no", "man", "find", "here", "thing", "give", "many", "well"]
    if ngram in commonWords:
        return True
    else:
        return False

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # match newlines, replace them with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation marks such as [1]
    input = re.sub(' +', " ", input)           # replace runs of spaces with a single space
    input = bytes(input)                       # convert the content to eliminate escape characters
    #input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')                   # split on spaces, returns a list
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation gets all punctuation marks
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the single-letter words "i" and "a"
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}                                # build a dictionary for the counts
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        if isCommon(ngramTemp.split()[0]) or isCommon(ngramTemp.split()[1]):
            pass                               # skip 2-grams that start or end with a common word
        else:
            if ngramTemp not in output:        # word-frequency counting
                output[ngramTemp] = 0          # the typical dictionary idiom
            output[ngramTemp] += 1
    return output

# find the first sentence that contains a given core 2-gram
def getFirstSentenceContaining(ngram, content):
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

# Method 1: read the content straight from the page
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Method 2: read a local file instead -- saves time when there is no network
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending order
print sortedNGrams
for top3 in range(3):
    print "###" + getFirstSentenceContaining(sortedNGrams[top3][0], content.lower()) + "###"
[('united states', 10), ('general government', 4), ('executive department', 4), ('legisltive body', 3), ('same causes', 3), ('called upon', 3), ('chief magistrate', 3), ('whole country', 3), ('government should', 3), ... and a long tail]
### the constitution of the united states is the instrument containing this grant of power to the several departments composing the government###
### the government has seized upon none of the reserved rights of the states###
### a one is afforded by the executive department constituted by the constitution###
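Incidentally, nothing above is hard-wired to pairs: getNgrams() takes n as a parameter, so the same code can pull out longer phrases. A quick sketch (note that the isCommon() filter above only inspects the first two words of each n-gram, so for n = 3 you may want to extend it):

ngrams3 = getNgrams(content, 3)
print sorted(ngrams3.items(), key=operator.itemgetter(1), reverse=True)[:10]  # ten most frequent 3-grams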

From the example above you can see the pipeline: clean the text, weed out the useless connective words, sort the core words by frequency, and then pull out the sentences containing those core words. Here I grabbed only the first three such sentences, and for an article of two or three hundred sentences, summing it up in three or four of them strikes me as rather magical. BUT

The method above is limited to texts with a very clear purpose, such as speeches and meeting records. For novels it is frankly a disaster; I tried several English novels and what it "summarized" was... well, something. Finally

The material comes from chapter 8 of Web Scraping with Python, but the book's code is for Python 3.x and some of the examples would not run as printed, so I tidied them up and modified a few snippets until I reproduced the results from the book. Thanks

Web Scraping with Python [Ryan Mitchell] [Posts & Telecom Press]
Notes on Python's strip() function
Python's sorted() function and operator.itemgetter
