Python 2.7
IDE PyCharm 5.0.3
Just warming up for data analysis; I've reached the natural language processing part anyway.

Before we start

This article relies on some background in data cleaning, regex, dictionaries, and lists. Without it, the going may be a bit rough.
What is an N-Gram model?

In natural language there is a model called the n-gram, which denotes a sequence of n consecutive words in text or speech. When analyzing natural language, using n-grams, that is, looking for common phrases, makes it easy to break a sentence down into chunks of text. (From Web Scraping with Python, by Ryan Mitchell.)
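For instance, the 2-grams of a short sentence are just its adjacent word pairs. A minimal sketch (my own illustration, not from the book):

```python
# Break a sentence into 2-grams: every pair of adjacent words
words = "the quick brown fox jumps".split(' ')
bigrams = [" ".join(words[i:i+2]) for i in range(len(words) - 1)]
print bigrams
# ['the quick', 'quick brown', 'brown fox', 'fox jumps']
```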
Simply put, the goal is to find the core topic words. How do we decide which words are the core ones? Generally, the words with the highest repetition rate, the ones mentioned most often, are the ones the text most wants to express, in other words the core words. The example below builds on exactly this idea.

A quick aside

Both of the following appear in the example later, so let's try them out on their own first.
1. string.punctuation holds every punctuation character; pair it with strip()

```python
import string

words = ['a,', 'b!', 'cj!/n']  # renamed from "list", which would shadow the built-in
items = []
for i in words:
    i = i.strip(string.punctuation)
    items.append(i)
print items
```
Output: `['a', 'b', 'cj!/n']`
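Note that strip() only trims characters from the two ends of the string, never from the middle; that is why the '!' and '/' inside 'cj!/n' survive. A quick check (my own example):

```python
import string

print 'cj!/n'.strip(string.punctuation)      # 'cj!/n' -- interior punctuation stays
print '!!hello!!'.strip(string.punctuation)  # 'hello' -- leading/trailing runs are removed
```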
2. operator.itemgetter()

The itemgetter function from the operator module fetches the data at given positions of an object; its arguments are the indices of the items to fetch.

Example

```python
import operator

dict_ = {'name1': '2', 'name2': '1'}
# dict_.items() returns the key-value pairs
print sorted(dict_.items(), key=operator.itemgetter(0), reverse=True)
```
Output: `[('name2', '1'), ('name1', '2')]`
Of course, you can also do this directly with a lambda:

```python
dict_ = {'name1': '2', 'name2': '1'}
print sorted(dict_.iteritems(), key=lambda x: x[1], reverse=True)
```
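Unlike a plain lambda, itemgetter can also take several indices at once, in which case it returns a tuple, giving you a compound sort key. A small sketch (my own example, not part of the original post):

```python
import operator

pairs = [('b', 2), ('a', 2), ('c', 1)]
# itemgetter(1) pulls the count; ties keep their original order (sorted() is stable)
print sorted(pairs, key=operator.itemgetter(1), reverse=True)
# itemgetter(1, 0) yields a (count, word) tuple, so ties are broken by the word
print sorted(pairs, key=operator.itemgetter(1, 0))
# [('c', 1), ('a', 2), ('b', 2)]
```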
2-gram
Let's work with two-word sequences. Here is an example, annotated as a reference:

```python
import urllib2
import re
import string
import operator


def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation markers like [1]
    input = re.sub(' +', " ", input)           # collapse consecutive spaces into one
    input = bytes(input)                       # in Python 2, bytes is an alias of str, so this is effectively a no-op
    #input = input.decode("ascii", "ignore")
    return input


def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')  # split on spaces, returning a list
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation characters
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the single-letter words "a" and "i"
            cleanInput.append(item)
    return cleanInput


def getNgrams(input, n):
    input = cleanInput(input)
    output = {}  # build a dict for the counts
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:  # word-frequency counting
            output[ngramTemp] = 0    # the classic dict idiom
        output[ngramTemp] += 1
    return output


# Option 1: read the page straight from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file -- handy for testing, since it needs no network
#content = open("1.txt").read()

ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending
print(sortedNGrams)
```
Output (truncated):

```
[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ...]
```
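As an aside, the counting dict in getNgrams can be written more compactly with collections.Counter from the standard library. A minimal sketch, assuming the cleanInput() defined above (my own variant, not the book's code):

```python
from collections import Counter

def getNgramsCounter(input, n):
    words = cleanInput(input)  # reuses cleanInput() from the listing above
    return Counter(" ".join(words[i:i+n]) for i in range(len(words) - n + 1))

# Counter.most_common(k) replaces the explicit sorted(..., key=itemgetter(1), reverse=True)
#print getNgramsCounter(content, 2).most_common(10)
```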
The example above ranks the two-word sequences by frequency, but that is not quite what we want. What use is an 'of the' that shows up two hundred-odd times? So the next step is to weed out those connectives and prepositions.

Deeper

```python
# -*- coding: utf-8 -*-
import urllib2
import re
import string
import operator


def isCommon(ngram):
    """Return True if the word is a common (stop) word."""
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i",
                   "that", "for", "you", "he", "with", "on", "do", "say", "this",
                   "they", "is", "an", "at", "but", "we", "his", "from", "that",
                   "not", "by", "she", "or", "as", "what", "go", "their", "can",
                   "who", "get", "if", "would", "her", "all", "my", "make",
                   "about", "know", "will", "as", "up", "one", "time", "has",
                   "been", "there", "year", "so", "think", "when", "which",
                   "them", "some", "me", "people", "take", "out", "into", "just",
                   "see", "him", "your", "come", "could", "now", "than", "like",
                   "other", "how", "then", "its", "our", "two", "more", "these",
                   "want", "way", "look", "first", "also", "new", "because",
                   "day", "more", "use", "no", "man", "find", "here", "thing",
                   "give", "many", "well"]
    if ngram in commonWords:
        return True
    else:
        return False


def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation markers like [1]
    input = re.sub(' +', " ", input)           # collapse consecutive spaces into one
    input = bytes(input)                       # effectively a no-op in Python 2 (bytes is str)
    #input = input.decode("ascii", "ignore")
    return input


def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')  # split on spaces, returning a list
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation characters
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the single-letter words "a" and "i"
            cleanInput.append(item)
    return cleanInput


def getNgrams(input, n):
    input = cleanInput(input)
    output = {}  # build a dict for the counts
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        # skip any 2-gram whose first or second word is a stop word
        if isCommon(ngramTemp.split()[0]) or isCommon(ngramTemp.split()[1]):
            pass
        else:
            if ngramTemp not in output:  # word-frequency counting
                output[ngramTemp] = 0    # the classic dict idiom
            output[ngramTemp] += 1
    return output


def getFirstSentenceContaining(ngram, content):
    """Return the first sentence that contains the given core phrase."""
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""


# Option 1: read the page straight from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file -- handy for testing, since it needs no network
#content = open("1.txt").read()

ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending
print(sortedNGrams)
for top3 in range(3):
    print "###" + getFirstSentenceContaining(sortedNGrams[top3][0], content.lower()) + "###"
```
Output (truncated):

```
[('united states', 10), ('general government', 4), ('executive department', 4), ('legisltive bojefferson', 3), ('same causes', 3), ('called upon', 3), ('chief magistrate', 3), ('whole country', 3), ('government should', 3), ...]
### the constitution of the united states is the instrument containing this grant of power to the several departments composing the government###
### the general government has seized upon none of the reserved rights of the states###
### such a one was afforded by the executive department constituted by the constitution###
```
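One design note on the code above: isCommon() scans a hundred-odd-element list on every call, twice per 2-gram. A Python set gives constant-time membership tests instead. A sketch of the same filter (my own restructuring, not from the book):

```python
# Same stop-word filter, but backed by a set for O(1) lookups
COMMON_WORDS = set(["the", "be", "and", "of", "a", "in", "to", "have", "it",
                    "i", "that", "for", "you", "he", "with", "on"])  # ...extend with the full list above

def isCommonFast(word):
    return word in COMMON_WORDS

# drop-in usage inside getNgrams:
# first, second = ngramTemp.split()
# if isCommonFast(first) or isCommonFast(second): skip this n-gram
```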
As the example shows, we filtered out the connectives, ranked the remaining core words, and then fished out the sentences that contain them. Here I only grabbed the first three; condensing an article of two or three hundred sentences into three or four still strikes me as rather magical.

BUT

The approach above only works for texts with a sharply focused theme, such as speeches and meeting records. For fiction it falls apart completely: I tried it on several English novels, and the "summaries" it produced were gibberish.

Finally
The material comes from Chapter 8 of Web Scraping with Python, but the book's code is written for Python 3.x and some of the samples would not run as printed, so I tidied them up and modified a few snippets before I could reproduce the book's results.

Acknowledgements
Web Scraping with Python, Ryan Mitchell (Posts & Telecom Press)
Introduction to Python's strip() function
Python's sorted() function and operator.itemgetter