Python 2.7
IDE PyCharm 5.0.3
Just warming up for data analysis; I've reached the natural language processing part anyway.

Before we start

This article relies on some background in data cleaning, regex, dictionaries, and lists. Without it, the going may be a bit rough.
What is an N-Gram model?

In natural language there is a model called the n-gram, which denotes a sequence of n consecutive words in text or speech. When analyzing natural language, using n-grams, that is, looking for common phrases, makes it easy to break a sentence down into chunks of text. (From Web Scraping with Python, by Ryan Mitchell.)
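For instance, the 2-grams of a short sentence are just its adjacent word pairs. A minimal sketch (my own illustration, not from the book):

```python
# Break a sentence into 2-grams: every pair of adjacent words
words = "the quick brown fox jumps".split(' ')
bigrams = [" ".join(words[i:i+2]) for i in range(len(words) - 1)]
print bigrams
# ['the quick', 'quick brown', 'brown fox', 'fox jumps']
```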
Simply put, the goal is to find the core topic words. How do we decide which words are the core ones? Generally, the words with the highest repetition rate, the ones mentioned most often, are the ones the text most wants to express, in other words the core words. The example below builds on exactly this idea.

A quick aside

Both of the following appear in the example later, so let's try them out on their own first.
1. string.punctuation holds every punctuation character; pair it with strip()

```python
import string

words = ['a,', 'b!', 'cj!/n']  # renamed from "list", which would shadow the built-in
items = []
for i in words:
    i = i.strip(string.punctuation)
    items.append(i)
print items
```
Output: `['a', 'b', 'cj!/n']`
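Note that strip() only trims characters from the two ends of the string, never from the middle; that is why the '!' and '/' inside 'cj!/n' survive. A quick check (my own example):

```python
import string

print 'cj!/n'.strip(string.punctuation)      # 'cj!/n' -- interior punctuation stays
print '!!hello!!'.strip(string.punctuation)  # 'hello' -- leading/trailing runs are removed
```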
2. operator.itemgetter()

The itemgetter function from the operator module fetches the data at given positions of an object; its arguments are the indices of the items to fetch.

Example

```python
import operator

dict_ = {'name1': '2', 'name2': '1'}
# dict_.items() returns the key-value pairs
print sorted(dict_.items(), key=operator.itemgetter(0), reverse=True)
```
Output: `[('name2', '1'), ('name1', '2')]`
Of course, you can also do this directly with a lambda:

```python
dict_ = {'name1': '2', 'name2': '1'}
print sorted(dict_.iteritems(), key=lambda x: x[1], reverse=True)
```
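Unlike a plain lambda, itemgetter can also take several indices at once, in which case it returns a tuple, giving you a compound sort key. A small sketch (my own example, not part of the original post):

```python
import operator

pairs = [('b', 2), ('a', 2), ('c', 1)]
# itemgetter(1) pulls the count; ties keep their original order (sorted() is stable)
print sorted(pairs, key=operator.itemgetter(1), reverse=True)
# itemgetter(1, 0) yields a (count, word) tuple, so ties are broken by the word
print sorted(pairs, key=operator.itemgetter(1, 0))
# [('c', 1), ('a', 2), ('b', 2)]
```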
2-gram
Let's work with two-word sequences. Here is an example, annotated as a reference:

```python
import urllib2
import re
import string
import operator


def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation markers like [1]
    input = re.sub(' +', " ", input)           # collapse consecutive spaces into one
    input = bytes(input)                       # in Python 2, bytes is an alias of str, so this is effectively a no-op
    #input = input.decode("ascii", "ignore")
    return input


def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')  # split on spaces, returning a list
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation characters
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the single-letter words "a" and "i"
            cleanInput.append(item)
    return cleanInput


def getNgrams(input, n):
    input = cleanInput(input)
    output = {}  # build a dict for the counts
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:  # word-frequency counting
            output[ngramTemp] = 0    # the classic dict idiom
        output[ngramTemp] += 1
    return output


# Option 1: read the page straight from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file -- handy for testing, since it needs no network
#content = open("1.txt").read()

ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending
print(sortedNGrams)
```
Output (truncated):

```
[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ...]
```
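As an aside, the counting dict in getNgrams can be written more compactly with collections.Counter from the standard library. A minimal sketch, assuming the cleanInput() defined above (my own variant, not the book's code):

```python
from collections import Counter

def getNgramsCounter(input, n):
    words = cleanInput(input)  # reuses cleanInput() from the listing above
    return Counter(" ".join(words[i:i+n]) for i in range(len(words) - n + 1))

# Counter.most_common(k) replaces the explicit sorted(..., key=itemgetter(1), reverse=True)
#print getNgramsCounter(content, 2).most_common(10)
```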
The example above ranks the two-word sequences by frequency, but that is not quite what we want. What use is an 'of the' that shows up two hundred-odd times? So the next step is to weed out those connectives and prepositions.

Deeper

```python
# -*- coding: utf-8 -*-
import urllib2
import re
import string
import operator


def isCommon(ngram):
    """Return True if the word is a common (stop) word."""
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i",
                   "that", "for", "you", "he", "with", "on", "do", "say", "this",
                   "they", "is", "an", "at", "but", "we", "his", "from", "that",
                   "not", "by", "she", "or", "as", "what", "go", "their", "can",
                   "who", "get", "if", "would", "her", "all", "my", "make",
                   "about", "know", "will", "as", "up", "one", "time", "has",
                   "been", "there", "year", "so", "think", "when", "which",
                   "them", "some", "me", "people", "take", "out", "into", "just",
                   "see", "him", "your", "come", "could", "now", "than", "like",
                   "other", "how", "then", "its", "our", "two", "more", "these",
                   "want", "way", "look", "first", "also", "new", "because",
                   "day", "more", "use", "no", "man", "find", "here", "thing",
                   "give", "many", "well"]
    if ngram in commonWords:
        return True
    else:
        return False


def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # remove citation markers like [1]
    input = re.sub(' +', " ", input)           # collapse consecutive spaces into one
    input = bytes(input)                       # effectively a no-op in Python 2 (bytes is str)
    #input = input.decode("ascii", "ignore")
    return input


def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')  # split on spaces, returning a list
    for item in input:
        item = item.strip(string.punctuation)  # string.punctuation holds all punctuation characters
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the single-letter words "a" and "i"
            cleanInput.append(item)
    return cleanInput


def getNgrams(input, n):
    input = cleanInput(input)
    output = {}  # build a dict for the counts
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        # skip any 2-gram whose first or second word is a stop word
        if isCommon(ngramTemp.split()[0]) or isCommon(ngramTemp.split()[1]):
            pass
        else:
            if ngramTemp not in output:  # word-frequency counting
                output[ngramTemp] = 0    # the classic dict idiom
            output[ngramTemp] += 1
    return output


def getFirstSentenceContaining(ngram, content):
    """Return the first sentence that contains the given core phrase."""
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""


# Option 1: read the page straight from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file -- handy for testing, since it needs no network
#content = open("1.txt").read()

ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending
print(sortedNGrams)
for top3 in range(3):
    print "###" + getFirstSentenceContaining(sortedNGrams[top3][0], content.lower()) + "###"
```
Output (truncated):

```
[('united states', 10), ('general government', 4), ('executive department', 4), ('legisltive bojefferson', 3), ('same causes', 3), ('called upon', 3), ('chief magistrate', 3), ('whole country', 3), ('government should', 3), ...]
### the constitution of the united states is the instrument containing this grant of power to the several departments composing the government###
### the general government has seized upon none of the reserved rights of the states###
### such a one was afforded by the executive department constituted by the constitution###
```
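One design note on the code above: isCommon() scans a hundred-odd-element list on every call, twice per 2-gram. A Python set gives constant-time membership tests instead. A sketch of the same filter (my own restructuring, not from the book):

```python
# Same stop-word filter, but backed by a set for O(1) lookups
COMMON_WORDS = set(["the", "be", "and", "of", "a", "in", "to", "have", "it",
                    "i", "that", "for", "you", "he", "with", "on"])  # ...extend with the full list above

def isCommonFast(word):
    return word in COMMON_WORDS

# drop-in usage inside getNgrams:
# first, second = ngramTemp.split()
# if isCommonFast(first) or isCommonFast(second): skip this n-gram
```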
As the example shows, we filtered out the connectives, ranked the remaining core words, and then fished out the sentences that contain them. Here I only grabbed the first three; condensing an article of two or three hundred sentences into three or four still strikes me as rather magical.

BUT

The approach above only works for texts with a sharply focused theme, such as speeches and meeting records. For fiction it falls apart completely: I tried it on several English novels, and the "summaries" it produced were gibberish.

Finally
The material comes from Chapter 8 of Web Scraping with Python, but the book's code is written for Python 3.x and some of the samples would not run as printed, so I tidied them up and modified a few snippets before I could reproduce the book's results.

Acknowledgements
Web Scraping with Python, Ryan Mitchell (Posts & Telecom Press)
Introduction to Python's strip() function
Python's sorted() function and operator.itemgetter