Natural Language Processing 3.6: Normalizing Text
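In the earlier examples, we often converted text to lowercase before doing anything with its words, i.e. (w.lower() for w in words). Using lower() normalizes the text to lowercase, so that the distinction between "The" and "the" is ignored.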
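As a small illustration of case normalization, the sketch below uses a made-up word list (an assumption, not data from this section) to show how lowercasing merges case variants into a single vocabulary entry.

```python
# Hypothetical sample tokens, chosen only for illustration.
words = ["The", "cat", "saw", "the", "other", "Cat"]

# Vocabulary without normalization: case variants count as distinct words.
raw_vocab = set(words)

# Vocabulary after lowercasing: case variants collapse into one entry.
normalized_vocab = set(w.lower() for w in words)

print(len(raw_vocab), len(normalized_vocab))
```

The normalized vocabulary is smaller because "The"/"the" and "cat"/"Cat" are no longer counted separately.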
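Often we want to go further than this, for example by stripping all affixes from words, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. First, we define the data used in this section.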
>>> import nltk
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government.  Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
1. Stemmers
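NLTK includes several off-the-shelf stemmers. If you need a stemmer, you should use one of these in preference to crafting your own with regular expressions, because NLTK's stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers each follow their own rules for stripping affixes. The example below shows that the Porter stemmer correctly handles lying (mapping it to lie), whereas the Lancaster stemmer does not.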
>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> print([porter.stem(t) for t in tokens])
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> print([lancaster.stem(t) for t in tokens])
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']
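To see why a hand-written regular-expression stemmer tends to fall short of the stemmers above, consider the naive sketch below. The function name naive_stem and its suffix list are illustrative assumptions, not part of NLTK, and the approach is simple suffix stripping rather than the Porter or Lancaster algorithm.

```python
import re

def naive_stem(word):
    """Strip one common English suffix; a toy stand-in, not NLTK's algorithm."""
    # The non-greedy stem group makes the regex prefer the longest suffix
    # that reaches the end of the word.
    match = re.match(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', word)
    return match.group(1)

for w in ["ponds", "distributing", "lying", "women", "is", "basis"]:
    print(w, "->", naive_stem(w))
```

Regular plurals such as ponds come out fine, but lying is reduced to ly and is to i, which is the kind of irregular case a purpose-built stemmer is designed to handle.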
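Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application at hand. The Porter stemmer is a good choice if you are indexing text and want searches to match alternative forms of a word, as the following class illustrates.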
import nltk

class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # Map each stemmed, lowercased word to the positions where it occurs.
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            # Clamp the start at 0 so a negative index cannot empty the slice.
            lcontext = ' '.join(self._text[max(0, i-wc):i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
>>> porter = nltk.PorterStemmer()
>>> grail = nltk.corpus.webtext.words('grail.txt')
>>> text = IndexedText(porter, grail)
>>> text.concordance('lie')
r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t
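The data structure behind this kind of search is a mapping from each stem to the positions where it occurs. The sketch below builds such a mapping with the standard library only. The names toy_stem and build_index are hypothetical, toy_stem is a deliberately crude stand-in for a real stemmer, and the sample tokens are a short made-up excerpt.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, drop a final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(tokens, stem):
    """Map each stem to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[stem(token)].append(position)
    return index

tokens = "All lies ! You may lie here . Lie down .".split()
index = build_index(tokens, toy_stem)
print(index["lie"])
```

Because lies, lie and Lie all share the key "lie", a single lookup returns every position, which is what lets a concordance search match different word forms.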
2. Lemmatization
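The WordNet lemmatizer removes an affix only if the resulting word is in its dictionary. This additional check makes the lemmatizer slower than the stemmers above: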
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']
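The dictionary check described above can be sketched in plain Python. Everything in this block is an assumption made for illustration: the tiny LEXICON, the EXCEPTIONS table, the two suffix rules, and the name toy_lemmatize are hypothetical and only loosely modeled on the idea, not on WordNet's actual implementation.

```python
# Hypothetical miniature dictionary; WordNet's real lexicon is far larger.
LEXICON = {"woman", "pond", "sword", "basis", "mass", "government"}

# Hypothetical irregular forms, looked up before any suffix rule is tried.
EXCEPTIONS = {"women": "woman"}

# (suffix, replacement) rules, tried in order.
SUFFIX_RULES = [("ses", "s"), ("s", "")]

def toy_lemmatize(word):
    """Return a dictionary form of word if one is found, else word unchanged."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            # Accept the stripped form only if it is a known word.
            if candidate in LEXICON:
                return candidate
    return word

print([toy_lemmatize(w) for w in ["women", "ponds", "basis", "masses", "lying"]])
```

Unlike a stemmer, this leaves basis alone, because stripping the final s would give basi, which is not in the dictionary. Words with no matching rule, such as lying, are returned unchanged.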
The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts, or if you want a list of valid lemmas (lexicon headwords).