Natural Language Processing 3.6: Normalizing Text
In the previous examples, text was often converted to lowercase before being processed, for example (w.lower() for w in words). Using lower() normalizes the text to lowercase so that the distinction between "the" and "The" is ignored.
We often go further than this, for example by stripping the affixes from words, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. First, define the data used in this section.
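As a minimal sketch of this normalization, assuming a hypothetical token list named words (the sample tokens are not from the original text), lowercasing before building the vocabulary merges case variants into one word type:

```python
# Hypothetical sample tokens, chosen only for illustration.
words = ["The", "cat", "saw", "the", "Cat"]

# Without normalization, "The" and "the" count as different word types.
raw_vocab = set(words)

# Lowercasing first merges case variants into a single type.
normalized_vocab = set(w.lower() for w in words)

print(len(raw_vocab))            # 5
print(sorted(normalized_vocab))  # ['cat', 'saw', 'the']
```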
>>> import nltk
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
1. Stemmers
NLTK includes several ready-made stemmers. If you need a stemmer, use one of these in preference to writing your own with regular expressions, because the NLTK stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers each strip affixes according to their own rules. The following example shows that the Porter stemmer correctly handles lying (mapping it to lie), while the Lancaster stemmer does not.
>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> print([porter.stem(t) for t in tokens])
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> print([lancaster.stem(t) for t in tokens])
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']
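To see why a hand-written regular-expression stemmer falls short, here is a minimal sketch. The function name naive_stem and its suffix list are assumptions made for illustration; it simply strips one common suffix and has no rules for irregular forms.

```python
import re

def naive_stem(word):
    """Strip one common English suffix with a regular expression.

    A deliberately simple, hypothetical stemmer: it has no rules for
    irregular forms, which is the gap the NLTK stemmers address.
    """
    match = re.match(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', word)
    return match.group(1)

for w in ['lying', 'women', 'basis', 'ponds']:
    print(w, '->', naive_stem(w))
# lying -> ly
# women -> women
# basis -> basi
# ponds -> pond
```

Regular plurals such as ponds come out fine, but lying is cut down to ly rather than lie.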
Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application at hand. The Porter stemmer is a good choice if you are indexing text and want searches to match alternative forms of a word. The following class builds such an index:
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
>>> porter = nltk.PorterStemmer()
>>> grail = nltk.corpus.webtext.words('grail.txt')
>>> text = IndexedText(porter, grail)
>>> text.concordance('lie')
r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t
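The core of IndexedText is a mapping from each stem to the positions where it occurs, which is why a search for lie also finds lies and lying. A minimal sketch of that idea using only the standard library follows. The names build_stem_index and toy_stem are hypothetical, and toy_stem is only a stand-in: in practice you would pass a real stemmer's stem method, such as porter.stem.

```python
from collections import defaultdict

def build_stem_index(tokens, stem):
    """Map each lowercased stem to the positions where it occurs."""
    index = defaultdict(list)
    for i, token in enumerate(tokens):
        index[stem(token).lower()].append(i)
    return index

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer such as porter.stem:
    # it only strips a trailing "s" from words longer than three letters.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

tokens = ['The', 'cave', 'lies', 'north', ';', 'men', 'lie', 'here']
index = build_stem_index(tokens, toy_stem)
print(index['lie'])  # [2, 6]
```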
2. Lemmatization
The WordNet lemmatizer removes an affix only if the resulting word is in its dictionary. This additional check makes the lemmatizer slower than the stemmers above:
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']
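The dictionary check can be illustrated with a toy example. This is not how the WordNet lemmatizer is implemented; LEXICON, SUFFIXES, and toy_lemmatize are hypothetical names, and the tiny word set stands in for WordNet. The sketch only shows the principle that a suffix is stripped when, and only when, the shorter form is a known word.

```python
# Hypothetical miniature dictionary; a real lemmatizer consults WordNet.
LEXICON = {'pond', 'sword', 'mass', 'basis'}
SUFFIXES = ['es', 's']

def toy_lemmatize(word):
    """Strip a suffix only when the result is a known dictionary word."""
    if word in LEXICON:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            if candidate in LEXICON:
                return candidate
    return word  # no valid shorter form found; leave unchanged

for w in ['ponds', 'masses', 'basis', 'lying']:
    print(w, '->', toy_lemmatize(w))
# ponds -> pond
# masses -> mass
# basis -> basis
# lying -> lying
```

Unlike a rule-only stemmer, this check never produces a non-word such as basi, at the cost of a dictionary lookup for every candidate.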
If you want to compile the vocabulary of some texts and need a list of valid lemmas (or lexicon headwords), the WordNet lemmatizer is a good choice.