Natural Language Processing 3.6: Normalizing Text

Source: Internet
Author: User
Tags: nltk

In earlier examples, text was often converted to lowercase before any further processing, e.g. (w.lower() for w in words). Calling lower() normalizes the text to lowercase so that the difference between "the" and "The" is ignored.
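As a small illustration of the effect, the sketch below lowercases a short token list and compares the number of distinct word types before and after. The sample tokens are made up for this example and do not come from the text used in this section.

```python
# Hypothetical sample tokens, chosen only to illustrate case normalization.
words = ["The", "cat", "saw", "the", "other", "Cat"]

# Normalize every token to lowercase.
normalized = [w.lower() for w in words]

# Distinct word types before and after normalization.
raw_vocab = set(words)
normalized_vocab = set(normalized)

print(len(raw_vocab), len(normalized_vocab))
```

After normalization, "The"/"the" and "cat"/"Cat" each collapse into a single vocabulary entry.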

Often we want to go further than this and strip any suffixes from the words, a task known as stemming. A further step is to make sure that the resulting form is a word found in a dictionary, a task known as lemmatization. First, define the data used in this section.

>>> import nltk
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)

1. Stemmers

NLTK includes several off-the-shelf stemmers. If you need a stemmer, use one of these in preference to crafting your own with regular expressions, because NLTK's stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers each strip suffixes according to their own rules. The following example shows that the Porter stemmer correctly handles lying (mapping it to lie), while the Lancaster stemmer does not.

>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> print([porter.stem(t) for t in tokens])
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> print([lancaster.stem(t) for t in tokens])
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']
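To see why a hand-rolled regular-expression stemmer tends to fall short, here is a deliberately naive sketch. The suffix list and the name naive_stem are illustrative assumptions, not part of NLTK.

```python
import re

# A deliberately naive suffix stripper; the suffix list is an illustrative assumption.
SUFFIX_PATTERN = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$')

def naive_stem(word):
    """Return the shortest prefix that is followed by one listed suffix (or by nothing)."""
    return SUFFIX_PATTERN.match(word).group(1)

for w in ["lying", "ponds", "women", "basis"]:
    print(w, "->", naive_stem(w))
```

This sketch reduces lying to ly rather than lie, which is exactly the kind of irregular case the Porter stemmer handles in the output above.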

Stemming is not a well-defined process, so we typically pick the stemmer that best suits the application at hand. The Porter stemmer is a good choice if you are indexing text and want searches to match alternative forms of a word. The following class uses a stemmer to build such an index:

class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()


>>> porter = nltk.PorterStemmer()
>>> grail = nltk.corpus.webtext.words('grail.txt')
>>> text = IndexedText(porter, grail)
>>> text.concordance('lie')
r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t
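The index inside IndexedText maps each stem to the positions where it occurs. A minimal sketch of that idea with a plain dictionary follows; toy_stem is a hypothetical stand-in that only lowercases and drops one final "s", not the Porter algorithm, and build_stem_index is a name invented for this example.

```python
from collections import defaultdict

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: lowercase and drop one final "s".
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

def build_stem_index(tokens, stem):
    """Map each stem to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[stem(token)].append(position)
    return index

tokens = ["All", "lies", "!", "You", "may", "Lie", "here", "."]
index = build_stem_index(tokens, toy_stem)
print(index["lie"])
```

Because both "lies" and "Lie" share the key "lie", a single lookup finds every position, which is what lets the concordance above match lying, lies, lie and Lie from one query.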

2. Lemmatization

The WordNet lemmatizer removes a suffix only if the resulting word is in its dictionary. This additional check makes the lemmatizer slower than the stemmers above. Note that it does not handle lying, but it does convert women to woman:

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']
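The dictionary check can be illustrated with a toy sketch. Everything here (LEXICON, IRREGULAR, SUFFIX_RULES, toy_lemmatize) is a hypothetical simplification for illustration and is not how WordNet stores or applies its rules.

```python
# Hypothetical mini-dictionary and rules; WordNet's real data is far larger.
LEXICON = {"woman", "pond", "sword", "basis", "mass", "government"}
IRREGULAR = {"women": "woman"}
SUFFIX_RULES = [("ses", "s"), ("s", "")]

def toy_lemmatize(word):
    """Strip a suffix only when the result is a known dictionary word."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    for suffix, replacement in SUFFIX_RULES:
        if lower.endswith(suffix):
            candidate = lower[:len(lower) - len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word  # no dictionary-backed change found

for w in ["women", "ponds", "masses", "basis", "lying"]:
    print(w, "->", toy_lemmatize(w))
```

Unlike a pure suffix stripper, this sketch leaves basis intact, because the candidate basi is not in the dictionary; the extra lookup on every candidate is also where the added cost comes from.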

The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and need a list of valid lemmas (headwords).
