Natural Language Processing 3.6: Normalizing Text

Last Update:2016-10-22 Source: Internet

Author: User

Tags nltk



In the previous examples, the text was often converted to lowercase before being processed, for example (w.lower() for w in words). Using lower() normalizes the text to lowercase, so that the difference between "the" and "The" is ignored.
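As a minimal sketch of this step (the sample word list is made up for illustration), lowercasing merges vocabulary entries that differ only in case:

```python
# Hypothetical sample tokens; any list of strings would do.
words = ["The", "cat", "saw", "the", "Cat", "and", "THE", "dog"]

# Without normalization, "The", "the" and "THE" count as different words.
raw_vocab = set(words)

# Lowercasing first merges them into a single vocabulary entry.
normalized_vocab = set(w.lower() for w in words)

print(len(raw_vocab), len(normalized_vocab))
```

The normalized vocabulary is smaller because the case variants now share one entry.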

Often we want to go further than this and strip any affixes from the words, a task known as stemming. A further step is to make sure that the resulting form is a word found in a dictionary, a task known as lemmatization. First, define the data used in this section.

>>> import nltk
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)

1. Stemmers

NLTK includes several off-the-shelf stemmers. If you need a stemmer, use one of these rather than writing your own with regular expressions, because the NLTK stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers each strip affixes according to their own rules. The following example shows that the Porter stemmer correctly handles lying (mapping it to lie), while the Lancaster stemmer does not.

>>> import nltk
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> print([porter.stem(t) for t in tokens])
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> print([lancaster.stem(t) for t in tokens])
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']
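To see why a hand-written regular-expression stemmer tends to be fragile, consider the following sketch. The function naive_stem and its suffix list are assumptions made for illustration; they are not part of NLTK.

```python
import re

def naive_stem(word):
    """Strip one common suffix with a regular expression (illustration only)."""
    # Non-greedy stem of at least one character, then an optional suffix.
    match = re.match(r"^(.+?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$", word)
    return match.group(1)

for w in ["ponds", "distributing", "lying", "is", "basis"]:
    # Regular plurals work, but "lying" loses "ing" and "is" loses its "s".
    print(w, "->", naive_stem(w))
```

A pattern like this handles regular cases such as ponds, but it has no notion of which results are sensible, so short or irregular words are cut too aggressively.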

Stemming is not a well-defined process, so we usually pick the stemmer that best suits the application at hand. The Porter stemmer is a good choice if you are indexing text and want searches to match alternative forms of a word.

class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

>>> porter = nltk.PorterStemmer()
>>> grail = nltk.corpus.webtext.words('grail.txt')
>>> text = IndexedText(porter, grail)
>>> text.concordance('lie')
r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t
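The core idea behind the indexed text can be sketched without NLTK: key an index by stem, so that a query for one form finds every position of its variants. Here toy_stem is a hypothetical stand-in for a real stemmer, build_index is a helper invented for this sketch, and a plain defaultdict plays the role of the index.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, drop a final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(tokens, stem):
    """Map each stem to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[stem(token)].append(position)
    return index

tokens = "All lies ! You may lie here . Lie down .".split()
index = build_index(tokens, toy_stem)

# Looking up the stem of the query finds "lies", "lie" and "Lie" alike.
print(index[toy_stem("lie")])
```

Passing the stemmer in as a function keeps the index independent of any particular stemming algorithm, so a Porter stemmer's stem method could presumably be supplied in place of toy_stem.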

2. Lemmatization

The WordNet lemmatizer removes an affix only if the resulting word is in its dictionary. This additional check makes the lemmatizer slower than the stemmers above. Notice that it does not handle lying, but it does convert women to woman:

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']
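The dictionary check can be illustrated with a small sketch. The word list LEXICON, the RULES table and toy_lemmatize are all invented for this example; WordNet's own vocabulary and its handling of irregular forms are far more extensive and work differently in detail.

```python
# Hypothetical mini-dictionary standing in for WordNet's vocabulary.
LEXICON = {"pond", "sword", "mass", "basis", "woman", "lie"}

# Hypothetical (suffix, replacement) rules, tried in order.
RULES = [("es", ""), ("s", ""), ("men", "man")]

def toy_lemmatize(word):
    """Return a rewritten form only if it is in LEXICON; otherwise the word."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word

for w in ["ponds", "masses", "women", "basis", "lying"]:
    print(w, "->", toy_lemmatize(w))
```

Because every candidate is checked against the dictionary, basis is left intact instead of being cut to a non-word, at the cost of an extra lookup per rule. In NLTK itself, lemmatize also accepts a pos argument that defaults to nouns, which is one reason a verb form such as lying is left unchanged in the session above.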

The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and need a list of valid lemmas (or lexicon headwords).

