"Stove-refining AI" machine learning 036-NLP-lemmatization

Last Update:2018-10-09 Source: Internet

Author: User



(Python libraries and version numbers used in this article: Python 3.6, NumPy 1.14, scikit-learn 0.19, matplotlib 2.2, NLTK 3.3)

Lemmatization also converts words back to their base form, but it is not the same as the stemming described in the previous article. Lemmatization is harder and takes a more structured approach. In the stemming example from the previous article, wolves was reduced to wolv, which is certainly not what we want. So here we use lemmatization to restore English words to their dictionary form.
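To make the difference concrete, here is a minimal pure-Python sketch. It is not NLTK's stemmer or lemmatizer: the suffix list `SUFFIXES` and the tiny dictionary `TOY_LEMMAS` are made up for illustration. It only shows why blind suffix stripping can produce a non-word such as wolv, while a dictionary lookup returns a real word.

```python
# Toy contrast between stemming and lemmatization.
# Illustrative only: SUFFIXES and TOY_LEMMAS are hypothetical, not NLTK data.
SUFFIXES = ('es', 's', 'ing', 'ed')

def toy_stem(word):
    """Strip the first matching suffix without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# A lemmatizer maps each inflected form to a dictionary entry instead.
TOY_LEMMAS = {'wolves': 'wolf', 'beaches': 'beach', 'playing': 'play'}

def toy_lemmatize(word):
    """Look the word up in the dictionary; return it unchanged if it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ['wolves', 'beaches', 'playing']:
    print(w, toy_stem(w), toy_lemmatize(w))
```

For wolves the toy stemmer returns wolv, while the dictionary lookup returns wolf.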


## 1. Lemmatization in NLP

Lemmatization is a dictionary-based mapping, and NLTK needs the part of speech to be specified manually, otherwise the result may be inaccurate. So in a natural language processing pipeline, the text is first tokenized, then tagged with parts of speech, and then lemmatized.

```python
from nltk.stem import WordNetLemmatizer  # needs the WordNet corpus: nltk.download('wordnet')

# words to be lemmatized
words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']

# the part of speech must be given first, so we test with both noun and verb
lemmatizers = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']  # two parts of speech
lemmatizer = WordNetLemmatizer()
formatted_row = '{:>24}' * (len(lemmatizers) + 1)  # keeps the printed columns aligned
print(formatted_row.format('WORD', *lemmatizers))  # print the table header

for word in words:  # lemmatize each word
    # pos is the part of speech: 'n' for noun, 'v' for verb
    lemmatized = [lemmatizer.lemmatize(word, pos='n'), lemmatizer.lemmatize(word, pos='v')]
    print(formatted_row.format(word, *lemmatized))  # unpack the two lemmas into the row
```

**-------------------------------------Output-----------------------------------------**

WORD NOUN LEMMATIZER VERB LEMMATIZER
table table table
probably probably probably
wolves wolf wolves
playing playing play
is is be
dog dog dog
the the the
beaches beach beach
grounded grounded ground
dreamt dreamt dream
envision envision envision

**--------------------------------------------End-------------------------------------**
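The columns line up because of the `'{:>24}'` format specifier: each field is right-aligned and padded to 24 characters, and the specifier is repeated once per column. The short sketch below isolates that formatting step. The row values are typed in by hand, so it runs without NLTK.

```python
# Build one right-aligned, 24-character-wide field per column.
columns = ['WORD', 'NOUN LEMMATIZER', 'VERB LEMMATIZER']
row_template = '{:>24}' * len(columns)  # '{:>24}{:>24}{:>24}'

header = row_template.format(*columns)
row = row_template.format('wolves', 'wolf', 'wolves')  # hand-written sample row

print(header)
print(row)
print(len(header), len(row))  # both lines have the same total width
```

Because every field has a fixed width, the header and each row are all 72 characters long, whatever the length of the individual words (as long as none exceeds 24 characters).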

As can be seen, WordNetLemmatizer is very useful, but the part of speech has to be determined beforehand and passed to the function as a parameter. Is it possible to determine the part of speech automatically, without judging it ourselves? Yes. NLTK has a function, pos_tag, that automatically determines the part of speech of each word. Based on it, we can write a function that processes an entire sentence and outputs the lemmatized form of every word. The code is as follows:

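```python
# word_tokenize and pos_tag need NLTK data:
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

# define a function that lemmatizes all the words in a sentence
def lemmatize_all(sentence):
    wnl = WordNetLemmatizer()
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith('NN'):      # nouns
            yield wnl.lemmatize(word, pos='n')
        elif tag.startswith('VB'):    # verbs
            yield wnl.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):    # adjectives
            yield wnl.lemmatize(word, pos='a')
        elif tag.startswith('R'):     # adverbs
            yield wnl.lemmatize(word, pos='r')
        else:                         # everything else is returned unchanged
            yield word

```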

```python
text = 'dog runs, cats drunk wines, chicken eat apples, foxes jumped two meters'
print('/'.join(lemmatize_all(text)))
```

**-------------------------------------Output-----------------------------------------**

dog/run/,/cat/drink/wine/,/chicken/eat/apple/,/fox/jump/two/meter

**--------------------------------------------End-------------------------------------**

As can be seen, plural nouns in the sentence are converted to the singular, past-tense verbs are converted to their base form, and so on.

The pos_tag() function returns each word together with its part-of-speech tag. The tags are:

```
# NLTK POS tags:
CC    coordinating conjunction                    and, or, but
CD    numeral                                     twenty-four, fourth, 1991, 14:24
DT    determiner                                  the, a, some, most, every, no
EX    existential there                           there, there's
FW    foreign word                                dolce, ersatz, esprit, quo, maitre
IN    preposition or subordinating conjunction    on, of, at, with, by, into, under, if, while, although
JJ    adjective                                   new, good, high, special, big, local
JJR   adjective, comparative                      bleaker braver breezier briefer brighter brisker
JJS   adjective, superlative                      calmest cheapest choicest classiest cleanest clearest
LS    list item marker                            A A. B B. C C. D E F First G H I J K
MD    modal verb                                  can cannot could couldn't
NN    noun                                        year, home, costs, time, education
NNS   noun, plural                                undergraduates scotches
NNP   proper noun                                 Alison, Africa, April, Washington
NNPS  proper noun, plural                         Americans Americas Amharas Amityvilles
PDT   predeterminer                               all both half many
POS   possessive marker                           's
PRP   personal pronoun                            hers herself him himself hisself
PRP$  possessive pronoun                          his mine my ours
RB    adverb                                      occasionally unabatingly maddeningly
RBR   adverb, comparative                         further gloomier grander
RBS   adverb, superlative                         best biggest bluntest earliest
RP    particle                                    aboard about across along apart
SYM   symbol                                      % & "". ) )
TO    "to"                                        to
UH    interjection                                Goodbye Goody Gosh Wow
VB    verb, base form                             ask assemble assess
VBD   verb, past tense                            dipped pleaded swiped
VBG   verb, present participle                    telegraphing stirring focusing
VBN   verb, past participle                       multihulled dilapidated aerosolized
VBP   verb, present tense, non-third-person       predominate wrap resort sue
VBZ   verb, present tense, third-person singular  bases reconstructs marks
WDT   wh-determiner                               which, what
WP    wh-pronoun                                  who, what, whatever
WP$   possessive wh-pronoun                       whose
WRB   wh-adverb                                   when, where, how

```
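Only four tag families matter for WordNetLemmatizer: nouns, verbs, adjectives and adverbs. A minimal sketch of that mapping is shown below as a standalone helper. The name `penn_to_wordnet` is hypothetical, and the `tagged` list is written by hand in the (word, tag) shape that pos_tag returns, so it is not real tagger output. Unlike the `startswith('R')` test used in lemmatize_all, this sketch matches `'RB'` so that particles (RP) are not treated as adverbs; that is a design choice of the sketch, not something the original code does.

```python
# Map Penn Treebank tags to the one-letter part-of-speech codes
# ('n', 'v', 'a', 'r') accepted by WordNetLemmatizer.lemmatize.
def penn_to_wordnet(tag):
    """Return 'n', 'v', 'a' or 'r' for a Penn Treebank tag, or None if none applies."""
    if tag.startswith('NN'):
        return 'n'
    if tag.startswith('VB'):
        return 'v'
    if tag.startswith('JJ'):
        return 'a'
    if tag.startswith('RB'):
        return 'r'
    return None

# Hand-written sample in the (word, tag) shape returned by pos_tag;
# the tags are illustrative, not produced by running the tagger.
tagged = [('foxes', 'NNS'), ('jumped', 'VBD'), ('two', 'CD'), ('meters', 'NNS')]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Words whose tag maps to None, such as the numeral two here, would simply be passed through unchanged.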

**\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#Summary\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#**

**1. Lemmatization in NLP can be done with WordNetLemmatizer, but the part of speech of each word has to be determined beforehand and passed to the function as a parameter.**

**2. To avoid judging the part of speech by hand, NLTK provides the pos_tag function, which outputs the part of speech of each word automatically. Combining pos_tag with WordNetLemmatizer makes it possible to tokenize, tag, and lemmatize a whole passage of text automatically.**

**\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#**


Note: The code for this section has been uploaded to [**my GitHub**](https://github.com/RayDean/MachineLearning). You are welcome to download it.

References:

1. Python Machine Learning Cookbook, Prateek Joshi; translated by Tao Junjie and Chen Xiaoli
