"Natural Language Processing": Explaining the Principles of Natural Language Processing with NLTK

Last Update:2018-07-08 Source: Internet

Author: User


First, introduction


Second, text preprocessing

1. Installing NLTK

pip install -U nltk

Install the corpora (a collection of dialogues and a set of models):

import nltk
nltk.download()

2. Function List:

3. Text Processing Flow

4. Tokenize: split a long sentence into meaningful parts

import jieba

# "I came to Tsinghua University in Beijing"
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode:", "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode:", "/ ".join(seg_list))  # precise mode

# "He came to the NetEase Hangyan building"
seg_list = jieba.cut("他来到了网易杭研大厦")  # the default is precise mode
print(", ".join(seg_list))

# "Xiao Ming received his master's degree from the Institute of Computing
# Technology, Chinese Academy of Sciences, then studied at Kyoto University in Japan"
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

Results:

[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Precise mode]: 我/ 来到/ 北京/ 清华大学
[New word recognition]: 他, 来到, 了, 网易, 杭研, 大厦 (here, "杭研" is not in the dictionary, but it is still recognized by the Viterbi algorithm)
[Search engine mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
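
To make the difference between full mode and precise mode more concrete, here is a minimal sketch that uses only the standard library. It is not jieba's algorithm: jieba relies on a much larger dictionary with word frequencies and, as noted above, uses the Viterbi algorithm for words missing from the dictionary. The names `TOY_DICT`, `full_mode`, and `forward_max_match` are hypothetical, and the tiny dictionary is an assumption made purely for illustration.

```python
# Hypothetical toy dictionary; a real segmenter ships a far larger one.
TOY_DICT = {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}


def full_mode(sentence, dictionary):
    """Return every dictionary word found in the sentence, scanning left to right."""
    max_len = max(len(word) for word in dictionary)
    found = []
    for start in range(len(sentence)):
        longest_end = min(len(sentence), start + max_len)
        for end in range(start + 1, longest_end + 1):
            if sentence[start:end] in dictionary:
                found.append(sentence[start:end])
    return found


def forward_max_match(sentence, dictionary):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


sentence = "我来到北京清华大学"
print("/ ".join(full_mode(sentence, TOY_DICT)))          # 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
print("/ ".join(forward_max_match(sentence, TOY_DICT)))  # 我/ 来到/ 北京/ 清华大学
```

Greedy longest match is only a stand-in for precise mode here; it happens to give the same split on this sentence but will differ on ambiguous input.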

Tokenizing social network text:

import re

emoticons_str = r"""
    (?:
        [:=;]  # eyes
        [oO\-]?  # nose (optional)
        [D\)\]\(\]/\\OpP]  # mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags (topic tags)
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words containing - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]

Regular expression reference table:
http://www.regexlab.com/zh/regref.htm

This makes it possible to handle symbols such as emoticons in social network text:

tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        # Lowercase everything except emoticons, so ':D' does not become ':d'
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
print(preprocess(tweet))
# ['RT', '@angelababy', ':', 'love', 'you', 'baby',
#  '!', ':D', 'http://ah.love', '#168cm']

5. Normalizing word forms

Stemming: cutting off inflectional endings that do not affect the part of speech
walking, cut "ing" = walk
walked, cut "ed" = walk
Lemmatization: reducing all the variant forms of a word to a single form
went, normalized = go
is, normalized = be
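
The distinction above can be sketched in a few lines of standard-library Python. This is a toy under stated assumptions, not how NLTK implements either technique: `SUFFIXES`, `IRREGULAR`, `toy_stem`, and `toy_lemmatize` are hypothetical names, and falling back to the stem when a word is not in the lookup table is a simplification.

```python
# Hypothetical suffix list and irregular-form table, for illustration only.
SUFFIXES = ("ing", "ed")
IRREGULAR = {"went": "go", "is": "be", "are": "be", "was": "be"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the irregular table, else fall back to the toy stem."""
    return IRREGULAR.get(word, toy_stem(word))


for w in ["walking", "walked", "went", "is"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
# walking -> walk / walk
# walked -> walk / walk
# went -> went / go
# is -> is / be
```

Stemming only looks at the surface string, so it cannot relate "went" to "go"; lemmatization needs some form of vocabulary knowledge to do so.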

>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
'maximum'
>>> porter_stemmer.stem('presumably')
'presum'
>>> porter_stemmer.stem('multiply')
'multipli'
>>> porter_stemmer.stem('provision')
'provis'
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
'maximum'
>>> snowball_stemmer.stem('presumably')
'presum'
>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> from nltk.stem.porter import PorterStemmer
>>> p = PorterStemmer()
>>> p.stem('went')
'went'
>>> p.stem('wenting')
'went'

6. Part-of-speech (POS) tagging

>>> import nltk
>>> text = nltk.word_tokenize('what does the fox say')
>>> text
['what', 'does', 'the', 'fox', 'say']
>>> nltk.pos_tag(text)
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
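
One common use of the tags is to keep only certain word classes. The sketch below works on a hard-coded list in the shape that `nltk.pos_tag` returns, based on the output above, so it runs without NLTK. The helper name `keep_by_tag_prefix` is hypothetical; the only outside fact it relies on is that Penn Treebank noun tags start with `NN` and verb tags with `VB`.

```python
# Tagged tokens in the (word, tag) shape returned by nltk.pos_tag.
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]


def keep_by_tag_prefix(tagged_tokens, prefixes):
    """Keep the words whose tag starts with one of the given prefixes."""
    wanted = tuple(prefixes)
    return [word for word, tag in tagged_tokens if tag.startswith(wanted)]


# Keep nouns (NN*) and verbs (VB*), dropping determiners and wh-words.
content_words = keep_by_tag_prefix(tagged, ["NN", "VB"])
print(content_words)  # ['does', 'fox', 'say']
```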

7. Stopwords

First, remember to download the stopword list from the console, or call nltk.download('stopwords').

from nltk.corpus import stopwords
# First tokenize to get a word_list
# ...
# Then filter out the stopwords
filtered_words = [word for word in word_list if word not in stopwords.words('english')]

8. The complete text preprocessing pipeline
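
As a rough illustration, the steps from sections 4, 5, and 7 can be chained into one function. This is a sketch under stated assumptions rather than the article's own pipeline: the step order is assumed, POS tagging is left out, and the regex tokenizer, the suffix-stripping `normalize`, and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins for the NLTK tokenizer, stemmers, and stopword corpus.

```python
import re

# Assumed, deliberately tiny stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}


def tokenize(text):
    """Lowercase the text and pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def normalize(word):
    """Crude stemming stand-in: strip a trailing 'ing' or 'ed' from longer words."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text, stopwords=TOY_STOPWORDS):
    """Raw text -> tokens -> normalized tokens -> stopwords removed -> word_list."""
    tokens = tokenize(text)
    normalized = [normalize(token) for token in tokens]
    return [token for token in normalized if token not in stopwords]


word_list = preprocess("The fox walked to the river and is jumping in")
print(word_list)  # ['fox', 'walk', 'river', 'jump']
```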

Third, natural language processing applications.

In effect, preprocessing converts text into a word_list; natural language processing then turns that word_list into a representation the computer can work with.
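
One simple way to turn a word_list into numbers is a bag-of-words count vector. The article does not say which representation it has in mind, so treat this as one possible sketch; `build_vocabulary` and `to_vector` are hypothetical helper names.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the sorted set of words across all word lists."""
    return sorted({word for doc in documents for word in doc})


def to_vector(word_list, vocabulary):
    """Count how often each vocabulary word occurs in one word list."""
    counts = Counter(word_list)
    return [counts[word] for word in vocabulary]


docs = [["fox", "say", "fox"], ["dog", "say"]]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
print(vocab)    # ['dog', 'fox', 'say']
print(vectors)  # [[0, 2, 1], [1, 0, 1]]
```

Each document becomes a fixed-length list of counts, one position per vocabulary word, which is a form that numeric algorithms can consume.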

Natural language processing has several applications: sentiment analysis, text similarity, and text classification.
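
Of these applications, text similarity is the easiest to sketch. The example below computes cosine similarity between the word counts of two word lists. Cosine similarity is a common choice, not necessarily the method the article intends, and `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the word-count vectors of two word lists."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty word list is treated as similar to nothing
    return dot / (norm_a * norm_b)


same = cosine_similarity(["fox", "say"], ["fox", "say"])
partial = cosine_similarity(["fox", "say"], ["dog", "say"])
disjoint = cosine_similarity(["fox"], ["dog"])
print(round(same, 3), round(partial, 3), round(disjoint, 3))  # 1.0 0.5 0.0
```

A score near 1 means the two word lists use the same words in similar proportions; a score of 0 means they share no words.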


This article is an English version of an article originally published in Chinese on aliyun.com.
