"Natural Language Processing"--on the basis of NLTK to explain the nature of the word? Principles of processing


First, preface

What is natural language processing?

Second, text preprocessing

1. Installing NLTK

    pip install -U nltk

Install the corpora (collections of dialogues and pre-trained models):

    import nltk
    nltk.download()

2. Function List:

3. Text Processing Flow

4. Tokenize: split a long sentence into meaningful parts

    import jieba

    # "I came to Tsinghua University in Beijing"
    seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
    print("Full Mode: " + "/ ".join(seg_list))  # full mode

    seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
    print("Default Mode: " + "/ ".join(seg_list))  # precise mode

    # "He came to the NetEase Hangyan building"; the default is precise mode
    seg_list = jieba.cut("他来到了网易杭研大厦")
    print(", ".join(seg_list))

    # "Xiao Ming got his master's degree at the Institute of Computing,
    # Chinese Academy of Sciences, then studied at Kyoto University in Japan"
    seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
    print(", ".join(seg_list))

Results (tokens shown in English translation):

"Full mode": I / came to / Beijing / Tsinghua / Tsinghua University / Hua Da / University

"Precise mode": I / came to / Beijing / Tsinghua University

"New word recognition": he, came to, le (particle), NetEase, Hangyan, building (here "Hangyan" is not in the dictionary, but it is still recognized by the Viterbi algorithm)

"Search engine mode": Xiao Ming, master's, graduated, from, China, science, academy, Academy of Sciences, Chinese Academy of Sciences, computing, Institute of Computing, after, in, Japan, Kyoto, university, Kyoto University of Japan, advanced studies

Tokenizing social-network text:

    import re

    emoticons_str = r"""
        (?:
            [:=;]                 # eyes
            [oO\-]?               # nose (optional)
            [D\)\]\(\]/\\OpP]     # mouth
        )"""

    regex_str = [
        emoticons_str,
        r'<[^>]+>',  # HTML tags
        r'(?:@[\w_]+)',  # @-mentions
        r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
        r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
        r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
        r"(?:[a-z][a-z'\-_]+[a-z])",  # words containing - and '
        r'(?:[\w_]+)',  # other words
        r'(?:\S)'  # anything else
    ]

Regular expression reference table:
http://www.regexlab.com/zh/regref.htm

This makes it possible to handle symbols such as emoticons in social-media text:

    tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
    emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

    def tokenize(s):
        # Return every substring matched by one of the patterns, in order
        return tokens_re.findall(s)

    def preprocess(s, lowercase=False):
        tokens = tokenize(s)
        if lowercase:
            # Lowercase everything except emoticons (":D" must stay as it is)
            tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
        return tokens

    tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
    print(preprocess(tweet))
    # ['RT', '@angelababy', ':', 'love', 'you', 'baby',
    #  '!', ':D', 'http://ah.love', '#168cm']

5. Word-form normalization

Stemming: cutting off inflectional endings that do not affect the part of speech.
walking, cut "ing" = walk
walked, cut "ed" = walk
Lemmatization: reducing all inflected forms of a word to a single base form.
went, normalized = go
is, normalized = be

    >>> from nltk.stem.porter import PorterStemmer
    >>> porter_stemmer = PorterStemmer()
    >>> porter_stemmer.stem('maximum')
    'maximum'
    >>> porter_stemmer.stem('presumably')
    'presum'
    >>> porter_stemmer.stem('multiply')
    'multipli'
    >>> porter_stemmer.stem('provision')
    'provis'
    >>> from nltk.stem import SnowballStemmer
    >>> snowball_stemmer = SnowballStemmer("english")
    >>> snowball_stemmer.stem('maximum')
    'maximum'
    >>> snowball_stemmer.stem('presumably')
    'presum'
    >>> from nltk.stem.lancaster import LancasterStemmer
    >>> lancaster_stemmer = LancasterStemmer()
    >>> lancaster_stemmer.stem('maximum')
    'maxim'
    >>> lancaster_stemmer.stem('presumably')
    'presum'
    >>> from nltk.stem.porter import PorterStemmer
    >>> p = PorterStemmer()
    >>> p.stem('went')
    'went'
    >>> p.stem('wenting')
    'went'
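The stemmers above leave 'went' unchanged because stemming only strips endings. As a rough sketch of how lemmatization differs, the toy code below combines suffix stripping with a lookup table of irregular forms. It is not NLTK's algorithm: `TOY_LEMMAS`, `TOY_SUFFIXES`, `toy_stem` and `toy_lemmatize` are all hypothetical names made up for this illustration, and a real lemmatizer relies on a full lexicon.

```python
# Hypothetical, hand-written lookup table of irregular forms (not from NLTK).
TOY_LEMMAS = {"went": "go", "is": "be", "are": "be", "was": "be"}
TOY_SUFFIXES = ("ing", "ed")


def toy_stem(word):
    """Strip a known suffix if at least three characters remain."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look up irregular forms first, otherwise fall back to suffix stripping."""
    return TOY_LEMMAS.get(word, toy_stem(word))


for w in ["walking", "walked", "went", "is"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Stemming maps "walking" and "walked" to "walk" but cannot relate "went" to "go"; that step needs dictionary knowledge, which is the part lemmatization adds.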

6. Part-of-speech (POS) tagging

    >>> import nltk
    >>> text = nltk.word_tokenize('what does the fox say')
    >>> text
    ['what', 'does', 'the', 'fox', 'say']
    >>> nltk.pos_tag(text)
    [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

7. Stopwords

First, remember to download the stopword list from the console, or with nltk.download('stopwords').

    from nltk.corpus import stopwords

    # First tokenize the text to get a word_list
    # ...
    # Then filter it
    filtered_words = [word for word in word_list if word not in stopwords.words('english')]
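The filtering step can be tried without downloading any corpus. The sketch below assumes a tiny hand-written stopword set; `TOY_STOPWORDS` and `remove_stopwords` are hypothetical names, and the set is far smaller than NLTK's English list.

```python
# Hypothetical miniature stopword set, standing in for NLTK's English list.
TOY_STOPWORDS = frozenset({"a", "an", "the", "is", "of", "to", "does", "what"})


def remove_stopwords(word_list, stopword_set=TOY_STOPWORDS):
    """Keep only the words whose lowercase form is not a stopword."""
    return [word for word in word_list if word.lower() not in stopword_set]


word_list = ["what", "does", "the", "fox", "say"]
print(remove_stopwords(word_list))  # ['fox', 'say']
```

A set is used here because membership tests on a set are fast. With NLTK, the same effect can probably be had by building `set(stopwords.words('english'))` once before the list comprehension instead of calling `stopwords.words` for every word.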

8. A complete text preprocessing pipeline

Third, natural language processing applications.

In essence, preprocessing converts raw text into a word_list; natural language processing then turns that word_list into a representation the computer can work with.
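One common way to turn a word_list into something a computer can work with is a bag-of-words count vector. The sketch below is an assumption about what such a representation might look like, not the method the original post goes on to use; `build_vocabulary` and `to_count_vector` are hypothetical helper names.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct word to a fixed column index (alphabetical order)."""
    vocab = sorted({word for doc in documents for word in doc})
    return {word: index for index, word in enumerate(vocab)}


def to_count_vector(word_list, vocabulary):
    """Count how often each vocabulary word occurs in word_list."""
    counts = Counter(word_list)
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


docs = [["fox", "say", "fox"], ["dog", "say"]]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # {'dog': 0, 'fox': 1, 'say': 2}
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
```

Each word_list becomes a list of numbers of the same length, which is the form most downstream models expect.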

Natural language processing has several applications: sentiment analysis, text similarity, and text classification.
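As an illustration of the text similarity application, two word_lists can be compared by the cosine of the angle between their word-count vectors. This is a minimal sketch using only the standard library; `cosine_similarity` is a hypothetical helper, and real systems usually weight the counts (for example with TF-IDF) before comparing.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity between the word-count vectors of two word lists."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # Only words present in both lists contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty list is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity(["fox", "say"], ["fox", "run"]))
print(cosine_similarity(["fox", "say"], ["dog", "bark"]))
```

Lists that share more words score closer to 1, and lists with no words in common score 0.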

"Natural Language Processing"--on the basis of NLTK to explain the nature of the word? Principles of processing
