First, Preface
What is natural language processing (NLP)?
Second, text preprocessing
1. Install NLTK
pip install -U nltk
Install the corpora (collections of text data plus trained models):
import nltk
nltk.download()
2. Function List:
3. Text Processing Flow
4. Tokenize: split a long sentence into "meaningful" units
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))    # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list)) # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # the default is precise mode
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))
Results:

Full mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
Precise mode: 我/ 来到/ 北京/ 清华大学
New word recognition: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but it is still recognized, thanks to the Viterbi algorithm)
Search engine mode: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
Tokenizing social-media text:
import re

emoticons_str = r"""
    (?:
        [:=;]              # eyes
        [oO\-]?            # nose (optional)
        [D\)\]\(\]/\\OpP]  # mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>',                        # HTML tags
    r'(?:@[\w_]+)',                    # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',      # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",       # words with - and '
    r'(?:[\w_]+)',                     # other words
    r'(?:\S)'                          # anything else
]
Regular expression reference table: http://www.regexlab.com/zh/regref.htm
This lets you handle emoticons and other special symbols in social-media text:
tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower()
                  for token in tokens]
    return tokens

tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
print(preprocess(tweet))
# ['RT', '@angelababy', ':', 'love', 'you', 'baby',
#  '!', ':D', 'http://ah.love', '#168cm']
5. Normalization of word morphology
Stemming: chop off the inflectional endings, the little suffixes that don't change the core meaning of the word:
walking, minus ing → walk
walked, minus ed → walk
Lemmatization: reduce every inflected form of a word back to its base form (lemma):
went → go
are / is → be
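The idea behind stemming can be sketched with a toy suffix-stripper. This is a deliberately simplified, hypothetical rule set for illustration only; real stemmers such as Porter or Snowball apply ordered rule lists with conditions on what remains after stripping:

```python
def toy_stem(word):
    # Try the longest suffixes first; only strip if a reasonable stem
    # (at least 3 characters) would remain.
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(toy_stem('walking'))  # walk
print(toy_stem('walked'))   # walk
print(toy_stem('walk'))     # walk
```

Note how "went" would pass through unchanged here; mapping went → go is exactly what lemmatization handles and suffix-stripping cannot.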
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
u'maximum'
>>> porter_stemmer.stem('presumably')
u'presum'
>>> porter_stemmer.stem('multiply')
u'multipli'
>>> porter_stemmer.stem('provision')
u'provis'
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
u'maximum'
>>> snowball_stemmer.stem('presumably')
u'presum'
>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> from nltk.stem.porter import PorterStemmer
>>> p = PorterStemmer()
>>> p.stem('went')
'went'
>>> p.stem('wenting')
'went'
6. Part-of-speech tagging
>>> import nltk
>>> text = nltk.word_tokenize('what does the fox say')
>>> text
['what', 'does', 'the', 'fox', 'say']
>>> nltk.pos_tag(text)
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
7. Stopwords
First, remember to download the stopword corpus in the console: nltk.download('stopwords')
from nltk.corpus import stopwords
# first tokenize to get a word_list
# ...
# then filter out the stopwords
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
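The same filtering step can be shown self-contained; the small hand-written stopword set below is a stand-in for stopwords.words('english'), so the sketch runs without downloading the NLTK corpus:

```python
# Hypothetical mini stopword set, standing in for stopwords.words('english').
stop_words = {'the', 'a', 'an', 'is', 'on', 'of', 'and'}

word_list = ['the', 'quick', 'brown', 'fox', 'is', 'on', 'the', 'lawn']

# Keep only the words that are not stopwords.
filtered_words = [w for w in word_list if w not in stop_words]
print(filtered_words)  # ['quick', 'brown', 'fox', 'lawn']
```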
8. A typical text preprocessing pipeline
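The pipeline can be sketched end to end. The regex tokenizer, toy stemmer, and mini stopword list below are simplified stand-ins for the NLTK/jieba components discussed above, so the sketch runs with the standard library alone:

```python
import re

# Hypothetical mini stopword set for illustration.
stop_words = {'the', 'a', 'is', 'rt'}

def toy_stem(word):
    # Simplified suffix-stripping, standing in for a real stemmer.
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess_pipeline(text):
    # 1. tokenize, 2. lowercase, 3. drop stopwords, 4. stem
    tokens = re.findall(r"[\w']+", text.lower())
    return [toy_stem(t) for t in tokens if t not in stop_words]

print(preprocess_pipeline('The fox is jumping over the walls'))
# ['fox', 'jump', 'over', 'wall']
```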
Third, applications of natural language processing
In essence, preprocessing converts raw text into a word list; natural language processing then converts that word list into a representation the computer can work with.
Common applications of natural language processing include: sentiment analysis, text similarity, and text classification.
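To give a flavor of the first application: sentiment analysis in its simplest form just counts lexicon words in the preprocessed word list. The positive/negative word sets below are made up for illustration, not a real sentiment lexicon:

```python
# Hypothetical tiny sentiment lexicon.
positive = {'love', 'great', 'good', 'happy'}
negative = {'hate', 'bad', 'awful', 'sad'}

def sentiment_score(word_list):
    # Score = (# positive words) - (# negative words).
    score = 0
    for w in word_list:
        if w in positive:
            score += 1
        elif w in negative:
            score -= 1
    return score

print(sentiment_score(['i', 'love', 'this', 'good', 'movie']))  # 2
print(sentiment_score(['what', 'an', 'awful', 'day']))          # -1
```

Real systems replace the hand-made lexicon with a learned classifier, but the input is the same preprocessed word list.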
Reference: "Natural Language Processing" — explains the principles of natural language processing on the basis of NLTK.