NLTK and Jieba: two Python natural language packages (HMM, RNN, sigmoid)


HMM (Hidden Markov Model), CRF (Conditional Random Field).

RNN (Recurrent Neural Network), a deep learning algorithm whose input is a continuous sequence. LSTM (Long Short-Term Memory) can still learn long-range dependencies from the corpus even when the input is discontinuous; its core is the backward recursive computation of dL(t)/dh(t) and dL(t+1)/ds(t).

The sigmoid function outputs a value between 0 and 1.
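As a quick aside (a minimal sketch, not part of the class below), the sigmoid is a one-liner:

    import math

    def sigmoid(x):
        # squashes any real-valued input into the open interval (0, 1)
        return 1.0 / (1.0 + math.exp(-x))

    print(sigmoid(0))    # 0.5
    print(sigmoid(6))    # roughly 0.9975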

NLTK and Jieba are two Python natural language packages: I mainly use NLTK to analyze the data after segmentation, and Jieba to segment articles into words.

Part.1 Installing the packages

pip install nltk
pip install jieba

A long road to textual analysis begins!

Part.2 How to use

From here on, I will briefly explain the two packages through the class I wrote.

2.1 Imports

# coding=utf-8
import jieba
import jieba.posseg as pseg
import codecs
import re
import os
import time
import string
from nltk.probability import FreqDist

open = codecs.open   # alias codecs.open so files below are opened with an explicit encoding

2.2 Loading a custom dictionary and stop words

The custom dictionary is imported so that, during segmentation, phrases we want to keep whole are not cut into smaller words. The stop-word list is used while processing the segmented words to throw out words that would only interfere with the analysis.

# jieba can load a custom dictionary; each line has the format: word [frequency] [part of speech]
jieba.load_userdict('data/userdict.txt')
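For illustration only, a userdict.txt might contain lines like the following (the words, frequencies, and tags here are made up; frequency and part of speech can be omitted):

    自然语言处理 5 n
    词频统计 3 n
    凯特琳 nz

The stop-word file read below is plain text as well, with one or more stop words per line.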

# Define a Keyword class
class Keyword(object):

    def chinese_stopwords(self):       # load the stop-word list
        stopword = []
        cfp = open('data/stopword.txt', 'r+', 'utf-8')   # txt file of stop words
        for line in cfp:
            for word in line.split():
                stopword.append(word)
        cfp.close()
        return stopword

2.3 Segmentation and word selection

First, segmentation. Segmentation can be called the first step of Chinese text analysis. Unlike English, Chinese has no spaces between words, so we first have to cut the article up, that is, chop the long text into individual words, and then work on those words.

    def word_cut_list(self, word_str):
        # Use regular expressions to remove whitespace and symbols such as punctuation.
        word_str = re.sub(r'\s+', ' ', word_str)    # collapse whitespace runs to a single space
        word_str = re.sub(r'\n+', ' ', word_str)    # newlines to space
        word_str = re.sub(r'\t+', ' ', word_str)    # tabs to space
        word_str = re.sub(u'[\s+\.\!\/_,$%^*(+\"\']+|[+——;!,”。《》,。:“?、~@#¥%……&*()1234567①②③④)]+',
                          u'', word_str)

        wordlist = list(jieba.cut(word_str))   # jieba.cut segments the string; turn the generator into a list
        wordlist_n = []
        chinese_stopwords = self.chinese_stopwords()
        for word in wordlist:
            if word not in chinese_stopwords:                # cleaning: drop stop words
                if word != '\r\n' and word != ' ' \
                        and word != u'\u3000' and word != u'\xa0':   # cleaning: drop full-width and non-breaking spaces
                    wordlist_n.append(word)
        return wordlist_n
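A rough usage sketch (it assumes data/stopword.txt exists; the input file name is hypothetical): jieba's default accurate mode splits a sentence into words, and word_cut_list additionally drops stop words and stray spaces.

    kw = Keyword('data/all/sample.txt')               # hypothetical input file
    print(list(jieba.cut(u'我来到北京清华大学')))        # [我, 来到, 北京, 清华大学]
    print(kw.word_cut_list(u'我来到北京清华大学'))       # the same words minus any stop words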

What do I mean by selection? When we analyze Chinese text, not every word is useful. Which kinds of words actually express the meaning of an article? Nouns, for example. So how do we extract the nouns?

    def word_pseg(self, word_str):     # noun extraction
        words = pseg.cut(word_str)
        word_list = []
        for wds in words:
            # Keep custom-dictionary words and the various noun tags. A custom-dictionary
            # word defaults to the 'x' flag when no part of speech was given, i.e. wds.flag is 'x'.
            if wds.flag == 'x' and wds.word != ' ' and wds.word != 'ns' \
                    or re.match(r'^n', wds.flag) is not None \
                    and re.match(r'^nr', wds.flag) is None:
                word_list.append(wds.word)
        return word_list
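A quick sketch of what jieba.posseg yields, which is what the flag checks above rely on; the tags follow jieba's part-of-speech set (for example ns for place names, nr for person names), and the output is shown as comments:

    for wds in pseg.cut(u'我爱北京天安门'):
        print(wds.word + ' ' + wds.flag)
    # 我 r
    # 爱 v
    # 北京 ns
    # 天安门 ns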

2.4 Sorting and running

We have now segmented the text and picked out the words we care about; how do we analyze the resulting list? The simplest thing is to count word frequencies, and once we have frequencies we naturally think of sorting them.

    def sort_item(self, item):        # sort (word, count) pairs by count, descending
        vocab = []
        for k, v in item:
            vocab.append((k, v))
        return list(sorted(vocab, key=lambda v: v[1], reverse=True))

    def run(self):
        apage = open(self.filename, 'r+', 'utf-8')
        word = apage.read()                       # read the whole article first
        wordp = self.word_pseg(word)              # select words by part of speech
        new_str = ''.join(wordp)
        wordlist = self.word_cut_list(new_str)    # segment the selected text
        apage.close()
        return wordlist

    def __init__(self, filename):
        self.filename = filename
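Before moving on to the main function, here is a minimal sketch (made-up word list, not from the original post) of how nltk's FreqDist produces the (word, count) pairs that sort_item consumes; ties may come out in any order:

    fdist = FreqDist([u'研究', u'方法', u'研究', u'数据'])
    print(fdist[u'研究'])                                       # 2
    print(sorted(fdist.items(), key=lambda v: v[1], reverse=True))
    # [(u'研究', 2), (u'方法', 1), (u'数据', 1)] -- the same ordering sort_item returns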

2.5 Main function: reading and analysis

After all of the above we still have not used the nltk package; that starts here. I mainly analyze the files in a folder and count the keywords of each file, that is, a simple word-frequency count, sorted and written out. My initial idea was to output only the top 10 keywords, but after many experiments I felt a fixed number of keywords biases the analysis, so I output keywords by cumulative percentage instead.
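To make the percentage cutoff concrete before the real main function below, here is a toy sketch with made-up counts: keywords are written out in frequency order until their cumulative share of all tokens passes 50%.

    counts = [(u'研究', 4), (u'方法', 3), (u'数据', 2), (u'其他', 1)]   # 10 tokens in total
    total = float(sum(n for _, n in counts))
    pre = 0.0
    for word, n in counts:
        pre += n / total
        print(word + ' ' + str(n / total))
        if pre > 0.5:        # the cumulative share has passed 50%, so stop
            break
    # prints 研究 0.4 and 方法 0.3, then stops at a cumulative share of 0.7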

if __name__ == '__main__':
    b_path = 'data/all'
    a_path = 'data/result'
    roots = os.listdir(b_path)
    alltime_s = time.time()
    for filename in roots:
        starttime = time.time()
        kw = Keyword(b_path + '/' + filename)
        wl = kw.run()
        fdist = FreqDist(wl)
        total = len(wl)
        pre = 0
        fn = open(a_path + '/' + filename, 'w+', 'utf-8')
        fn.write('sum: ' + str(total) + '\r\n')
        for (s, n) in kw.sort_item(fdist.items()):
            fn.write(s + '    ' + str(float(n) / total) + '    ' + str(n) + '\r\n')
            pre = pre + float(n) / total
            if pre > 0.5:                # stop once the cumulative share of tokens passes 50%
                fn.write(str(pre))
                fn.close()
                break
        endtime = time.time()
        print filename + '        finish time: ' + str(endtime - starttime)

Print "Total elapsed:" + str (time.time ()-alltime_s)

