NLTK and Jieba: two Python natural language packages (HMM, RNN, sigmoid)


HMM (Hidden Markov Model), CRF (Conditional Random Field).

RNN (Recurrent Neural Network), a deep learning algorithm whose input is a continuous sequence. LSTM (Long Short-Term Memory) can still learn long-range dependencies from the corpus even when the input is discontinuous; its core is the backward recursive computation of dL(t)/dh(t) and dL(t+1)/ds(t).

The sigmoid function outputs a value between 0 and 1.
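As a quick aside (a minimal sketch, not part of the class below), the sigmoid is a one-liner:

    import math

    def sigmoid(x):
        # squashes any real-valued input into the open interval (0, 1)
        return 1.0 / (1.0 + math.exp(-x))

    print(sigmoid(0))    # 0.5
    print(sigmoid(6))    # roughly 0.9975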

NLTK and Jieba are two Python natural language packages: I mainly use NLTK to analyze the data after segmentation, and Jieba to segment articles into words.

Part.1 Installing the packages

pip install nltk
pip install jieba

A long road to textual analysis begins!

Part.2 How to use

From here on, I will briefly explain the two packages through the class I wrote.

2.1 Imports

# coding=utf-8
import jieba
import jieba.posseg as pseg
import codecs
import re
import os
import time
import string
from nltk.probability import FreqDist

open = codecs.open   # alias codecs.open so files below are opened with an explicit encoding

2.2 Loading a custom dictionary and stop words

The custom dictionary is imported so that, during segmentation, phrases we want to keep whole are not cut into smaller words. The stop-word list is used while processing the segmented words to throw out words that would only interfere with the analysis.

# jieba can load a custom dictionary; each line has the format: word [frequency] [part of speech]
jieba.load_userdict('data/userdict.txt')
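For illustration only, a userdict.txt might contain lines like the following (the words, frequencies, and tags here are made up; frequency and part of speech can be omitted):

    自然语言处理 5 n
    词频统计 3 n
    凯特琳 nz

The stop-word file read below is plain text as well, with one or more stop words per line.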

# Define a Keyword class
class Keyword(object):

    def chinese_stopwords(self):       # load the stop-word list
        stopword = []
        cfp = open('data/stopword.txt', 'r+', 'utf-8')   # txt file of stop words
        for line in cfp:
            for word in line.split():
                stopword.append(word)
        cfp.close()
        return stopword

2.3 Segmentation and word selection

First, segmentation. Segmentation can be called the first step of Chinese text analysis. Unlike English, Chinese has no spaces between words, so we first have to cut the article up, that is, chop the long text into individual words, and then work on those words.

    def word_cut_list(self, word_str):
        # Use regular expressions to remove whitespace and symbols such as punctuation.
        word_str = re.sub(r'\s+', ' ', word_str)    # collapse whitespace runs to a single space
        word_str = re.sub(r'\n+', ' ', word_str)    # newlines to space
        word_str = re.sub(r'\t+', ' ', word_str)    # tabs to space
        word_str = re.sub(u'[\s+\.\!\/_,$%^*(+\"\']+|[+——;!,”。《》,。:“?、~@#¥%……&*()1234567①②③④)]+',
                          u'', word_str)

        wordlist = list(jieba.cut(word_str))   # jieba.cut segments the string; turn the generator into a list
        wordlist_n = []
        chinese_stopwords = self.chinese_stopwords()
        for word in wordlist:
            if word not in chinese_stopwords:                # cleaning: drop stop words
                if word != '\r\n' and word != ' ' \
                        and word != u'\u3000' and word != u'\xa0':   # cleaning: drop full-width and non-breaking spaces
                    wordlist_n.append(word)
        return wordlist_n
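A rough usage sketch (it assumes data/stopword.txt exists; the input file name is hypothetical): jieba's default accurate mode splits a sentence into words, and word_cut_list additionally drops stop words and stray spaces.

    kw = Keyword('data/all/sample.txt')               # hypothetical input file
    print(list(jieba.cut(u'我来到北京清华大学')))        # [我, 来到, 北京, 清华大学]
    print(kw.word_cut_list(u'我来到北京清华大学'))       # the same words minus any stop words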

What do I mean by selection? When we analyze Chinese text, not every word is useful. Which kinds of words actually express the meaning of an article? Nouns, for example. So how do we extract the nouns?

    def word_pseg(self, word_str):     # noun extraction
        words = pseg.cut(word_str)
        word_list = []
        for wds in words:
            # Keep custom-dictionary words and the various noun tags. A custom-dictionary
            # word defaults to the 'x' flag when no part of speech was given, i.e. wds.flag is 'x'.
            if wds.flag == 'x' and wds.word != ' ' and wds.word != 'ns' \
                    or re.match(r'^n', wds.flag) is not None \
                    and re.match(r'^nr', wds.flag) is None:
                word_list.append(wds.word)
        return word_list
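A quick sketch of what jieba.posseg yields, which is what the flag checks above rely on; the tags follow jieba's part-of-speech set (for example ns for place names, nr for person names), and the output is shown as comments:

    for wds in pseg.cut(u'我爱北京天安门'):
        print(wds.word + ' ' + wds.flag)
    # 我 r
    # 爱 v
    # 北京 ns
    # 天安门 ns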

2.4 Sorting and running

We have now segmented the text and picked out the words we care about; how do we analyze the resulting list? The simplest thing is to count word frequencies, and once we have frequencies we naturally think of sorting them.

    def sort_item(self, item):        # sort (word, count) pairs by count, descending
        vocab = []
        for k, v in item:
            vocab.append((k, v))
        return list(sorted(vocab, key=lambda v: v[1], reverse=True))

    def run(self):
        apage = open(self.filename, 'r+', 'utf-8')
        word = apage.read()                       # read the whole article first
        wordp = self.word_pseg(word)              # select words by part of speech
        new_str = ''.join(wordp)
        wordlist = self.word_cut_list(new_str)    # segment the selected text
        apage.close()
        return wordlist

    def __init__(self, filename):
        self.filename = filename
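Before moving on to the main function, here is a minimal sketch (made-up word list, not from the original post) of how nltk's FreqDist produces the (word, count) pairs that sort_item consumes; ties may come out in any order:

    fdist = FreqDist([u'研究', u'方法', u'研究', u'数据'])
    print(fdist[u'研究'])                                       # 2
    print(sorted(fdist.items(), key=lambda v: v[1], reverse=True))
    # [(u'研究', 2), (u'方法', 1), (u'数据', 1)] -- the same ordering sort_item returns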

2.5 Main function: reading and analysis

After all of the above we still have not used the nltk package; that starts here. I mainly analyze the files in a folder and count the keywords of each file, that is, a simple word-frequency count, sorted and written out. My initial idea was to output only the top 10 keywords, but after many experiments I felt a fixed number of keywords biases the analysis, so I output keywords by cumulative percentage instead.
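To make the percentage cutoff concrete before the real main function below, here is a toy sketch with made-up counts: keywords are written out in frequency order until their cumulative share of all tokens passes 50%.

    counts = [(u'研究', 4), (u'方法', 3), (u'数据', 2), (u'其他', 1)]   # 10 tokens in total
    total = float(sum(n for _, n in counts))
    pre = 0.0
    for word, n in counts:
        pre += n / total
        print(word + ' ' + str(n / total))
        if pre > 0.5:        # the cumulative share has passed 50%, so stop
            break
    # prints 研究 0.4 and 方法 0.3, then stops at a cumulative share of 0.7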

if __name__ == '__main__':
    b_path = 'data/all'
    a_path = 'data/result'
    roots = os.listdir(b_path)
    alltime_s = time.time()
    for filename in roots:
        starttime = time.time()
        kw = Keyword(b_path + '/' + filename)
        wl = kw.run()
        fdist = FreqDist(wl)
        total = len(wl)
        pre = 0
        fn = open(a_path + '/' + filename, 'w+', 'utf-8')
        fn.write('sum: ' + str(total) + '\r\n')
        for (s, n) in kw.sort_item(fdist.items()):
            fn.write(s + '    ' + str(float(n) / total) + '    ' + str(n) + '\r\n')
            pre = pre + float(n) / total
            if pre > 0.5:                # stop once the cumulative share of tokens passes 50%
                fn.write(str(pre))
                fn.close()
                break
        endtime = time.time()
        print filename + '        finish time: ' + str(endtime - starttime)

Print "Total elapsed:" + str (time.time ()-alltime_s)

