Python Jieba ("stutter") word segmentation


First, Features

1. Three segmentation modes are supported:
(1) Precise mode: cuts the sentence as precisely as possible; suitable for text analysis.
(2) Full mode: scans out all the words that can appear in the sentence; very fast, but it cannot resolve ambiguity.
(3) Search engine mode: on top of precise mode, long words are segmented again to improve recall; suitable for search engine indexing.
2. Traditional Chinese text is supported.
3. Custom dictionaries are supported.

Second, Implementation

Jieba's segmentation rests on three main points (a toy sketch of points (1) and (2) follows the list):
(1) Efficient word-graph scanning based on a trie (prefix tree), generating a directed acyclic graph (DAG) of all the possible words that the Chinese characters in the sentence can form.
(2) Dynamic programming to find the maximum-probability path through the DAG, i.e. the best segmentation according to word frequencies.
(3) For unknown words (words not in the dictionary), an HMM model of the characters' word-forming ability, decoded with the Viterbi algorithm.
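
To make that concrete, here is a minimal, self-contained sketch of points (1) and (2): build a DAG of candidate words over a sentence, then pick the maximum-probability path by dynamic programming. The dictionary, frequencies, and sentence are invented for illustration, and this is not jieba's actual source code; the HMM step for unknown words is omitted.

# -*- coding: utf-8 -*-
import math

# A tiny invented frequency dictionary (jieba ships a large dict.txt).
FREQ = {u"今天": 50, u"天气": 40, u"今": 5, u"天": 10,
        u"气": 5, u"真": 8, u"好": 20, u"真好": 6}
TOTAL = sum(FREQ.values())

def get_dag(sentence):
    """For each position k, list every j such that sentence[k:j+1] is a word."""
    dag = {}
    for k in range(len(sentence)):
        ends = [k]  # a single character is always a fallback "word"
        for j in range(k + 1, len(sentence)):
            if sentence[k:j + 1] in FREQ:
                ends.append(j)
        dag[k] = ends
    return dag

def best_path(sentence, dag):
    """Dynamic programming from right to left over log probabilities."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    logtotal = math.log(TOTAL)
    for k in range(n - 1, -1, -1):
        # route[k] = best (log-prob, end index) for segmenting sentence[k:]
        route[k] = max(
            (math.log(FREQ.get(sentence[k:j + 1], 1)) - logtotal + route[j + 1][0], j)
            for j in dag[k])
    k = 0
    while k < n:  # walk the winning path forwards, emitting words
        j = route[k][1]
        yield sentence[k:j + 1]
        k = j + 1

sentence = u"今天天气真好"
print("/".join(best_path(sentence, get_dag(sentence))))  # 今天/天气/真好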

Third, Applications

Let's walk through Jieba's main functions.

1. Segmentation
# -*- coding: utf-8 -*-
import jieba

# cut() takes two arguments:
# 1) the string to be segmented
# 2) cut_all, which controls whether full mode is used
# (jieba targets Chinese text; on English input it mostly splits on spaces and punctuation)

# Full mode
word_list = jieba.cut("The weather is really good today. Honey, let's go hiking!", cut_all=True)
print("Full mode: " + "|".join(word_list))

# Precise mode (the default)
word_list = jieba.cut("The weather is really good today. Honey, let's go hiking!", cut_all=False)
print("Precise mode: " + "|".join(word_list))

# Search engine mode
word_list = jieba.cut_for_search("The weather is really good today. Honey, let's go hiking!")
print("Search engine mode: " + "|".join(word_list))
2. Add a custom dictionary

Although Jieba can recognize new words on its own, adding them explicitly guarantees higher accuracy.
Developers can load a custom dictionary to cover words that are not in Jieba's built-in dictionary.
Example sentence: "Little Red, shall we go hiking where we used to go? Why don't we try somewhere else! How about Garden Park? No problem. Small bean sprouts"

Custom dictionary (cu.txt):
Garden Park 5
Small bean sprouts 3 nr

One word per line; each line has three parts separated by spaces: the word, its frequency, and its part of speech (the last may be omitted).

# -*- coding: utf-8 -*-
import jieba

jieba.load_userdict("./cu.txt")
word_list = jieba.cut("Little Red, shall we go hiking where we used to go? Why don't we try somewhere else! How about Garden Park? No problem. Small bean sprouts")
print("|".join(word_list))
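
Newer versions of jieba also expose jieba.add_word() and jieba.del_word() for adjusting the dictionary at runtime instead of (or in addition to) a dictionary file; this is the library's own API, though the original text does not cover it. A minimal sketch mirroring the cu.txt entries above:

# -*- coding: utf-8 -*-
import jieba

# Same effect as the cu.txt lines, but done at runtime;
# freq and tag are optional, as in the dictionary file.
jieba.add_word("Garden Park", freq=5)
jieba.add_word("Small bean sprouts", freq=3, tag="nr")
# jieba.del_word("Garden Park")  # words can be removed again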
3. Keyword Extraction

1) The first parameter (sentence) is the text from which keywords are extracted.
2) topK is how many of the keywords with the largest TF-IDF weights to return; it defaults to 20 and can be specified.

# -*- coding: utf-8 -*-
import jieba.analyse

content = open("./topk.txt", "rb").read()
word_topk = jieba.analyse.extract_tags(content, topK=4)
print("|".join(word_topk))
4. POS Tagging

Each word of the segmented sentence is tagged with its part of speech, using ICTCLAS-compatible notation.

# -*- coding: utf-8 -*-
import jieba.posseg as pseg

# "青岛北京是美丽的地方" -- "Qingdao and Beijing are beautiful places"
# (the Chinese sentence is restored here; POS tags such as ns/uj below only arise from Chinese input)
words = pseg.cut(u"青岛北京是美丽的地方")
for word in words:
    print("%s %s" % (word.word, word.flag))

Output:
青岛 ns
北京 ns
是 v
美丽 a
的 uj
地方 n

5. Parallel segmentation (can only be run on Linux systems)

The text to be segmented is split by line; the lines are distributed across multiple Python processes and the results are merged afterwards, which speeds up segmentation considerably.
Parallel mode is based on Python's multiprocessing module and currently does not support Windows.

# -*- coding: utf-8 -*-
import jieba

# Enable parallel segmentation; the argument is the number of worker processes
jieba.enable_parallel(2)
# To turn it off again:
# jieba.disable_parallel()

content = open("./topk.txt", "rb").read()
words = jieba.cut(content)
print("|".join(words))
6. Module initialization change: lazy loading (starting from version 0.28)

Download the dictionary you need, then either overwrite jieba/dict.txt with it or load it via jieba.set_dictionary(), passing the file path.

# -*- coding: utf-8 -*-
import jieba

jieba.set_dictionary("./dict.txt")
content = open("./content.txt", "rb").read()
words = jieba.cut(content)
print("|".join(words))
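
Because loading is lazy, the first call to jieba.cut pays the dictionary-loading cost. If you would rather pay it up front (for example at server startup), jieba provides jieba.initialize() for eager manual initialization; a minimal sketch:

# -*- coding: utf-8 -*-
import jieba

jieba.set_dictionary("./dict.txt")  # only records the path; nothing is loaded yet
jieba.initialize()                  # force the dictionary to be built now, not on first cut()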
7. Tokenize: return words together with their positions in the original text

1) The first parameter is the text content.
2) The second parameter, mode, can be set to "search" for search engine mode; if it is omitted, the default mode is used.

# -*- coding: utf-8 -*-
import jieba

result = jieba.tokenize(u"It's a nice day today. Honey, let's go hiking!")
for token in result:
    print("word %s\t\t start: %d\t\t end: %d" % (token[0], token[1], token[2]))
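
For the "search" mode mentioned above, pass mode="search": long words are then additionally split into shorter ones, each reported with its own offsets. A sketch under the same assumptions as the previous example:

# -*- coding: utf-8 -*-
import jieba

result = jieba.tokenize(u"It's a nice day today. Honey, let's go hiking!", mode="search")
for token in result:
    print("word %s\t\t start: %d\t\t end: %d" % (token[0], token[1], token[2]))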
