First, the features
1. Supports three segmentation modes:
(1) Accurate mode: tries to cut the sentence into the most precise segmentation; suitable for text analysis.
(2) Full mode: scans out all the words that can be formed from the sentence; very fast, but it cannot resolve ambiguity.
(3) Search engine mode: on the basis of accurate mode, long words are segmented again to improve recall; suitable for search engine segmentation.
2. Supports segmentation of traditional Chinese (see the sketch after this list)
3. Supports custom dictionaries
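For point 2, traditional Chinese is handled through the dictionary: the jieba project also distributes a larger dictionary file (dict.txt.big) that covers traditional characters. A minimal sketch, assuming that file has been downloaded into the working directory:

# -*- coding: utf-8 -*-
# Minimal sketch: segmenting traditional Chinese text.
# Assumes dict.txt.big has been downloaded from the jieba project
# and placed in the current directory.
import jieba

jieba.set_dictionary("./dict.txt.big")   # switch to the larger dictionary
word_list = jieba.cut(u"我們今天去遠足吧")  # a traditional Chinese sentence
print "|".join(word_list)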
Second, the implementation
The implementation of jieba segmentation rests on three main points:
(1) Efficient word-graph scanning is achieved with a trie structure, generating a directed acyclic graph (DAG) of all the possible word formations of the Chinese characters in the sentence.
(2) Dynamic programming is used to find the maximum-probability path, i.e. the best segmentation combination based on word frequency (a toy sketch follows this list).
(3) For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters is used, decoded with the Viterbi algorithm.
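To make points (1) and (2) concrete, here is a toy sketch of the DAG plus dynamic-programming idea. It is not jieba's actual code; the dictionary, frequencies, and sentence are invented purely for illustration:

# -*- coding: utf-8 -*-
# Toy illustration (NOT jieba's real implementation): the dictionary,
# frequencies and sentence below are made up just to show the idea.
import math

FREQ = {"abc": 50, "ab": 30, "a": 20, "b": 10, "c": 15, "bc": 5}
TOTAL = float(sum(FREQ.values()))

def build_dag(sentence):
    # dag[i] lists every j such that sentence[i:j+1] is a dictionary word
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # fall back to the single character
    return dag

def best_segmentation(sentence, dag):
    # route[i] = (best log-probability of segmenting sentence[i:], end of first word)
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i])
    # follow the recorded split points to recover the words
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

sentence = "abc"
print best_segmentation(sentence, build_dag(sentence))  # -> ['abc']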
Third, the application
Let's demonstrate the main functions of jieba segmentation.
1. Segmentation
# -*- coding: utf-8 -*-
import jieba

'''
The cut method takes two parameters:
1) the first parameter is the string to segment
2) the second parameter cut_all controls whether full mode is used
'''

# Full mode
word_list = jieba.cut("It's a nice day today. Honey, let's go hiking!", cut_all=True)
print "Full mode:", "|".join(word_list)

# Accurate mode (the default)
word_list = jieba.cut("It's a nice day today. Honey, let's go hiking!", cut_all=False)
print "Accurate mode:", "|".join(word_list)

# Search engine mode
word_list = jieba.cut_for_search("It's a nice day today. Honey, let's go hiking!")
print "Search engine mode:", "|".join(word_list)
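Note that jieba.cut (and cut_for_search) returns a generator rather than a list. If a plain list is more convenient, wrap it in list(); newer versions also provide jieba.lcut. A minimal sketch:

# -*- coding: utf-8 -*-
# jieba.cut yields words lazily; wrap it in list() (or use jieba.lcut in
# newer versions) when a real list is needed.
import jieba

words = list(jieba.cut("It's a nice day today."))
print words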
2. Add a custom dictionary
Although jieba can recognize new words on its own, adding them explicitly guarantees higher accuracy.
Developers can add a custom dictionary as needed, to include words that are not in the jieba lexicon.
Example: Little Red, shall we go to the place where we usually go hiking today? Why don't we change places! How about Garden Park? No problem. Small bean sprouts
Custom dictionary (cu.txt):
Garden Park 5
Small bean sprouts 3 nr
One word per line; each line consists of three parts: the word, the word frequency, and the part of speech (which can be omitted), separated by spaces.
# -*- coding: utf-8 -*-
import jieba

jieba.load_userdict("./cu.txt")
word_list = jieba.cut("Little Red, shall we go to the place where we usually go hiking today? Why don't we change places! How about Garden Park? No problem. Small bean sprouts")
print "|".join(word_list)
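Besides loading a dictionary file, words can also be added programmatically. A minimal sketch, assuming a reasonably recent jieba version that provides jieba.add_word:

# -*- coding: utf-8 -*-
# Adding custom words at runtime instead of (or in addition to) a file.
import jieba

jieba.add_word("Garden Park", freq=5)
jieba.add_word("Small bean sprouts", freq=3, tag="nr")
word_list = jieba.cut("How about Garden Park? No problem. Small bean sprouts")
print "|".join(word_list)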
3. Keyword Extraction
1) The first parameter (sentence) is the text from which keywords are extracted.
2) The second parameter topK specifies how many of the keywords with the largest TF-IDF weights to return; the default is 20.
# -*- coding: utf-8 -*-
import jieba.analyse as al

content = open("./topk.txt", "rb").read()
word_topk = al.extract_tags(content, topK=4)
print "|".join(word_topk)
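Newer versions of jieba.analyse also accept a withWeight flag, which returns the TF-IDF weight together with each keyword. A small sketch under that assumption:

# -*- coding: utf-8 -*-
# Keyword extraction that also prints the TF-IDF weight of each keyword.
# Assumes a jieba version whose extract_tags supports withWeight.
import jieba.analyse as al

content = open("./topk.txt", "rb").read()
for word, weight in al.extract_tags(content, topK=4, withWeight=True):
    print word, weight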
4. POS Tagging
The part of speech of each word after segmentation is tagged using an ICTCLAS-compatible notation.
# -*- coding: utf-8 -*-
import jieba.posseg as pseg

words = pseg.cut("Qingdao Beijing is a nice place")
for word in words:
    print word.word, word.flag
Output:
Qingdao ns
Beijing ns
is v
nice a
the uj
place n
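For reference, in this ICTCLAS-compatible tag set ns marks a place name, v a verb, a an adjective, uj a structural particle, and n a common noun.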
5. Parallel segmentation (runs only on Linux systems)
The text to be segmented is split by line; the lines are distributed to multiple Python processes for segmentation, and the results are then merged, which speeds up segmentation.
It is based on Python's multiprocessing module and does not currently support Windows.
# -*- coding: utf-8 -*-
import jieba

# Enable parallel segmentation; the parameter is the number of worker processes
jieba.enable_parallel(2)
# Disable parallel segmentation
# jieba.disable_parallel()

content = open("./topk.txt", "rb").read()
words = jieba.cut(content)
print "|".join(words)
6. Module initialization mechanism change: lazy loading (since version 0.28)
Download the dictionary you need, then either overwrite jieba/dict.txt with it or point jieba to it with jieba.set_dictionary().
# -*- coding: utf-8 -*-
import jieba

jieba.set_dictionary("./dict.txt")
content = open("./content.txt", "rb").read()
words = jieba.cut(content)
print "|".join(words)
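Because of lazy loading, the dictionary is only built on the first call that needs it. If you prefer to pay that cost up front (for example before serving requests), jieba exposes a manual initializer. A minimal sketch:

# -*- coding: utf-8 -*-
# Force the (lazily loaded) dictionary to be built immediately instead of
# on the first call to cut().
import jieba

jieba.initialize()
words = jieba.cut("It's a nice day today.")
print "|".join(words)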
7. Tokenize: return the words together with their positions in the original text
1) The first parameter is the text content.
2) The second parameter mode can be set to "search" for search engine mode; if it is not given, the default mode is used.
# -*- coding: utf-8 -*-
import jieba

result = jieba.tokenize(u"It's a nice day today. Honey, let's go hiking!")
for token in result:
    print "word %s\t\t start: %d \t\t end: %d" % (token[0], token[1], token[2])
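For the search engine mode mentioned in 2), the mode parameter is passed explicitly; a short sketch using the same example sentence:

# -*- coding: utf-8 -*-
# Tokenize in search engine mode: long words are additionally split,
# and every token still carries its start/end offsets.
import jieba

result = jieba.tokenize(u"It's a nice day today. Honey, let's go hiking!", mode="search")
for token in result:
    print "word %s\t\t start: %d \t\t end: %d" % (token[0], token[1], token[2])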