Jieba
"Stuttering" Chinese participle: do the best Python Chinese sub-phrase pieces. : Https://github.com/fxsjy/jieba
Features
Three segmentation modes are supported:
- Accurate mode: tries to cut the sentence into the most precise segmentation, suitable for text analysis;
- Full mode: scans out all the words in the sentence that can form a word; very fast, but it cannot resolve ambiguity;
- Search engine mode: on top of accurate mode, long words are segmented again to improve recall, suitable for search engine segmentation.
Supports Traditional Chinese segmentation
Supports custom dictionaries
MIT License
Installation Instructions
The code is compatible with Python 2/3
- Fully automatic installation:
easy_install jieba
or pip install jieba / pip3 install jieba
- Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ first, extract it, then run
python setup.py install
- Manual installation: place the jieba directory in the current directory or in the site-packages directory
- Import with
import jieba
Algorithm
- Efficient word-graph scanning based on a prefix dictionary generates a directed acyclic graph (DAG) of all possible word combinations of the Chinese characters in a sentence
- Dynamic programming is used to find the maximum-probability path, yielding the segmentation with the highest combined word frequency
- For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters is used, decoded with the Viterbi algorithm
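As an illustration, the DAG built from the prefix dictionary can be inspected through the default tokenizer. This is a minimal sketch that assumes jieba.dt.get_DAG, an internal method of jieba.Tokenizer rather than a stable public API:

import jieba

jieba.initialize()                 # build the prefix dictionary
sentence = u"我来到北京清华大学"
dag = jieba.dt.get_DAG(sentence)   # maps each start index to the possible word end indexes
for start, ends in dag.items():
    # each (start, end) pair is a candidate word sentence[start:end+1]
    print(start, [sentence[start:end + 1] for end in ends])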
Key Features
- Word segmentation
jieba.cut
accepts three input parameters: the string to be segmented, the cut_all parameter controlling whether full mode is used, and the HMM parameter controlling whether the HMM model is used.
jieba.cut_for_search
accepts two parameters: the string to be segmented and whether to use the HMM model. This method is suitable for segmentation when building an inverted index for a search engine; the granularity is relatively fine.
- The string to be segmented can be a unicode/UTF-8 string or a GBK string. Note: passing in GBK strings directly is not recommended, as they may be incorrectly decoded as UTF-8.
jieba.cut
and jieba.cut_for_search
return a generator; you can use a for loop to obtain each word (unicode) produced by the segmentation, or use
jieba.lcut
and jieba.lcut_for_search
to return a list directly.
-
jieba.Tokenizer(dictionary=DEFAULT_DICT)
creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt
is the default tokenizer; all global segmentation functions are mappings of this tokenizer.
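To make the list-returning variants and custom tokenizers concrete, here is a minimal sketch (the sample sentences are illustrative):

import jieba

# lcut / lcut_for_search return a plain list instead of a generator
print(jieba.lcut("我来到北京清华大学"))
print(jieba.lcut_for_search("小明硕士毕业于中国科学院计算所"))

# an independent tokenizer alongside the default jieba.dt;
# by default it uses the same main dictionary
tk = jieba.Tokenizer()
print(tk.lcut("我来到北京清华大学"))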
code example
# encoding=utf-8
import jieba

# "I came to Beijing Tsinghua University"
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

# "He came to the NetEase Hangyan Building"
seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))

# "Xiao Ming graduated with a Master's degree from the Institute of Computing,
#  Chinese Academy of Sciences, and later studied at Kyoto University in Japan"
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))
Output:
"Full mode": I/Come/BEIJING/Tsinghua/Tsinghua/Huada/University "precise mode": I/Come/Beijing/Tsinghua University "new word recognition": He, came,, NetEase, Hang, building (here, "hang research" is not in the dictionary, but also by the Viterbi algorithm identified "Search engine mode": Xiao Ming, MA, graduated from, China, Science, College, Academy of Sciences, Chinese Academy of Sciences, calculation, calculation, after, in, Japan, Kyoto, University, Kyoto University, Japan, Advanced education
- Add a custom dictionary
Loading dictionaries
- Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba can recognize new words, adding them yourself guarantees a higher accuracy rate.
- Usage: jieba.load_userdict(file_name)  # file_name is the path of a custom dictionary or a file-like object
- The dictionary format is the same as that of
dict.txt
: one word per line; each line has three parts: the word, the word frequency (may be omitted), and the POS tag (may be omitted), separated by spaces, and the order must not be reversed. If file_name
is given as a path, or as a file opened in binary mode, the file must be UTF-8 encoded.
- When the word frequency is omitted, an automatically calculated frequency that guarantees the word can be segmented out is used.
For example:
创新办 3 i
云计算 5
凱特琳 nz
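A minimal sketch of loading such a dictionary (userdict.txt is a hypothetical file containing the entries above; the sample sentence is illustrative):

import jieba

jieba.load_userdict("userdict.txt")   # load the custom dictionary
print(jieba.lcut("李小福是创新办主任也是云计算方面的专家"))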
Adjust Dictionaries
Use add_word(word, freq=None, tag=None)
and del_word(word)
to modify the dictionary dynamically in your program.
Use suggest_freq(segment, tune=True)
to adjust the word frequency of a single word so that it can (or cannot) be segmented out.
Note: the automatically calculated word frequency may not take effect when the HMM new-word discovery feature is used.
code example:
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
- "Enhanced ambiguity correction capability through user-defined dictionaries"---HTTPS://GITHUB.COM/FXSJY/JIEBA/ISSUES/14
- Keyword extraction
Keyword extraction based on TF-IDF algorithm
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: return the keywords with the topK highest TF-IDF weights; the default value is 20
- withWeight: whether to also return the keyword weight values; the default value is False
- allowPOS: only include words with the specified POS tags; the default value is empty, i.e. no filtering
- jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the IDF frequency file
code example (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
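For a quick inline illustration of extract_tags (a minimal sketch, not taken from the linked script; the sample text is arbitrary):

import jieba.analyse

text = "我来到北京清华大学，清华大学是著名的高等学府"   # arbitrary sample text
for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, weight)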
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path of the custom corpus
- Example of a custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
- Usage Example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
The stop-word corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path of the custom corpus
- Example of a custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
- Usage Example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
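A minimal sketch combining both switches (it assumes idf.txt.big and stop_words.txt, the files linked above, have been downloaded to the working directory; the sample text is arbitrary):

import jieba.analyse

jieba.analyse.set_idf_path("idf.txt.big")          # custom IDF corpus
jieba.analyse.set_stop_words("stop_words.txt")     # custom stop-word list
print(jieba.analyse.extract_tags("我来到北京清华大学，清华大学是著名的高等学府", topK=10))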
Example of returning keyword weight values together with the keywords
- Usage Example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
Keyword extraction based on the TextRank algorithm
- jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')): used directly; the interface is the same as extract_tags; note that POS filtering is applied by default.
- jieba.analyse.TextRank() creates a new custom TextRank instance
Algorithm paper: TextRank: Bringing Order into Texts
Basic idea:
- Segment the text from which keywords are to be extracted
- Build a graph from co-occurrences between words within a fixed window (default size 5, adjustable via the span parameter)
- Compute the PageRank of the nodes in the graph; note that it is an undirected weighted graph
Examples of Use:
See test/demo.py
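A minimal inline sketch of textrank (the sample text is arbitrary, not taken from demo.py):

import jieba.analyse

text = "线程是程序执行时的最小单位，它是进程的一个执行流"   # arbitrary sample text
print(jieba.analyse.textrank(text, topK=5, withWeight=False))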
- POS tagging
- jieba.posseg.POSTokenizer(tokenizer=None)
creates a new custom POS-tagging tokenizer; the tokenizer
parameter specifies the jieba.Tokenizer
used internally. jieba.posseg.dt
is the default POS-tagging tokenizer.
- Tags the part of speech of every word after sentence segmentation, using notation compatible with ictclas.
- Usage examples
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for word, flag in words:
...     print('%s %s' % (word, flag))
...
我 r
爱 v
北京 ns
天安门 ns
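A minimal sketch of a custom POS-tagging tokenizer built on an independent jieba.Tokenizer (userdict.txt is the hypothetical custom dictionary from the earlier example):

import jieba
import jieba.posseg as pseg

tk = jieba.Tokenizer()                 # an independent tokenizer
tk.load_userdict("userdict.txt")       # give it its own user dictionary
pos_tk = pseg.POSTokenizer(tokenizer=tk)
for word, flag in pos_tk.cut("我爱北京天安门"):
    print(word, flag)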
- Parallel segmentation
Principle: the target text is split by lines, the lines are distributed to multiple Python processes for parallel segmentation, and the results are then merged, yielding a considerable speedup.
Based on Python's multiprocessing module; Windows is currently not supported.
Usage:
jieba.enable_parallel(4)
# Enable parallel segmentation mode; the argument is the number of parallel processes
jieba.disable_parallel()
# Disable parallel segmentation mode
Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
Experimental result: on a 4-core 3.4 GHz Linux machine, precise segmentation of the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.
Note: parallel segmentation only supports the default tokenizers jieba.dt
and jieba.posseg.dt
- Tokenize: return the start and end positions of each word in the original text
- Note that the input parameter only accepts unicode
- Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

word 永和                start: 0                end: 2
word 服装                start: 2                end: 4
word 饰品                start: 4                end: 6
word 有限公司            start: 6                end: 10
- Search mode
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

word 永和                start: 0                end: 2
word 服装                start: 2                end: 4
word 饰品                start: 4                end: 6
word 有限                start: 6                end: 8
word 公司                start: 8                end: 10
word 有限公司            start: 6                end: 10
- ChineseAnalyzer for the Whoosh search engine
- Import:
from jieba.analyse import ChineseAnalyzer
- Usage Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
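A minimal sketch of plugging the analyzer into a Whoosh schema (it assumes the whoosh package is installed; the field names and index directory are illustrative, not taken from the linked test):

import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True, analyzer=analyzer))
if not os.path.exists("tmp_idx"):
    os.mkdir("tmp_idx")
ix = create_in("tmp_idx", schema)

writer = ix.writer()
writer.add_document(title=u"doc1", content=u"我来到北京清华大学")
writer.commit()

with ix.searcher() as searcher:
    results = searcher.search(QueryParser("content", schema=ix.schema).parse(u"清华大学"))
    for hit in results:
        print(hit["title"])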
- Command-line segmentation
Usage example: python -m jieba news.txt > cut_result.txt
Command-line options (translated):
Usage: python -m jieba [options] filename

Jieba command-line interface.

Positional arguments:
  filename              input file

Optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM to separate words instead of the default '/';
                        if DELIM is not specified, a space is used
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is specified, words and POS tags
                        are separated by it, otherwise by '_'
  -D DICT, --dict DICT  use DICT instead of the default dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT as an additional dictionary together with the
                        default dictionary or DICT
  -a, --cut-all         full-mode segmentation (POS tagging not supported)
  -n, --no-hmm          do not use the Hidden Markov Model
  -q, --quiet           do not output loading information to STDERR
  -V, --version         show version information and exit

If no file name is specified, standard input is used.
--help option output:
$> python -m jieba --help
Jieba command line interface.

positional arguments:
  filename              input file

optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM instead of ' / ' for word delimiter; or a
                        space if it is used without DELIM
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is specified, use DELIM
                        instead of '_' for POS delimiter
  -D DICT, --dict DICT  use DICT as dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT together with the default dictionary or
                        DICT (if specified)
  -a, --cut-all         full pattern cutting (ignored with POS tagging)
  -n, --no-hmm          don't use the Hidden Markov Model
  -q, --quiet           don't print loading messages to stderr
  -V, --version         show program's version number and exit

If no filename specified, use STDIN instead.
Deferred loading mechanism
jieba uses lazy loading: import jieba
and jieba.Tokenizer()
do not immediately trigger loading of the dictionary; the dictionary is loaded and the prefix dictionary built only once it is needed. If you want to initialize jieba manually, you can do so explicitly:
import jieba
jieba.initialize()  # manual initialization (optional)
Before version 0.28 it was not possible to specify the path of the main dictionary; with the delayed loading mechanism, you can now change the path of the main dictionary:
jieba.set_dictionary('data/dict.txt.big')
Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
Other dictionaries
A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
A dictionary file with better support for Traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need, then either overwrite jieba/dict.txt or use jieba.set_dictionary('data/dict.txt.big')