Jieba
"Stuttering" Chinese participle: do the best Python Chinese sub-phrase pieces. : Https://github.com/fxsjy/jieba
Features
Three segmentation modes are supported:
- Accurate mode: tries to cut the sentence into the most precise segmentation, suitable for text analysis;
- Full mode: scans out all the words in the sentence that can form a word; very fast, but it cannot resolve ambiguity;
- Search engine mode: on top of accurate mode, long words are segmented again to improve recall, suitable for search engine segmentation.
Supports Traditional Chinese segmentation
Supports custom dictionaries
MIT License
Installation Instructions
The code is compatible with Python 2/3
- Fully automatic installation:
easy_install jieba
or pip install jieba / pip3 install jieba
- Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ first, extract it, then run
python setup.py install
- Manual installation: place the jieba directory in the current directory or in the site-packages directory
- Import with
import jieba
Algorithm
- Efficient word-graph scanning based on a prefix dictionary generates a directed acyclic graph (DAG) of all possible word combinations of the Chinese characters in a sentence
- Dynamic programming is used to find the maximum-probability path, yielding the segmentation with the highest combined word frequency
- For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters is used, decoded with the Viterbi algorithm
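As an illustration, the DAG built from the prefix dictionary can be inspected through the default tokenizer. This is a minimal sketch that assumes jieba.dt.get_DAG, an internal method of jieba.Tokenizer rather than a stable public API:

import jieba

jieba.initialize()                 # build the prefix dictionary
sentence = u"我来到北京清华大学"
dag = jieba.dt.get_DAG(sentence)   # maps each start index to the possible word end indexes
for start, ends in dag.items():
    # each (start, end) pair is a candidate word sentence[start:end+1]
    print(start, [sentence[start:end + 1] for end in ends])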
Key Features
- Word segmentation
jieba.cut
accepts three input parameters: the string to be segmented, the cut_all parameter controlling whether full mode is used, and the HMM parameter controlling whether the HMM model is used.
jieba.cut_for_search
accepts two parameters: the string to be segmented and whether to use the HMM model. This method is suitable for segmentation when building an inverted index for a search engine; the granularity is relatively fine.
- The string to be segmented can be a unicode/UTF-8 string or a GBK string. Note: passing in GBK strings directly is not recommended, as they may be incorrectly decoded as UTF-8.
jieba.cut
and jieba.cut_for_search
return a generator; you can use a for loop to obtain each word (unicode) produced by the segmentation, or use
jieba.lcut
and jieba.lcut_for_search
to return a list directly.
-
jieba.Tokenizer(dictionary=DEFAULT_DICT)
creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt
is the default tokenizer; all global segmentation functions are mappings of this tokenizer.
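To make the list-returning variants and custom tokenizers concrete, here is a minimal sketch (the sample sentences are illustrative):

import jieba

# lcut / lcut_for_search return a plain list instead of a generator
print(jieba.lcut("我来到北京清华大学"))
print(jieba.lcut_for_search("小明硕士毕业于中国科学院计算所"))

# an independent tokenizer alongside the default jieba.dt;
# by default it uses the same main dictionary
tk = jieba.Tokenizer()
print(tk.lcut("我来到北京清华大学"))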
code example
# encoding=utf-8
import jieba

# "I came to Beijing Tsinghua University"
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

# "He came to the NetEase Hangyan Building"
seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))

# "Xiao Ming graduated with a Master's degree from the Institute of Computing,
#  Chinese Academy of Sciences, and later studied at Kyoto University in Japan"
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))
Output:
"Full mode": I/Come/BEIJING/Tsinghua/Tsinghua/Huada/University "precise mode": I/Come/Beijing/Tsinghua University "new word recognition": He, came,, NetEase, Hang, building (here, "hang research" is not in the dictionary, but also by the Viterbi algorithm identified "Search engine mode": Xiao Ming, MA, graduated from, China, Science, College, Academy of Sciences, Chinese Academy of Sciences, calculation, calculation, after, in, Japan, Kyoto, University, Kyoto University, Japan, Advanced education
- Add a custom dictionary
Loading dictionaries
- Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba can recognize new words, adding them yourself guarantees a higher accuracy rate.
- Usage: jieba.load_userdict(file_name)  # file_name is the path of a custom dictionary or a file-like object
- The dictionary format is the same as that of
dict.txt
: one word per line; each line has three parts: the word, the word frequency (may be omitted), and the POS tag (may be omitted), separated by spaces, and the order must not be reversed. If file_name
is given as a path, or as a file opened in binary mode, the file must be UTF-8 encoded.
- When the word frequency is omitted, an automatically calculated frequency that guarantees the word can be segmented out is used.
For example:
创新办 3 i
云计算 5
凱特琳 nz
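A minimal sketch of loading such a dictionary (userdict.txt is a hypothetical file containing the entries above; the sample sentence is illustrative):

import jieba

jieba.load_userdict("userdict.txt")   # load the custom dictionary
print(jieba.lcut("李小福是创新办主任也是云计算方面的专家"))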
Adjust Dictionaries
Use add_word(word, freq=None, tag=None)
and del_word(word)
to modify the dictionary dynamically in your program.
Use suggest_freq(segment, tune=True)
to adjust the word frequency of a single word so that it can (or cannot) be segmented out.
Note: the automatically calculated word frequency may not take effect when the HMM new-word discovery feature is used.
code example:
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
- "Enhanced ambiguity correction capability through user-defined dictionaries"---HTTPS://GITHUB.COM/FXSJY/JIEBA/ISSUES/14
- Keyword extraction
Keyword extraction based on TF-IDF algorithm
import jieba.analyse
- jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
- sentence: the text to extract keywords from
- topK: return the keywords with the topK highest TF-IDF weights; the default value is 20
- withWeight: whether to also return the keyword weight values; the default value is False
- allowPOS: only include words with the specified POS tags; the default value is empty, i.e. no filtering
- jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the IDF frequency file
code example (keyword extraction)
https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
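For a quick inline illustration of extract_tags (a minimal sketch, not taken from the linked script; the sample text is arbitrary):

import jieba.analyse

text = "我来到北京清华大学，清华大学是著名的高等学府"   # arbitrary sample text
for word, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, weight)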
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path of the custom corpus
- Example of a custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
- Usage Example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
The stop-word corpus used for keyword extraction can be switched to a custom corpus path
- Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path of the custom corpus
- Example of a custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
- Usage Example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
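A minimal sketch combining both switches (it assumes idf.txt.big and stop_words.txt, the files linked above, have been downloaded to the working directory; the sample text is arbitrary):

import jieba.analyse

jieba.analyse.set_idf_path("idf.txt.big")          # custom IDF corpus
jieba.analyse.set_stop_words("stop_words.txt")     # custom stop-word list
print(jieba.analyse.extract_tags("我来到北京清华大学，清华大学是著名的高等学府", topK=10))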
Example of returning keyword weight values together with the keywords
- Usage Example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
Keyword extraction based on the TextRank algorithm
- jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')): used directly; the interface is the same as extract_tags; note that POS filtering is applied by default.
- jieba.analyse.TextRank() creates a new custom TextRank instance
Algorithm paper: TextRank: Bringing Order into Texts
Basic idea:
- Segment the text from which keywords are to be extracted
- Build a graph from co-occurrences between words within a fixed window (default size 5, adjustable via the span parameter)
- Compute the PageRank of the nodes in the graph; note that it is an undirected weighted graph
Examples of Use:
See test/demo.py
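A minimal inline sketch of textrank (the sample text is arbitrary, not taken from demo.py):

import jieba.analyse

text = "线程是程序执行时的最小单位，它是进程的一个执行流"   # arbitrary sample text
print(jieba.analyse.textrank(text, topK=5, withWeight=False))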
- POS tagging
- jieba.posseg.POSTokenizer(tokenizer=None)
creates a new custom POS-tagging tokenizer; the tokenizer
parameter specifies the jieba.Tokenizer
used internally. jieba.posseg.dt
is the default POS-tagging tokenizer.
- Tags the part of speech of every word after sentence segmentation, using notation compatible with ictclas.
- Usage examples
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for word, flag in words:
...     print('%s %s' % (word, flag))
...
我 r
爱 v
北京 ns
天安门 ns
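A minimal sketch of a custom POS-tagging tokenizer built on an independent jieba.Tokenizer (userdict.txt is the hypothetical custom dictionary from the earlier example):

import jieba
import jieba.posseg as pseg

tk = jieba.Tokenizer()                 # an independent tokenizer
tk.load_userdict("userdict.txt")       # give it its own user dictionary
pos_tk = pseg.POSTokenizer(tokenizer=tk)
for word, flag in pos_tk.cut("我爱北京天安门"):
    print(word, flag)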
- Parallel segmentation
Principle: the target text is split by lines, the lines are distributed to multiple Python processes for parallel segmentation, and the results are then merged, yielding a considerable speedup.
Based on Python's multiprocessing module; Windows is currently not supported.
Usage:
jieba.enable_parallel(4)
# Enable parallel segmentation mode; the argument is the number of parallel processes
jieba.disable_parallel()
# Disable parallel segmentation mode
Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py
Experimental result: on a 4-core 3.4 GHz Linux machine, precise segmentation of the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.
Note: parallel segmentation only supports the default tokenizers jieba.dt
and jieba.posseg.dt
- Tokenize: return the start and end positions of each word in the original text
- Note that the input parameter only accepts unicode
- Default mode
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

word 永和                start: 0                end: 2
word 服装                start: 2                end: 4
word 饰品                start: 4                end: 6
word 有限公司            start: 6                end: 10
- Search mode
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

word 永和                start: 0                end: 2
word 服装                start: 2                end: 4
word 饰品                start: 4                end: 6
word 有限                start: 6                end: 8
word 公司                start: 8                end: 10
word 有限公司            start: 6                end: 10
- ChineseAnalyzer for the Whoosh search engine
- Import:
from jieba.analyse import ChineseAnalyzer
- Usage Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py
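A minimal sketch of plugging the analyzer into a Whoosh schema (it assumes the whoosh package is installed; the field names and index directory are illustrative, not taken from the linked test):

import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True, analyzer=analyzer))
if not os.path.exists("tmp_idx"):
    os.mkdir("tmp_idx")
ix = create_in("tmp_idx", schema)

writer = ix.writer()
writer.add_document(title=u"doc1", content=u"我来到北京清华大学")
writer.commit()

with ix.searcher() as searcher:
    results = searcher.search(QueryParser("content", schema=ix.schema).parse(u"清华大学"))
    for hit in results:
        print(hit["title"])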
- Command-line segmentation
Usage example: python -m jieba news.txt > cut_result.txt
Command-line options (translated):
Usage: python -m jieba [options] filename

Jieba command-line interface.

Positional arguments:
  filename              input file

Optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM to separate words instead of the default '/';
                        if DELIM is not specified, a space is used
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is specified, words and POS tags
                        are separated by it, otherwise by '_'
  -D DICT, --dict DICT  use DICT instead of the default dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT as an additional dictionary together with the
                        default dictionary or DICT
  -a, --cut-all         full-mode segmentation (POS tagging not supported)
  -n, --no-hmm          do not use the Hidden Markov Model
  -q, --quiet           do not output loading information to STDERR
  -V, --version         show version information and exit

If no file name is specified, standard input is used.
--help option output:
$> python -m jieba --help
Jieba command line interface.

positional arguments:
  filename              input file

optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM instead of ' / ' for word delimiter; or a
                        space if it is used without DELIM
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is specified, use DELIM
                        instead of '_' for POS delimiter
  -D DICT, --dict DICT  use DICT as dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT together with the default dictionary or
                        DICT (if specified)
  -a, --cut-all         full pattern cutting (ignored with POS tagging)
  -n, --no-hmm          don't use the Hidden Markov Model
  -q, --quiet           don't print loading messages to stderr
  -V, --version         show program's version number and exit

If no filename specified, use STDIN instead.
Deferred loading mechanism
jieba uses lazy loading: import jieba
and jieba.Tokenizer()
do not immediately trigger loading of the dictionary; the dictionary is loaded and the prefix dictionary built only once it is needed. If you want to initialize jieba manually, you can do so explicitly:
import jieba
jieba.initialize()  # manual initialization (optional)
Before version 0.28 it was not possible to specify the path of the main dictionary; with the delayed loading mechanism, you can now change the path of the main dictionary:
jieba.set_dictionary('data/dict.txt.big')
Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py
Other dictionaries
A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
A dictionary file with better support for Traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need, then either overwrite jieba/dict.txt or use jieba.set_dictionary('data/dict.txt.big')