Python third-party library Jieba ("stuttering" Chinese word segmentation): Getting Started and Advanced (official documentation)


Jieba

"Stuttering" Chinese participle: do the best Python Chinese sub-phrase pieces. : Https://github.com/fxsjy/jieba

Features
    • Three segmentation modes are supported:

      • Accurate mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis;
      • Full mode: scans out all the words in the sentence that can form words; very fast, but it cannot resolve ambiguity;
      • Search engine mode: based on accurate mode, long words are segmented again to improve recall; suitable for search engine segmentation.
    • Traditional Chinese segmentation is supported

    • Custom dictionaries are supported

    • MIT License

Installation Instructions

The code is compatible with Python 2 and Python 3.

    • Fully automatic installation: easy_install jieba, or pip install jieba / pip3 install jieba
    • Semi-automatic installation: download from http://pypi.python.org/pypi/jieba/ first, decompress, then run python setup.py install
    • Manual installation: place the jieba directory in the current directory or in the site-packages directory
    • Import with import jieba
Algorithm
    • Efficient word-graph scanning based on a prefix dictionary builds a directed acyclic graph (DAG) of all possible word combinations of the Chinese characters in a sentence (a toy illustration follows below)
    • Dynamic programming is used to find the maximum-probability path, which gives the most likely segmentation based on word frequency
    • For unknown words, an HMM model based on the word-forming capability of Chinese characters is used, solved with the Viterbi algorithm
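To make the first step concrete, here is a toy illustration (not jieba's internal code) of how a prefix dictionary yields a DAG of word candidates; the vocabulary and its frequencies below are invented for the example.

# Illustrative sketch only, not jieba's implementation: a toy prefix dictionary
# and the DAG of candidate words it produces for one sentence.
FREQ = {"清": 10, "清华": 50, "清华大学": 80, "华": 5, "大": 8, "大学": 60, "学": 7}

def build_dag(sentence):
    """Map each start index to every end index where sentence[start:end+1] is a known word."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]  # fall back to the single character to keep the graph connected
    return dag

print(build_dag("清华大学"))
# {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}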
Key Features
    1. Word segmentation
  • The jieba.cut method accepts three parameters: the string to be segmented; the cut_all parameter, which controls whether full mode is used; and the HMM parameter, which controls whether the HMM model is used.
  • The jieba.cut_for_search method accepts two parameters: the string to be segmented, and whether to use the HMM model. This method is suitable for building the inverted index of a search engine, and its granularity is relatively fine.
  • The string to be segmented can be a unicode/UTF-8 string or a GBK string. Note: passing GBK strings directly is not recommended, because they may be wrongly decoded as UTF-8.
  • jieba.cut and jieba.cut_for_search return a generator; use a for loop to obtain each word (unicode) produced by the segmentation, or
  • jieba.lcut and jieba.lcut_for_search return a list directly
  • jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer; all global segmentation functions are mappings of this tokenizer. (A short sketch of lcut and Tokenizer follows the output below.)

Code example

# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

  

Output:

"Full mode": I/Come/BEIJING/Tsinghua/Tsinghua/Huada/University "precise mode": I/Come/Beijing/Tsinghua University "new word recognition": He, came,, NetEase, Hang, building    (here, "hang research" is not in the dictionary, but also by the Viterbi algorithm identified "Search engine mode": Xiao Ming, MA, graduated from, China, Science, College, Academy of Sciences, Chinese Academy of Sciences, calculation, calculation, after, in, Japan, Kyoto, University, Kyoto University, Japan, Advanced education

  

    2. Add a custom dictionary
Loading a dictionary
    • Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba can recognize new words on its own, adding them yourself guarantees a higher accuracy. (A minimal loading sketch follows the examples below.)
    • Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path of the custom dictionary
    • The dictionary format is the same as that of dict.txt: one word per line; each line has three parts: the word, its frequency (may be omitted) and its part of speech (may be omitted), separated by spaces, and the order must not be reversed. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
    • When the frequency is omitted, an automatically calculated frequency that guarantees the word can be segmented out is used.

For example:

创新办 3 i
云计算 5
凯特琳 nz

(The entries above are "Innovation Office", "cloud computing" and "Catherine".)
    • To use jieba on a restricted file system, change the tmp_dir and cache_file attributes of the tokenizer (jieba.dt by default) to specify the folder in which the cache file is stored and its file name.

    • Example:

      • Custom dictionary: https://github.com/fxsjy/jieba/blob/master/test/userdict.txt

      • Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_userdict.py

        • Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 / (Li Xiaofu / is / innovation / office / director / also / is / cloud / computing / aspect / of / expert)

        • After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 / (Li Xiaofu / is / Innovation Office / director / also / is / cloud computing / aspect / of / expert)
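A minimal loading sketch, under the assumption that the three entries shown earlier are saved locally as userdict.txt (the file name is chosen for this example):

import jieba

jieba.load_userdict("userdict.txt")   # one "word freq pos" entry per line, UTF-8 encoded
print("/".join(jieba.cut("李小福是创新办主任也是云计算方面的专家")))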

Adjust Dictionaries
    • Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically within your program (a short sketch appears after the code example below).

    • Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be segmented out.

    • Note: the automatically calculated frequencies may not be effective when the HMM new-word discovery feature is enabled.

Code example:

>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中将/出错/。
>>> jieba.suggest_freq(('中', '将'), True)
494
>>> print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
如果/放到/post/中/将/出错/。
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台/中/」/正确/应该/不会/被/切开
>>> jieba.suggest_freq('台中', True)
69
>>> print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
「/台中/」/正确/应该/不会/被/切开
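For completeness, a short sketch of the dynamic dictionary calls mentioned above, reusing words from the earlier dictionary example:

import jieba

jieba.add_word("创新办", freq=3, tag="i")   # add a word, with optional frequency and POS tag
jieba.add_word("云计算")
jieba.del_word("自定义词")                   # make sure a word is no longer produced
jieba.suggest_freq("台中", tune=True)        # tune the frequency so "台中" is kept in one piece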
    • "Enhanced ambiguity correction capability through user-defined dictionaries"---HTTPS://GITHUB.COM/FXSJY/JIEBA/ISSUES/14
    3. Keyword extraction
Keyword extraction based on TF-IDF algorithm

import jieba.analyse

    • jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=()) (a short usage sketch follows the linked example below)
      • sentence: the text from which keywords are extracted
      • topK: return the keywords with the K highest TF-IDF weights; the default value is 20
      • withWeight: whether to return the keyword weight values as well; the default value is False
      • allowPOS: include only words with the specified parts of speech; the default value is empty, i.e. no filtering
    • jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the path of the IDF frequency file

Code example (keyword extraction)

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
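In addition to the linked script, a minimal usage sketch; the sample text and topK value are illustrative only:

import jieba.analyse

text = "我来到北京清华大学，清华大学是著名的高等学府"
for keyword, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(keyword, weight)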

The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus:

    • Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path of the custom corpus
    • Example of a custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
    • Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py

The stop-word (Stop Words) list used for keyword extraction can also be switched to a custom corpus (a combined sketch follows this list):

    • Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path of the custom stop-word file
    • Example of a custom stop-word file: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
    • Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
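A combined sketch of switching both corpora before extraction; the local file names idf.txt.big and stop_words.txt mirror the linked examples and are assumed to exist:

import jieba.analyse

jieba.analyse.set_idf_path("idf.txt.big")        # custom inverse document frequency corpus
jieba.analyse.set_stop_words("stop_words.txt")   # words listed here are never returned as keywords
print(jieba.analyse.extract_tags("我来到北京清华大学，清华大学是著名的高等学府", topK=10))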

Example of returning keyword weight values together with the keywords

    • Usage Example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_with_weight.py
Keyword extraction based on Textrank algorithm
    • jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')) is used directly; the interface is the same as extract_tags, but note that it filters parts of speech by default. (A short sketch follows the demo reference below.)
    • jieba.analyse.TextRank() creates a new custom TextRank instance

Algorithm paper: TextRank: Bringing Order into Texts

Basic idea:
    1. Segment the text from which keywords are to be extracted
    2. Build a graph from the co-occurrence relations between words within a fixed-size window (default 5, adjustable via the span attribute)
    3. Compute the PageRank of the nodes in the graph; note that the graph is undirected and weighted
Examples of Use:

See test/demo.py
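For quick reference, a minimal TextRank sketch parallel to the TF-IDF example above; the sample sentence is illustrative:

import jieba.analyse

text = "线程是程序执行时的最小单位，它是进程的一个执行流"
print(jieba.analyse.textrank(text, topK=5, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')))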

    4. POS tagging
    • jieba.posseg.POSTokenizer(tokenizer=None) creates a new custom POS-tagging tokenizer; the tokenizer parameter specifies the jieba.Tokenizer to be used internally. jieba.posseg.dt is the default POS-tagging tokenizer. (A sketch using a custom tokenizer follows the example below.)
    • After sentence segmentation, the part of speech of each word is tagged, using notation compatible with ictclas.
    • Usage examples
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for word, flag in words:
...    print('%s %s' % (word, flag))
...
我 r
爱 v
北京 ns
天安门 ns
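A minimal sketch of the custom POS tokenizer described above, bound to its own jieba.Tokenizer; both use the default dictionary here:

import jieba
import jieba.posseg as pseg

custom_pseg = pseg.POSTokenizer(tokenizer=jieba.Tokenizer())
for word, flag in custom_pseg.cut("我爱北京天安门"):
    print(word, flag)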

  

    5. Parallel segmentation
  • Principle: the target text is split by lines, each line is assigned to one of several Python processes for parallel segmentation, and the results are then merged, giving a considerable speedup.

  • Based on Python's multiprocessing module; Windows is currently not supported.

  • Usage (a minimal sketch appears at the end of this section):

    • jieba.enable_parallel(4)  # enable parallel segmentation mode; the argument is the number of parallel processes
    • jieba.disable_parallel()  # disable parallel segmentation mode
  • Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py

  • Experimental result: on a 4-core 3.4 GHz Linux machine, precise segmentation of the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.

  • Note: parallel segmentation only supports the default tokenizers jieba.dt and jieba.posseg.dt.
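A minimal sketch of the parallel API on a POSIX system; the input file name big_corpus.txt is an assumption:

import jieba

jieba.enable_parallel(4)                       # 4 worker processes; not supported on Windows
with open("big_corpus.txt", "rb") as f:
    words = jieba.lcut(f.read().decode("utf-8"))
jieba.disable_parallel()
print(len(words))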

    6. Tokenize: return words together with their start and end positions in the original text
    • Note: the input parameter only accepts unicode
    • Default mode

result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

word 永和                 start: 0                end: 2
word 服装                 start: 2                end: 4
word 饰品                 start: 4                end: 6
word 有限公司             start: 6                end: 10

  

    • Search mode
result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2]))

word 永和                 start: 0                end: 2
word 服装                 start: 2                end: 4
word 饰品                 start: 4                end: 6
word 有限                 start: 6                end: 8
word 公司                 start: 8                end: 10
word 有限公司             start: 6                end: 10

  

    7. ChineseAnalyzer for the Whoosh search engine
    • Import: from jieba.analyse import ChineseAnalyzer
    • Usage example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py (a minimal schema sketch follows below)
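Beyond the linked test, a minimal indexing sketch assuming the whoosh package is installed; the index directory name indexdir and the sample document are illustrative:

import os
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()                      # jieba-backed tokenizer for whoosh
schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title="document", path="/a", content="我来到北京清华大学")
writer.commit()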
    8. Command-line segmentation

Usage example: python -m jieba news.txt > cut_result.txt

Command-line options (translation):

Usage: python -m jieba [options] filename

Jieba command line interface.

Positional arguments:
  filename              input file

Optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM as the word delimiter instead of the default '/';
                        if DELIM is not given, a space is used
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is given, the word and its part of
                        speech are separated by it, otherwise by '_'
  -D DICT, --dict DICT  use DICT instead of the default dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT as an additional dictionary together with the
                        default or custom dictionary
  -a, --cut-all         full-mode segmentation (POS tagging is not supported)
  -n, --no-hmm          do not use the Hidden Markov Model
  -q, --quiet           do not print loading information to stderr
  -V, --version         show version information and exit

If no file name is specified, standard input is used.

  

Output of the --help option:

$> python -m jieba --help
Jieba command line interface.

positional arguments:
  filename              input file

optional arguments:
  -h, --help            show this help message and exit
  -d [DELIM], --delimiter [DELIM]
                        use DELIM instead of '/' for word delimiter; or a space if it is
                        used without DELIM
  -p [DELIM], --pos [DELIM]
                        enable POS tagging; if DELIM is specified, use DELIM instead of '_'
                        for POS delimiter
  -D DICT, --dict DICT  use DICT as dictionary
  -u USER_DICT, --user-dict USER_DICT
                        use USER_DICT together with the default dictionary or DICT (if specified)
  -a, --cut-all         full pattern cutting (ignored with POS tagging)
  -n, --no-hmm          don't use the Hidden Markov Model
  -q, --quiet           don't print loading messages to stderr
  -V, --version         show program's version number and exit

If no filename specified, use STDIN instead.

  

Deferred loading mechanism

jieba uses lazy loading: import jieba and jieba.Tokenizer() do not immediately trigger loading of the dictionary; the dictionary is loaded and the prefix dictionary is built only when it becomes necessary. If you want to initialize jieba manually, you can do so:

import jieba
jieba.initialize()  # manual initialization (optional)

  

Before version 0.28 it was not possible to specify the path of the main dictionary; with the deferred loading mechanism, you can now change the path of the main dictionary:

jieba.set_dictionary('data/dict.txt.big')

Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py

Other dictionaries
    1. A dictionary file that uses less memory: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small

    2. A dictionary file with better support for Traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

Download the dictionary you need, then overwrite jieba/dict.txt with it, or use jieba.set_dictionary('data/dict.txt.big')

