Python Chinese Word Segmentation component jieba



"Jieba" (结巴) Chinese word segmentation: aiming to be the best Python Chinese word segmentation component.

 

Features
  • Three word segmentation modes are supported:

    • Accurate mode, which attempts the most precise segmentation and is suitable for text analysis;
    • Full mode, which scans out all the words in a sentence that could be words. It is very fast, but cannot resolve ambiguity;
    • Search engine mode, which, on top of the accurate mode, further segments long words to improve recall. It is suitable for search-engine word segmentation.
  • Supports traditional Chinese word segmentation

  • Supports custom dictionaries

Online Demo

http://jiebademo.ap01.aws.af.cm/

(Powered by Appfog)

 

Install
  • Currently, the master branch only supports Python 2.x.
  • A Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

    git clone https://github.com/fxsjy/jieba.git
    git checkout jieba3k
    python setup.py install
Function 1): Word Segmentation
  • The jieba.cut method accepts two input parameters: 1) the first parameter is the string to be segmented; 2) the cut_all parameter controls whether full mode is used.
  • The jieba.cut_for_search method accepts one parameter: the string to be segmented. It segments at a finer granularity and is suitable for building the inverted index of a search engine.
  • Note: the string to be segmented may be a GBK string, a UTF-8 string, or a unicode string.
  • jieba.cut and jieba.cut_for_search both return an iterable generator. You can use a for loop to obtain every word (unicode) produced by the segmentation, or use list(jieba.cut(...)) to convert the result to a list.
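Since both functions return generators, a tiny stand-in sketch (the whitespace "segmentation" below is a made-up placeholder, not jieba's algorithm) shows how such a result is consumed:

```python
def cut_sketch(sentence):
    # Stand-in for jieba.cut: yields "words" lazily, like the real generator.
    # Splitting on whitespace is only a placeholder for the real segmenter.
    for word in sentence.split():
        yield word

# A generator is consumed lazily, e.g. with a for loop...
for w in cut_sketch("hello world"):
    pass
# ...or materialized all at once, as with list(jieba.cut(...)).
words = list(cut_sketch("hello world"))
```

Note that a generator can only be iterated once; convert it with list() if you need to reuse the result.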
Sample Code (Word Segmentation)

 

# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)  # "I came to Beijing Tsinghua University"
print "Full Mode:", "/".join(seg_list)  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print "Default Mode:", "/".join(seg_list)  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # default is accurate mode; "He came to the NetEase Hangyan Building"
print ",".join(seg_list)

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print ",".join(seg_list)
Output:

 

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦    (here, "杭研" is not in the dictionary, but it is still identified by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
Function 3): Keyword Extraction
  • jieba.analyse.extract_tags(sentence, topK)  # import jieba.analyse first
  • sentence is the text from which keywords are extracted
  • topK is the number of keywords with the largest TF-IDF weights to return. The default value is 20.
Code example (keyword extraction)

 

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
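The linked example only shows the call; as background, extract_tags ranks words by TF-IDF weight. Below is a minimal pure-Python sketch of that ranking over a toy corpus (no jieba dependency; the real implementation uses a pre-computed IDF table shipped with the library):

```python
import math
from collections import Counter

def extract_tags_sketch(words, corpus, topK=20):
    # words: token list of the target document
    # corpus: list of token lists used only to estimate document frequency
    tf = Counter(words)
    n_docs = len(corpus)

    def idf(w):
        # Smoothed inverse document frequency over the toy corpus.
        df = sum(1 for doc in corpus if w in doc)
        return math.log(n_docs / (1.0 + df))

    # Score each distinct word by term frequency times IDF.
    scored = {w: (tf[w] / float(len(words))) * idf(w) for w in tf}
    return sorted(scored, key=scored.get, reverse=True)[:topK]
```

Words that are frequent in the document but rare in the corpus get the highest scores, which is exactly why TF-IDF surfaces good keywords.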
Function 5): Parallel Word Segmentation
  • Principle: the target text is split by lines, the lines are distributed among multiple Python processes for parallel segmentation, and the results are then merged, giving a considerable improvement in segmentation speed.
  • Based on Python's multiprocessing module; Windows is currently not supported.
  • Usage:

    • jieba.enable_parallel(4)  # enable parallel segmentation mode; the parameter is the number of parallel processes
    • jieba.disable_parallel()  # disable parallel segmentation mode
  • Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py

  • Experimental result: on a 4-core 3.4 GHz Linux machine, precisely segmenting the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.
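The split-by-lines / segment-in-parallel / merge pipeline described above can be sketched as follows (a thread-backed Pool and a whitespace tokenizer stand in for jieba's real process pool and segmenter):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool; jieba uses real processes

def segment_line(line):
    # Placeholder segmenter: real code would call jieba.cut(line) here.
    return line.split()

def parallel_segment(text, workers=4):
    lines = text.splitlines()                      # 1) split the target text by lines
    pool = Pool(workers)
    try:
        per_line = pool.map(segment_line, lines)   # 2) segment lines in parallel
    finally:
        pool.close()
        pool.join()
    merged = []                                    # 3) merge per-line results in order
    for words in per_line:
        merged.extend(words)
    return merged
```

This works because lines can be segmented independently, so the only serial steps are the split and the final merge.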

Function 6): Tokenize: returns the start and end position of each word in the original text
  • Note: the input only accepts unicode
  • Default mode:

result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print "word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2])

word 永和            start: 0    end: 2
word 服装            start: 2    end: 4
word 饰品            start: 4    end: 6
word 有限公司        start: 6    end: 10
  • Search mode:

result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print "word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2])

word 永和            start: 0    end: 2
word 服装            start: 2    end: 4
word 饰品            start: 4    end: 6
word 有限            start: 6    end: 8
word 公司            start: 8    end: 10
word 有限公司        start: 6    end: 10
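As noted earlier, the finer-grained output of cut_for_search / tokenize is what feeds a search engine's inverted index. A toy sketch of such an index, using hypothetical hard-coded (word, start, end) tuples in place of real tokenizer output:

```python
def build_inverted_index(docs):
    # docs: {doc_id: [(word, start, end), ...]}, as a tokenizer would produce.
    index = {}
    for doc_id, tokens in docs.items():
        for word, start, end in tokens:
            # Postings list: every document and offset where the word occurs.
            index.setdefault(word, []).append((doc_id, start, end))
    return index

# Hypothetical tokenizer output for two tiny documents.
docs = {
    1: [("limited", 6, 8), ("company", 8, 10)],
    2: [("company", 0, 2)],
}
index = build_inverted_index(docs)
```

Because search mode also emits the sub-words of long words, a query for either the short or the long form can hit the same document, which is what "improving recall" means here.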
Function 7): ChineseAnalyzer for the Whoosh Search Engine
  • Usage: from jieba.analyse import ChineseAnalyzer
  • Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py

Other Dictionaries
Download the dictionary you need, then either overwrite jieba/dict.txt or load it with jieba.set_dictionary('data/dict.txt.big').

Module Initialization Mechanism Change: Lazy Load (from version 0.28)
jieba uses delayed loading: "import jieba" does not trigger dictionary loading immediately; the dictionary is loaded and the trie built only when first needed. If you want to initialize jieba manually, you can do so explicitly:

import jieba
jieba.initialize()  # manual initialization (optional)

Versions earlier than 0.28 could not specify the path of the main dictionary. With the delayed loading mechanism, you can now change the main dictionary path:

jieba.set_dictionary('data/dict.txt.big')

Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py

FAQ
1) How is the model data generated? https://github.com/fxsjy/jieba/issues/7
2) What is the license of this library? https://github.com/fxsjy/jieba/issues/2
For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed

Change Log
http://www.oschina.net/p/jieba/news#list
http://www.oschina.net/p/jieba
https://github.com/fxsjy/jieba
