Python Chinese Word Segmentation component jieba



"Jieba" (结巴) Chinese word segmentation: aiming to be the best Python Chinese word segmentation component.

 

Features
  • Three word segmentation modes are supported:

    • Accurate mode, which attempts the most precise segmentation and is suitable for text analysis;
    • Full mode, which scans out all the words in a sentence that could be words. It is very fast, but cannot resolve ambiguity;
    • Search engine mode, which, on top of the accurate mode, further segments long words to improve recall. It is suitable for search-engine word segmentation.
  • Supports traditional Chinese word segmentation

  • Supports custom dictionaries

Online Demo

http://jiebademo.ap01.aws.af.cm/

(Powered by Appfog)

 

Install
  • Currently, the master branch only supports Python 2.x.
  • A Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

    git clone https://github.com/fxsjy/jieba.git
    git checkout jieba3k
    python setup.py install
Function 1): Word Segmentation
  • The jieba.cut method accepts two input parameters: 1) the first parameter is the string to be segmented; 2) the cut_all parameter controls whether full mode is used.
  • The jieba.cut_for_search method accepts one parameter: the string to be segmented. It segments at a finer granularity and is suitable for building the inverted index of a search engine.
  • Note: the string to be segmented may be a GBK string, a UTF-8 string, or a unicode string.
  • jieba.cut and jieba.cut_for_search both return an iterable generator. You can use a for loop to obtain every word (unicode) produced by the segmentation, or use list(jieba.cut(...)) to convert the result to a list.
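Since both functions return generators, a tiny stand-in sketch (the whitespace "segmentation" below is a made-up placeholder, not jieba's algorithm) shows how such a result is consumed:

```python
def cut_sketch(sentence):
    # Stand-in for jieba.cut: yields "words" lazily, like the real generator.
    # Splitting on whitespace is only a placeholder for the real segmenter.
    for word in sentence.split():
        yield word

# A generator is consumed lazily, e.g. with a for loop...
for w in cut_sketch("hello world"):
    pass
# ...or materialized all at once, as with list(jieba.cut(...)).
words = list(cut_sketch("hello world"))
```

Note that a generator can only be iterated once; convert it with list() if you need to reuse the result.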
Sample Code (Word Segmentation)

 

# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)  # "I came to Beijing Tsinghua University"
print "Full Mode:", "/".join(seg_list)  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print "Default Mode:", "/".join(seg_list)  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # default is accurate mode; "He came to the NetEase Hangyan Building"
print ",".join(seg_list)

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print ",".join(seg_list)
Output:

 

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦    (here, "杭研" is not in the dictionary, but it is still identified by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造
Function 3): Keyword Extraction
  • jieba.analyse.extract_tags(sentence, topK)  # import jieba.analyse first
  • sentence is the text from which keywords are extracted
  • topK is the number of keywords with the largest TF-IDF weights to return. The default value is 20.
Code example (keyword extraction)

 

https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py
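The linked example only shows the call; as background, extract_tags ranks words by TF-IDF weight. Below is a minimal pure-Python sketch of that ranking over a toy corpus (no jieba dependency; the real implementation uses a pre-computed IDF table shipped with the library):

```python
import math
from collections import Counter

def extract_tags_sketch(words, corpus, topK=20):
    # words: token list of the target document
    # corpus: list of token lists used only to estimate document frequency
    tf = Counter(words)
    n_docs = len(corpus)

    def idf(w):
        # Smoothed inverse document frequency over the toy corpus.
        df = sum(1 for doc in corpus if w in doc)
        return math.log(n_docs / (1.0 + df))

    # Score each distinct word by term frequency times IDF.
    scored = {w: (tf[w] / float(len(words))) * idf(w) for w in tf}
    return sorted(scored, key=scored.get, reverse=True)[:topK]
```

Words that are frequent in the document but rare in the corpus get the highest scores, which is exactly why TF-IDF surfaces good keywords.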
Function 5): Parallel Word Segmentation
  • Principle: the target text is split by lines, the lines are distributed among multiple Python processes for parallel segmentation, and the results are then merged, giving a considerable improvement in segmentation speed.
  • Based on Python's multiprocessing module; Windows is currently not supported.
  • Usage:

    • jieba.enable_parallel(4)  # enable parallel segmentation mode; the parameter is the number of parallel processes
    • jieba.disable_parallel()  # disable parallel segmentation mode
  • Example: https://github.com/fxsjy/jieba/blob/master/test/parallel/test_file.py

  • Experimental result: on a 4-core 3.4 GHz Linux machine, precisely segmenting the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.
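The split-by-lines / segment-in-parallel / merge pipeline described above can be sketched as follows (a thread-backed Pool and a whitespace tokenizer stand in for jieba's real process pool and segmenter):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool; jieba uses real processes

def segment_line(line):
    # Placeholder segmenter: real code would call jieba.cut(line) here.
    return line.split()

def parallel_segment(text, workers=4):
    lines = text.splitlines()                      # 1) split the target text by lines
    pool = Pool(workers)
    try:
        per_line = pool.map(segment_line, lines)   # 2) segment lines in parallel
    finally:
        pool.close()
        pool.join()
    merged = []                                    # 3) merge per-line results in order
    for words in per_line:
        merged.extend(words)
    return merged
```

This works because lines can be segmented independently, so the only serial steps are the split and the final merge.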

Function 6): Tokenize: returns the start and end position of each word in the original text
  • Note: the input only accepts unicode
  • Default mode:

result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print "word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2])

word 永和            start: 0    end: 2
word 服装            start: 2    end: 4
word 饰品            start: 4    end: 6
word 有限公司        start: 6    end: 10
  • Search mode:

result = jieba.tokenize(u'永和服装饰品有限公司', mode='search')
for tk in result:
    print "word %s\t\t start: %d \t\t end: %d" % (tk[0], tk[1], tk[2])

word 永和            start: 0    end: 2
word 服装            start: 2    end: 4
word 饰品            start: 4    end: 6
word 有限            start: 6    end: 8
word 公司            start: 8    end: 10
word 有限公司        start: 6    end: 10
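As noted earlier, the finer-grained output of cut_for_search / tokenize is what feeds a search engine's inverted index. A toy sketch of such an index, using hypothetical hard-coded (word, start, end) tuples in place of real tokenizer output:

```python
def build_inverted_index(docs):
    # docs: {doc_id: [(word, start, end), ...]}, as a tokenizer would produce.
    index = {}
    for doc_id, tokens in docs.items():
        for word, start, end in tokens:
            # Postings list: every document and offset where the word occurs.
            index.setdefault(word, []).append((doc_id, start, end))
    return index

# Hypothetical tokenizer output for two tiny documents.
docs = {
    1: [("limited", 6, 8), ("company", 8, 10)],
    2: [("company", 0, 2)],
}
index = build_inverted_index(docs)
```

Because search mode also emits the sub-words of long words, a query for either the short or the long form can hit the same document, which is what "improving recall" means here.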
Function 7): ChineseAnalyzer for the Whoosh Search Engine
  • Usage: from jieba.analyse import ChineseAnalyzer
  • Example: https://github.com/fxsjy/jieba/blob/master/test/test_whoosh.py

Other Dictionaries
Download the dictionary you need, then either overwrite jieba/dict.txt or load it with jieba.set_dictionary('data/dict.txt.big').

Module Initialization Mechanism Change: Lazy Load (from version 0.28)
jieba uses delayed loading: "import jieba" does not trigger dictionary loading immediately; the dictionary is loaded and the trie built only when first needed. If you want to initialize jieba manually, you can do so explicitly:

import jieba
jieba.initialize()  # manual initialization (optional)

Versions earlier than 0.28 could not specify the path of the main dictionary. With the delayed loading mechanism, you can now change the main dictionary path:

jieba.set_dictionary('data/dict.txt.big')

Example: https://github.com/fxsjy/jieba/blob/master/test/test_change_dictpath.py

FAQ
1) How is the model data generated? https://github.com/fxsjy/jieba/issues/7
2) What is the license of this library? https://github.com/fxsjy/jieba/issues/2
For more questions, see: https://github.com/fxsjy/jieba/issues?sort=updated&state=closed

Change Log
http://www.oschina.net/p/jieba/news#list
http://www.oschina.net/p/jieba
https://github.com/fxsjy/jieba
