Chinese word segmentation is a fundamental task in Chinese text processing, and the Jieba ("stuttering") library is widely used for it. Its implementation rests on three points:
- Efficient word-graph scanning based on a trie tree structure, generating a directed acyclic graph (DAG) of all possible words formed by the Chinese characters in a sentence
- Dynamic programming to find the maximum-probability path, yielding the best segmentation combination based on word frequency
- For out-of-vocabulary words, an HMM model based on the word-forming capability of Chinese characters, solved with the Viterbi algorithm
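The first two points can be sketched as a toy implementation (the lexicon, its frequencies, and the function names below are hypothetical illustrations, not jieba's actual code): for each start position, collect every dictionary word beginning there to build the DAG, then run dynamic programming from right to left to maximize the summed log-probability of the chosen words.

```python
import math

# Toy lexicon with made-up frequencies (not jieba's real dictionary).
FREQ = {"我": 5, "来到": 4, "北京": 6, "清华": 3, "大学": 5, "清华大学": 4}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index i, list every end index j where sentence[i:j] is a word."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def max_prob_cut(sentence):
    """Dynamic programming over the DAG: maximize the summed log-probability."""
    dag = build_dag(sentence)
    n = len(sentence)
    # route[i] = (best log-prob of sentence[i:], end index of the word chosen at i)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(max_prob_cut("我来到北京清华大学"))
```

Note how "清华大学" wins over "清华" + "大学": a single in-vocabulary word pays the log-probability penalty only once, which is exactly why the frequency-weighted path favors longer dictionary words.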
Installation (Linux environment)
Download the toolkit, unzip it into a directory, and run: python setup.py install
Modes
- Default mode: cuts the sentence as accurately as possible, suitable for text analysis
- Full mode: scans out all words that can possibly be formed from the sentence, suitable for search engines
Interface
- The component only provides the jieba.cut method for segmentation
- The cut method accepts two input parameters:
- The first argument is the string to be segmented
- The cut_all parameter controls the segmentation mode
- The string to be segmented can be a GBK string, a UTF-8 string, or Unicode
- The structure returned by jieba.cut is an iterable generator: use a for loop to obtain each word (as Unicode) produced by the segmentation, or convert it to a list with list(jieba.cut(...))
Example
# -*- coding: utf-8 -*-
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)  # Full mode
print("Full Mode: " + "/ ".join(seg_list))

seg_list = jieba.cut("我来到北京清华大学")  # Default mode
print("Default Mode: " + "/ ".join(seg_list))
Results
Implementation principle
1. Trie tree: reference http://www.cnblogs.com/kaituorensheng/p/3602155.html
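A minimal trie can be sketched in a few lines (a toy illustration, not jieba's internal structure; all class and method names are made up). The prefixes scan is exactly the per-start-position dictionary lookup used to build the DAG described above:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.is_word = False

class Trie:
    """Minimal prefix tree for dictionary lookup during segmentation."""
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def prefixes(self, text, start=0):
        """All dictionary words beginning at text[start], found in one scan."""
        node, found = self.root, []
        for j in range(start, len(text)):
            node = node.children.get(text[j])
            if node is None:
                break
            if node.is_word:
                found.append(text[start:j + 1])
        return found

trie = Trie(["清华", "清华大学", "大学"])
print(trie.prefixes("清华大学"))  # → ['清华', '清华大学']
```

Because shared prefixes share a path in the tree, finding every word that starts at a given position costs a single walk down the trie rather than one dictionary lookup per candidate length.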
Python Chinese word segmentation: Jieba