Jieba ("stuttering") is a Chinese word segmentation library for Python developed by Sun Junyi; you can view the jieba project on GitHub.
To use jieba for Chinese word segmentation, you first need to install it. The author gives the following installation methods:
1. Fully automatic installation: easy_install jieba, pip install jieba, or pip3 install jieba
2. Semi-automatic installation: first download http://pypi.python.org/pypi/jieba/, decompress it, then run python setup.py install
3. Manual installation: place the jieba directory in the current directory or in the site-packages directory
The author describes the algorithms used:
1. Efficient word-graph scanning based on a prefix dictionary, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence
2. Dynamic programming to find the maximum-probability path, yielding the most likely segmentation based on word frequencies
3. For unknown (out-of-vocabulary) words, an HMM model based on the word-forming capability of Chinese characters, solved with the Viterbi algorithm
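The first two steps above can be sketched in pure Python. This is a toy illustration, not jieba's actual implementation; the dictionary and frequency counts below are invented for the example:

```python
import math

# Toy frequency dictionary (invented numbers, for illustration only).
FREQ = {"我": 5, "来": 3, "来到": 4, "到": 2, "北": 2, "京": 2, "北京": 6}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """Step 1: for each start position, list the end positions of every
    dictionary word beginning there, forming a DAG over the sentence."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def max_prob_path(sentence, dag):
    """Step 2: dynamic programming from right to left over log
    frequencies to find the maximum-probability segmentation."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    # Walk the chosen route to produce the word list.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

sentence = "我来到北京"
dag = build_dag(sentence)
print(max_prob_path(sentence, dag))  # → ['我', '来到', '北京']
```

Because 来到 and 北京 have higher frequencies than their component characters, the DP path keeps them whole rather than cutting character by character.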
Main functions:
1. Word segmentation
The two main methods are jieba.cut and jieba.cut_for_search.
The jieba.cut method accepts three input parameters:
1. the string to be segmented;
2. the cut_all parameter, which controls whether full mode is used;
3. the HMM parameter, which controls whether the HMM model is used
jieba.cut("我来到北京清华大学", cut_all=True)
The jieba.cut_for_search method accepts two input parameters:
1. the string to be segmented;
2. whether to use the HMM model.
This method is suitable for segmentation when building an inverted index for a search engine; the granularity is relatively fine
jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
The structure returned by the jieba.cut and jieba.cut_for_search methods is an iterable generator, which can be traversed with a for loop to obtain each word (unicode).
You can also use the jieba.lcut and jieba.lcut_for_search methods to return a list directly.
Note from the author: the string to be segmented can be a unicode string, a UTF-8 string, or a GBK string.
Note: it is not recommended to pass in a GBK string directly, as it may be incorrectly decoded as UTF-8
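The safe pattern implied by the note above is to decode GBK bytes to a unicode string explicitly before segmentation (a small sketch using only the standard library; jieba itself is not invoked here):

```python
# A GBK-encoded byte string, e.g. read from a legacy Chinese text file.
raw = "我来到北京".encode("gbk")

# Decode explicitly to a unicode str before passing the text to the
# segmenter; handing the raw GBK bytes over risks them being
# mis-decoded as UTF-8.
text = raw.decode("gbk")
print(text)  # → 我来到北京
```

The same pattern applies when reading files: open them with encoding="gbk" so the segmenter only ever sees unicode strings.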
Here are the demo and its output as given by the author:
#!/usr/bin/env python
# coding:utf-8
import jieba

if __name__ == '__main__':
    seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
    print("Full Mode: " + "/".join(seg_list))  # full mode

    seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
    print("Default Mode: " + "/".join(seg_list))  # precise mode

    seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
    print(", ".join(seg_list))

    seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
    print(", ".join(seg_list))
From the output we can see:
Full mode: scans out all the words in the sentence that can form dictionary words; very fast, but it cannot resolve ambiguity. The output is every possible word combination: for example, 清华大学 (Tsinghua University) is split into 清华, 清华大学, 华大, and 大学
Default mode (precise mode): tries to cut the sentence apart most accurately, which is suitable for text analysis; for example, 清华大学 is output only as the single word 清华大学
Search engine mode: on the basis of precise mode, long words are segmented again, improving recall; suitable for search engine segmentation
There is another method, jieba.Tokenizer(dictionary=DEFAULT_DICT), which creates a custom tokenizer; this makes it possible to use different dictionaries at the same time.
jieba.dt is the default tokenizer, and all global segmentation functions are mappings of this tokenizer's methods.
Python Natural Language Processing Learning -- jieba Word Segmentation