Chinese word segmentation tool: jieba (结巴分词)
GitHub address: https://github.com/fxsjy/jieba
Segmentation modes
- Precise mode (default): splits the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans out every word in the sentence that can form a word, but ambiguity cannot be resolved.
- Search engine mode: on top of precise mode, long words are segmented again to improve recall; suitable for building a search engine's inverted index, where the granularity is relatively fine.
Note: the structure returned by jieba.cut and jieba.cut_for_search is an iterable generator, not a list.
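As a minimal sketch of what this means in practice (assuming a reasonably recent jieba version), the generator can be consumed with list(), and jieba also provides lcut / lcut_for_search, which return lists directly:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba

# jieba.cut returns a generator; iterate over it or materialize it with list().
words = list(jieba.cut("这是一本关于信息检索的书"))
print(words)

# lcut / lcut_for_search return lists directly.
print(jieba.lcut("这是一本关于信息检索的书"))
print(jieba.lcut_for_search("这是一本关于信息检索的书"))
```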
Example code 1
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Function: jieba segmentation test, basic segmentation
Date: 2016-05-21 15:44:24
"""
import jieba

# Full mode
seg = jieba.cut("这是一本关于信息检索的书", cut_all=True)   # cut_all=True: full mode
print("Full mode: " + "/ ".join(seg))

# Precise mode
seg = jieba.cut("这是一本关于信息检索的书", cut_all=False)  # cut_all=False: precise mode
print("Precise mode: " + "/ ".join(seg))

seg = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
print(", ".join(seg))

seg = jieba.cut_for_search("小明毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode
print(", ".join(seg))
```
Add a custom dictionary
Usage: jieba.load_userdict(file_name)
file_name is the path to a custom dictionary file, or a file-like object.
Dictionary format: one word per line; each line has three parts separated by spaces: the word, its frequency (optional), and its part of speech (optional); the order of the parts must not be changed.
Dictionary example:

```
创新办 3 i
云计算 5
凱特琳 nz
台中
```
Example code 2
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Function: jieba segmentation test, loading a custom dictionary
Date: 2016-05-21 15:44:24
"""
import jieba

# Load the custom dictionary
jieba.load_userdict("userdic.txt")

seg = jieba.cut("这是一本关于信息检索的书")
print("/ ".join(seg))

if __name__ == "__main__":
    pass
```
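To check that a custom entry really changes the output, one simple approach is to segment the same sentence before and after the entry is known. The sketch below uses jieba.add_word(), an in-memory alternative to load_userdict available in recent jieba versions; the entry 信息检索 and its frequency are arbitrary choices for illustration:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba

sentence = "这是一本关于信息检索的书"

# Segmentation with the default dictionary only
print("Before: " + "/ ".join(jieba.cut(sentence)))

# Add an entry to the in-memory dictionary (hypothetical entry for this sentence);
# putting "信息检索 100" in userdic.txt would have the same effect.
jieba.add_word("信息检索", freq=100)

# Segmentation after the entry has been added
print("After:  " + "/ ".join(jieba.cut(sentence)))
```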
POS tagging
Uses a tag set compatible with ICTCLAS (NLPIR).
```python
import jieba.posseg as pseg

words = pseg.cut("这是一本关于信息检索的书")
for word, flag in words:
    print('%s %s' % (word, flag))
```
Note: under Anaconda Python, the print statement above raised an error; the cause is not yet known.
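Each item yielded by pseg.cut() has word and flag attributes, so the tags can also be used to filter tokens. A minimal sketch that keeps only noun-like words (the rule flag.startswith("n") is just an illustrative choice):

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba.posseg as pseg

sentence = "这是一本关于信息检索的书"

# Keep tokens whose POS tag starts with "n" (noun-like words).
nouns = [p.word for p in pseg.cut(sentence) if p.flag.startswith("n")]
print("/ ".join(nouns))
```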
Other features
- Traditional Chinese text can be segmented
- Keyword extraction (see the sketch after this list)
- Parallel segmentation
- Tokenize: returns each word's start and end position in the original text (also shown in the sketch below)
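A minimal sketch of the keyword extraction and tokenize features mentioned above (topK=3 and the sample sentence are arbitrary choices; keyword extraction lives in the jieba.analyse submodule):

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba
import jieba.analyse

sentence = "这是一本关于信息检索的书"

# Keyword extraction: the top-K words ranked by TF-IDF.
print(jieba.analyse.extract_tags(sentence, topK=3))

# Tokenize: each word with its start and end offsets in the original text.
for word, start, end in jieba.tokenize(sentence):
    print("%s\tstart: %d\tend: %d" % (word, start, end))
```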
References
Official documentation: https://github.com/fxsjy/jieba
Other: Learning to use jieba Chinese word segmentation (Python), article 1190000004061791