Problems Encountered During the Use of the Python jieba Chinese Word Segmentation Tool and Their Solutions
This article describes the problems I encountered while using the Python jieba Chinese word segmentation tool, along with their solutions, shared here for your reference. The details are as follows:
jieba is the best Chinese word segmentation tool for Python. Its features include word segmentation, part-of-speech tagging, keyword extraction, and support for custom dictionaries. I have been studying this tool over the past few days, ran into some problems during installation and use, and am posting my workarounds here to share them.
Address: https://github.com/fxsjy/jieba
1. Installation
According to the official website, there are three installation methods:
The first is fully automatic installation: easy_install jieba or pip install jieba. However, neither installer was available on my machine, so this method did not work for me.
The second method is semi-automatic installation: first download the package from http://pypi.python.org/pypi/jieba/, unzip it, and then run "python setup.py install" at the cmd command line. Note that by default the python command cannot be run directly in cmd; you must first add the Python directory to the PATH environment variable. I tried this and it works. However, after installing this way, the jieba word segmentation functions could only be used in Python's built-in IDLE; the "import jieba" statement could not be executed in MyEclipse with PyDev, so I tried the third method.
The third method is manual installation: place the jieba directory in the current directory or in the site-packages directory. That is, decompress the downloaded jieba-0.30.zip file and copy the jieba directory to the same location as your Python program. The program can then run "import jieba".
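To confirm that the manual copy worked, a minimal check like the following can be run from that same directory (the test sentence is my own choice; if the import succeeds and a segmented result is printed, the installation is fine):

# encoding=utf-8
# A quick check: if the jieba directory was copied correctly,
# this import succeeds and the example sentence is segmented.
import jieba
print "/".join(jieba.cut("我来到北京清华大学"))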
2. Implementation of Word Segmentation
The official website provides basic word segmentation examples:
# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print "Full Mode:", "/".join(seg_list)  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print "Default Mode:", "/".join(seg_list)  # exact mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # the default is exact mode
print ", ".join(seg_list)

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print ", ".join(seg_list)
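For reference, the output that the official README lists for this example (assuming the environment displays Chinese correctly) is:

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Exact Mode]: 我/ 来到/ 北京/ 清华大学
[New Word Recognition]: 他, 来到, 了, 网易, 杭研, 大厦 ("杭研" is not in the dictionary but is still recognized by the Viterbi algorithm)
[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造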
The program executes, but in MyEclipse the Chinese is displayed as Unicode escape sequences, which is clearly not what we want.
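In Python 2, jieba.cut yields unicode strings, and some consoles print them as escape sequences. A minimal workaround sketch, assuming the MyEclipse console is configured for UTF-8, is to encode the joined result explicitly before printing:

# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学")
result = "/".join(seg_list)    # a unicode string in Python 2
print result.encode("utf-8")   # encode explicitly; assumes a UTF-8 console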
However, another example, this one for part-of-speech tagging, runs and displays normally:
import jieba.posseg as pseg

words = pseg.cut("我爱北京天安门")
for w in words:
    print w.word, w.flag
So I decided to analyze the source code, and found that the cut function in the jieba/__init__.py file (which implements word segmentation) contains a block that checks the encoding:
if not isinstance(sentence, unicode):
    try:
        sentence = sentence.decode('utf-8')
    except UnicodeDecodeError:
        sentence = sentence.decode('gbk', 'ignore')
No such code is found in the cut function of the jieba/posseg/__init__.py file (the part-of-speech tagging module). So I guessed that the former checks the encoding and therefore produces garbled output, while the latter does not check the encoding and displays normally. I then commented out the encoding-check block in the former, but the program raised an error when it ran, so I had to restore the original source code; after that, the result was displayed in Chinese correctly!
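Note that the block above only converts byte strings; a unicode argument passes through untouched. Here is a small sketch of the two call paths (my own illustration, using the same example sentence):

# encoding=utf-8
import jieba

# Already unicode: isinstance(sentence, unicode) is True, so no decoding happens.
print "/".join(jieba.cut(u"我爱北京天安门"))

# A byte string: decoded via the utf-8 / gbk fallback block shown above.
print "/".join(jieba.cut("我爱北京天安门"))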
The running result is as follows:

[screenshot of the segmentation and part-of-speech tagging output]
The above only performs word segmentation and part-of-speech tagging on fixed Chinese string variables. In the next article, I will try reading Chinese text from a file for word segmentation and part-of-speech tagging.