Problems Encountered When Using the Python jieba Chinese Word Segmentation Tool and Their Solutions

Last Update: 2017-04-29 Source: Internet

Author: User

This article describes problems I encountered while using the Python jieba Chinese word segmentation tool, along with their solutions. It is shared here for reference. The details are as follows:

In my experience, jieba is the best Chinese word segmentation tool for Python. Its features include word segmentation, part-of-speech tagging, keyword extraction, and support for user dictionaries. I have been studying this tool over the past few days and ran into some problems during installation and use. Here is how I dealt with them.

Address: https://github.com/fxsjy/jieba

1. Installation

According to the official website, there are three installation methods:

The first is fully automatic installation: easy_install jieba or pip install jieba. However, I could not find such an installer.

The second is semi-automatic installation: download the package from http://pypi.python.org/pypi/jieba/, unzip it, and run python setup.py install from the cmd command line. Note that by default the python command cannot be run directly in cmd; you must first add its path to the PATH environment variable. I tried this method and it works. However, after installing this way, jieba could only be used in Python's built-in IDLE. "import jieba" failed in MyEclipse with PyDev, so I tried the third method.
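Whether the python command is reachable from the command line can also be checked from Python itself. The following is a minimal sketch for Python 3.3 or later and is not part of the original article; the helper name is_on_path is made up for illustration.

```python
import shutil


def is_on_path(command):
    """Return True if `command` can be found via the PATH environment variable."""
    # shutil.which returns the full path of the executable, or None if not found.
    return shutil.which(command) is not None


# Report whether the interpreter and pip can be started by name from a shell.
for name in ("python", "pip"):
    print(name, "found on PATH" if is_on_path(name) else "NOT found on PATH")
```

If python is reported as not found, its installation directory still needs to be added to PATH before python setup.py install can be run from cmd.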

The third is manual installation: place the jieba directory in the current directory or in the site-packages directory. I unzipped the downloaded jieba-0.30.zip and copied the jieba directory to the same location as my Python program. After that, "import jieba" worked in the program.
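A likely reason for the difference between IDLE and PyDev is that the two were configured with different interpreters or module search paths; this is an assumption, as the article does not confirm it. Manual installation works because Python searches the script's own directory and site-packages when importing. Below is a minimal Python 3 sketch for checking which interpreter is running and whether it can locate a package. The name locate_package is a hypothetical helper, not part of jieba or the standard library.

```python
import importlib.util
import sys


def locate_package(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print("Interpreter:", sys.executable)
print("Search path:")
for entry in sys.path:
    print("   ", entry)
# Prints None when this interpreter cannot see jieba.
print("jieba found at:", locate_package("jieba"))
```

Running this once in IDLE and once in the IDE makes it easy to compare the two environments.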

2. Implementation of Word Segmentation

The official website provides basic word segmentation examples:

# encoding=utf-8
import jieba

# Full mode ("I came to Tsinghua University in Beijing")
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/".join(seg_list))

# Accurate mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/".join(seg_list))

# Accurate mode is the default ("He came to the NetEase Hangyan Building")
seg_list = jieba.cut("他来到了网易杭研大厦")
print(",".join(seg_list))

# Search engine mode ("Xiao Ming graduated from the Institute of Computing of the
# Chinese Academy of Sciences and later studied at Kyoto University in Japan")
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
print(",".join(seg_list))

The program runs, but in MyEclipse the Chinese text is displayed as Unicode codes rather than readable characters, which is not what one would expect.
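One common way to end up with escape codes instead of Chinese characters is to print a container of strings rather than the strings themselves, because a container is shown using the repr of each element. This is only a guess at what happened in MyEclipse; a console encoding mismatch is another possibility. The Python 3 sketch below uses ascii() to mimic how Python 2 shows unicode strings inside a list.

```python
words = ["我", "来到", "北京"]

# ascii() escapes every non-ASCII character, similar to how Python 2
# displays unicode strings inside a list.
escaped = ascii(words)
print(escaped)

# Joining the strings first yields readable text.
readable = "/".join(words)
print(readable)
```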

However, another example, part-of-speech tagging, displays Chinese normally:

# encoding=utf-8
import jieba.posseg as pseg

# "I love Tiananmen in Beijing"
words = pseg.cut("我爱北京天安门")
for w in words:
    # Each item carries the word and its part-of-speech flag.
    print("%s %s" % (w.word, w.flag))
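For Python 3, a part-of-speech tagging example might look like the sketch below. It assumes jieba is installed; tag_sentence and format_pairs are illustrative names and are not part of jieba.

```python
def format_pairs(pairs):
    """Turn (word, flag) tuples into 'word flag' display lines."""
    return ["%s %s" % (word, flag) for word, flag in pairs]


def tag_sentence(sentence):
    """Segment `sentence` and return a list of (word, flag) tuples."""
    # Imported here so that format_pairs can be used even without jieba.
    import jieba.posseg as pseg
    return [(w.word, w.flag) for w in pseg.cut(sentence)]


# Usage, assuming jieba is installed:
#     for line in format_pairs(tag_sentence("我爱北京天安门")):
#         print(line)
```

Collecting the pairs into a list first keeps the jieba call separate from the printing, which makes the display step easy to check on its own.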

So I decided to look at the source code. I found that the cut function at line 1 of jieba/__init__.py (the file that implements word segmentation) contains a block that checks the encoding:

if not isinstance(sentence, unicode):
    try:
        sentence = sentence.decode('utf-8')
    except UnicodeDecodeError:
        sentence = sentence.decode('gbk', 'ignore')

No such code appears in the cut function at line 1 of jieba/posseg/__init__.py (the file that implements part-of-speech tagging). So I guessed that the former garbles its output because it checks the encoding, while the latter displays normally because it does not. I commented out the encoding check in the former, but the program then raised an error when run, so I had to restore the original source code. After that, the result was displayed in Chinese again!
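The encoding check quoted above converts byte strings to text before segmentation, and the rest of the code presumably relies on that, which would explain why removing it caused an error (this is an inference, not something the article states). The same pattern of trying UTF-8 first and falling back to GBK can be sketched as a standalone Python 3 function. The name to_text is hypothetical and is not part of jieba.

```python
def to_text(sentence):
    """Return `sentence` as str, decoding bytes as UTF-8 with a GBK fallback."""
    if isinstance(sentence, str):
        return sentence  # already decoded text, nothing to do
    try:
        return sentence.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume GBK and drop any bytes that cannot be decoded.
        return sentence.decode("gbk", "ignore")


print(to_text("中文".encode("utf-8")))
print(to_text("中文".encode("gbk")))
print(to_text("中文"))
```

All three calls print the same two characters, whichever encoding the input arrived in.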

The output looks like this: [screenshot of the program output]

The above only covers word segmentation and part-of-speech tagging of fixed Chinese string variables. In the next article, I will try reading Chinese text from a file for word segmentation and part-of-speech tagging.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
