At present, the Chinese word segmentation tools I use regularly include jieba (the "stuttering" segmenter), NLPIR, and a few others.
Recently I have been using jieba, and I can recommend it; it works quite well.
First, an introduction to jieba word segmentation
jieba segments Chinese text based on three main ideas:
- Efficient word-graph scanning based on a trie (prefix tree) structure, generating a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form
- Dynamic programming to find the maximum-probability path, i.e., the segmentation whose combined word frequency is highest (see the toy sketch after this list)
- For out-of-vocabulary words, an HMM based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm
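To make the second point concrete, here is a toy sketch of the maximum-probability-path idea. It is not jieba's actual code: the FREQ dictionary, its frequencies, and the helper names (build_dag, best_route, toy_cut) are made up for illustration; jieba's real routine works over its full dict.txt.

# -*- coding: utf-8 -*-
# Toy illustration of "maximum-probability path over a DAG of candidate words".
# FREQ is a tiny hand-made dictionary; jieba's real dictionary is dict.txt.
import math

FREQ = {u"我": 5, u"来到": 3, u"北京": 8, u"清华": 4, u"清华大学": 6, u"大学": 7}
TOTAL = float(sum(FREQ.values()))

def build_dag(sentence):
    # For each start index i, list every end index j such that sentence[i:j] is a dictionary word.
    # A single character is always allowed so that the graph stays connected.
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]
    return dag

def best_route(sentence, dag):
    # Dynamic programming from right to left:
    # route[i] = (best log-probability of segmenting sentence[i:], end index of the first word).
    n = len(sentence)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max((math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
                       for j in dag[i])
    return route

def toy_cut(sentence):
    route = best_route(sentence, build_dag(sentence))
    i, words = 0, []
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print("/ ".join(toy_cut(u"我来到北京清华大学")))   # -> 我/ 来到/ 北京/ 清华大学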
Second, installation and use (Linux)
1. Download the toolkit, unzip it, enter the directory, and run: python setup.py install
Hints: a. A good habit after downloading software is to read the README first and only then start operating. (Skipping the README and just trying things while searching Baidu leads to a lot of detours.)
b. If the install command fails with a permission error (some people hit this because they lack sufficient privileges), run: sudo !! (here "!!" stands for the previous command, i.e., the install command above); with sudo it runs normally.
2. When segmenting with jieba, the function you will always use is jieba.cut(arg1, arg2). This is the segmentation function, and you only need to understand the following three points to use it:
a. The cut method accepts two parameters: the first (arg1) is the string to be segmented, and the second (arg2) controls the segmentation mode.
There are two segmentation modes: the default (accurate) mode, which tries to cut the sentence as precisely as possible and is suitable for text analysis, and full mode, which scans out every word contained in the sentence and is suitable for search engines.
b. The string to be segmented can be a GBK string, a UTF-8 string, or a Unicode string.
Python users should pay attention to encoding here. Python 2 processes characters as ASCII by default, so when non-ASCII characters appear (for example, Chinese characters written in the source code), it reports the error "ASCII codec can't encode character". The solution is to add a declaration at the top of the file: # -*- coding: utf-8 -*-, which tells the Python interpreter "this file is encoded in UTF-8; when you decode it, please use UTF-8". (Remember, this declaration must be at the very top of the file; if it is not at the top, the encoding problem remains unsolved.) For converting between encodings you can refer to other blog posts. (PS: my personal understanding is that "import sys; reload(sys); sys.setdefaultencoding('utf-8')" is roughly equivalent to "# -*- coding: utf-8 -*-".)
c. jieba.cut returns an iterable generator. You can use a for loop to obtain each word (a Unicode string) produced by the segmenter, or convert the result to a list with list(jieba.cut(...)).
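A minimal sketch of point c, assuming jieba is already installed (the sample sentence is the same one used in the example below):

# -*- coding: utf-8 -*-
# Point c: jieba.cut returns a generator of unicode words.
import jieba

gen = jieba.cut(u"我来到北京清华大学")          # a generator; nothing is segmented yet
for word in gen:                                # iterate to pull out each word (unicode)
    print(word)

words = list(jieba.cut(u"我来到北京清华大学"))  # or materialize the generator as a list
print("/ ".join(words))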
3. The following example illustrates the usage described above (the sentence "我来到北京清华大学" means "I came to Beijing Tsinghua University"):

# -*- coding: utf-8 -*-
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print "Full Mode:", "/ ".join(seg_list)

seg_list = jieba.cut("我来到北京清华大学")
print "Default Mode:", "/ ".join(seg_list)
The output is:
Full Mode: 我/ 来/ 来到/ 到/ 北/ 北京/ 京/ 清/ 清华/ 清华大学/ 华/ 华大/ 大/ 大学/ 学
Default Mode: 我/ 来到/ 北京/ 清华大学 (i.e., "I / came to / Beijing / Tsinghua University")
Third, other features of jieba Chinese word segmentation
1. Adding and managing a custom dictionary
jieba stores all of its built-in dictionary content in dict.txt, and you can keep improving the contents of dict.txt yourself.
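A hedged sketch of supplementing the built-in dictionary with your own entries through jieba.load_userdict (the file name userdict.txt, its sample entries, and the test sentence are made up for illustration):

# -*- coding: utf-8 -*-
# Sketch: load a user dictionary on top of the built-in dict.txt.
# "userdict.txt" is a hypothetical file; each line is "word frequency", e.g.
#   云计算 5
#   自然语言处理 10
import jieba

jieba.load_userdict("userdict.txt")   # merge the custom entries into the dictionary

# words from the user dictionary are now kept whole during segmentation
print("/ ".join(jieba.cut(u"云计算和自然语言处理都离不开分词")))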
2. Keyword Extraction
Keywords are extracted by computing TF-IDF weights for the words obtained after segmentation.
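A minimal sketch using jieba's analyse submodule, which provides TF-IDF based keyword extraction (the sample text and the topK value are just placeholders):

# -*- coding: utf-8 -*-
# Sketch: TF-IDF keyword extraction with jieba.analyse.
import jieba.analyse

text = u"结巴分词是一个 Python 中文分词组件，支持精确模式、全模式和搜索引擎模式。"
keywords = jieba.analyse.extract_tags(text, topK=5)   # the 5 words with the highest TF-IDF weight
print("/ ".join(keywords))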