The use of Chinese word segmentation software (under Python)

Source: Internet
Author: User

The Chinese word segmenters I use most often at present are jieba ("stutter" segmentation), NLPIR, and a few others.

I have been using jieba recently, so here is a short recommendation: it works well.

First, an introduction to jieba

jieba's approach to Chinese word segmentation rests on three basic principles:

    1. Efficient word-graph scanning based on a trie structure generates a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form.
    2. Dynamic programming finds the maximum-probability path through the graph, which gives the best segmentation based on word frequency.
    3. For out-of-vocabulary words, an HMM based on the word-forming capability of Chinese characters is used, decoded with the Viterbi algorithm.
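The first two principles can be sketched in a few lines of Python. This is only an illustration, not jieba's own code: it assumes a tiny dictionary with made-up frequencies (FREQ), uses a plain dict lookup in place of a trie, and leaves out the HMM step for out-of-vocabulary words. The function names build_dag, best_path and segment are chosen here for the example.

```python
import math

# Toy dictionary with made-up frequencies (hypothetical values, not jieba's dict.txt).
FREQ = {
    "我": 5, "来": 3, "来到": 4, "到": 3, "北": 1, "北京": 6, "京": 1,
    "清": 1, "清华": 3, "清华大学": 5, "华": 1, "华大": 1,
    "大": 2, "大学": 5, "学": 2,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        # Fall back to the single character so every position has an outgoing edge.
        dag[start] = ends or [start]
    return dag


def best_path(sentence, dag):
    """Right-to-left dynamic programming: best (log probability, word end) per start index."""
    n = len(sentence)
    route = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1) / TOTAL) + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence):
    """Follow the best route from left to right and collect the words."""
    route = best_path(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print("/".join(segment("我来到北京清华大学")))
```

Because every word probability is below 1, a path made of fewer, more frequent words scores higher, which is why the dynamic programming step prefers the long dictionary entry over its shorter pieces.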

Second, installation and use (Linux)

1. Download the package, unzip it, and from the extracted directory run: python setup.py install

Hint: a. A good habit after downloading software is to read the README first and only then start working with it. (Skipping the README and going straight to trial and error plus Baidu leads to many detours.)

b. The install command may fail with a permission error (some people hit this because they lack sufficient privileges). Run: sudo !! (where "!!" stands for the previous command, here the install command above). With sudo the installation runs normally.

2. When segmenting with jieba, the one function you must use is jieba.cut(arg1, arg2). This is the segmentation function, and understanding the following three points is enough to use it:

a. The cut method accepts two input parameters: the first (arg1) is the string to be segmented, and the second (arg2, cut_all) controls the segmentation mode.

There are two segmentation modes: the default mode, which tries to cut the sentence as precisely as possible and suits text analysis, and full mode, which scans out all the words that can be formed in a sentence and suits search engines.

b. The string to be segmented can be a GBK string, a UTF-8 string, or unicode.

People using Python should pay attention to encoding. Python 2 treats source files as ASCII by default, so non-ASCII characters (such as Chinese characters in the code) trigger an error; the message reported here was "'ascii' codec can't encode character". The solution is to add a declaration at the top of the file: # -*- coding: utf-8 -*-, which tells the Python interpreter: "This file is encoded in UTF-8; please decode it as UTF-8." (Remember that the declaration must be on the first or second line of the file; placed anywhere else it has no effect and the encoding problem remains.) For converting between encodings, you can refer to the blog. (PS: my personal understanding was that "import sys; reload(sys); sys.setdefaultencoding('utf-8')" is equivalent to "# -*- coding: utf-8 -*-". Strictly speaking the two differ: the declaration sets the encoding of the source file, while sys.setdefaultencoding changes the default codec Python 2 uses for implicit str/unicode conversions at runtime.)

c. jieba.cut returns an iterable generator: you can use a for loop to get each word (unicode) produced by segmentation, or convert the result to a list with list(jieba.cut(...)).
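Point b above says the input may be GBK, UTF-8, or unicode. One way such input could be normalized before segmentation is sketched below. This is an assumption-laden illustration in Python 3 (where "unicode" is simply str), not a description of what jieba does internally, and to_text is a hypothetical helper name.

```python
def to_text(data):
    """Return a text string from str, UTF-8 bytes, or GBK bytes (hypothetical helper)."""
    if isinstance(data, str):
        return data
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume GBK; bytes that still cannot be decoded are dropped.
        return data.decode("gbk", errors="ignore")


sample = "我来到北京清华大学"
for raw in (sample, sample.encode("utf-8"), sample.encode("gbk")):
    # All three inputs should come back as the same text string.
    print(to_text(raw) == sample)
```

Trying UTF-8 first is a deliberate choice in this sketch: GBK-encoded Chinese text is very rarely valid UTF-8, whereas almost any byte sequence decodes under GBK with errors ignored, so the reverse order would hide mistakes.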

3. The following example shows the usage described in the jieba documentation:

    # -*- coding: utf-8 -*-
    import jieba

    # "我来到北京清华大学" means "I came to Tsinghua University in Beijing".
    seg_list = jieba.cut("我来到北京清华大学", cut_all=True)  # full mode
    print("Full Mode: " + "/".join(seg_list))

    seg_list = jieba.cut("我来到北京清华大学")  # default mode
    print("Default Mode: " + "/".join(seg_list))

The output is:

    Full Mode: 我/来/来到/到/北/北京/京/清/清华/清华大学/华/华大/大/大学/学
    Default Mode: 我/...

Third, other features of jieba

1. Add or manage a custom dictionary

All of jieba's dictionary content is stored in dict.txt, and you can keep improving the content of dict.txt over time.
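As far as I know, jieba's dictionary files hold one entry per line in the form "word frequency tag", and in a user dictionary (loaded with jieba.load_userdict) the frequency and the part-of-speech tag may be omitted. The sketch below parses lines in that assumed format so you can inspect or validate entries before adding them. parse_entry is a hypothetical helper, and the sample entries are made up for illustration.

```python
def parse_entry(line):
    """Parse 'word [freq] [tag]' into (word, freq, tag); missing fields become None."""
    parts = line.split()
    if not parts:
        return None  # blank line
    word, rest = parts[0], parts[1:]
    freq = tag = None
    if rest and rest[0].isdigit():
        freq = int(rest[0])
        rest = rest[1:]
    if rest:
        tag = rest[0]
    return word, freq, tag


# Made-up sample entries, not taken from jieba's dict.txt.
for line in ["清华大学 5 nt", "云计算 5", "结巴 n"]:
    print(parse_entry(line))
```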

2. Keyword extraction

Keywords are extracted by computing the TF-IDF weight of each word after segmentation.
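To make the TF-IDF idea concrete, here is a minimal sketch that scores words in an already segmented document against a small corpus. It does not call jieba (which offers jieba.analyse.extract_tags for this purpose); the corpus, the smoothing formula, and the names tf_idf and top_keywords are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each word by term frequency times a smoothed inverse document frequency."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        scores[word] = (count / total) * idf
    return scores


def top_keywords(doc_tokens, corpus, k=3):
    """Return the k highest-scoring words, breaking ties alphabetically."""
    scores = tf_idf(doc_tokens, corpus)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy corpus of already segmented sentences (made up for illustration).
corpus = [
    ["我", "来到", "北京", "清华大学"],
    ["我", "喜欢", "北京"],
    ["我", "在", "清华大学", "学习", "分词", "分词"],
]
print(top_keywords(corpus[2], corpus, k=1))
```

A word that appears in every document, such as "我" here, gets the lowest IDF, so frequent but uninformative words sink to the bottom of the ranking.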
