Jieba ("stuttering") is a Chinese word segmentation library for Python developed by Sun Junyi; you can view the jieba project on GitHub.
To use jieba for Chinese word segmentation, you first need to install it. The author gives the following installation methods:
1. Fully automatic installation: easy_install jieba, pip install jieba, or pip3 install jieba
2. Semi-automatic installation: first download http://pypi.python.org/pypi/jieba/, decompress it, then run python setup.py install
3. Manual installation: place the jieba directory in the current directory or in the site-packages directory
The author describes the algorithms used:
1. Efficient word-graph scanning based on a prefix dictionary, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence
2. Dynamic programming to find the maximum-probability path, yielding the most likely segmentation based on word frequencies
3. For unknown (out-of-vocabulary) words, an HMM model based on the word-forming capability of Chinese characters, solved with the Viterbi algorithm
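The first two steps above can be sketched in pure Python. This is a toy illustration, not jieba's actual implementation; the dictionary and frequency counts below are invented for the example:

```python
import math

# Toy frequency dictionary (invented numbers, for illustration only).
FREQ = {"我": 5, "来": 3, "来到": 4, "到": 2, "北": 2, "京": 2, "北京": 6}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """Step 1: for each start position, list the end positions of every
    dictionary word beginning there, forming a DAG over the sentence."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def max_prob_path(sentence, dag):
    """Step 2: dynamic programming from right to left over log
    frequencies to find the maximum-probability segmentation."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    # Walk the chosen route to produce the word list.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

sentence = "我来到北京"
dag = build_dag(sentence)
print(max_prob_path(sentence, dag))  # → ['我', '来到', '北京']
```

Because 来到 and 北京 have higher frequencies than their component characters, the DP path keeps them whole rather than cutting character by character.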
Main functions:
1. Word segmentation
The two main methods are jieba.cut and jieba.cut_for_search.
The jieba.cut method accepts three input parameters:
1. the string to be segmented;
2. the cut_all parameter, which controls whether full mode is used;
3. the HMM parameter, which controls whether the HMM model is used
jieba.cut("我来到北京清华大学", cut_all=True)
The jieba.cut_for_search method accepts two input parameters:
1. the string to be segmented;
2. whether to use the HMM model.
This method is suitable for segmentation when building an inverted index for a search engine; the granularity is relatively fine
jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
The structure returned by the jieba.cut and jieba.cut_for_search methods is an iterable generator, which can be traversed with a for loop to obtain each word (unicode).
You can also use the jieba.lcut and jieba.lcut_for_search methods to return a list directly.
Note from the author: the string to be segmented can be a unicode string, a UTF-8 string, or a GBK string.
Note: it is not recommended to pass in a GBK string directly, as it may be incorrectly decoded as UTF-8
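The safe pattern implied by the note above is to decode GBK bytes to a unicode string explicitly before segmentation (a small sketch using only the standard library; jieba itself is not invoked here):

```python
# A GBK-encoded byte string, e.g. read from a legacy Chinese text file.
raw = "我来到北京".encode("gbk")

# Decode explicitly to a unicode str before passing the text to the
# segmenter; handing the raw GBK bytes over risks them being
# mis-decoded as UTF-8.
text = raw.decode("gbk")
print(text)  # → 我来到北京
```

The same pattern applies when reading files: open them with encoding="gbk" so the segmenter only ever sees unicode strings.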
Here are the demo and its output as given by the author:
#!/usr/bin/env python
# coding:utf-8
import jieba

if __name__ == '__main__':
    seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
    print("Full Mode: " + "/".join(seg_list))  # full mode

    seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
    print("Default Mode: " + "/".join(seg_list))  # precise mode

    seg_list = jieba.cut("他来到了网易杭研大厦")  # precise mode is the default
    print(", ".join(seg_list))

    seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
    print(", ".join(seg_list))
From the output we can see:
Full mode: scans out all the words in the sentence that can form dictionary words; very fast, but it cannot resolve ambiguity. The output is every possible word combination: for example, 清华大学 (Tsinghua University) is split into 清华, 清华大学, 华大, and 大学
Default mode (precise mode): tries to cut the sentence apart most accurately, which is suitable for text analysis; for example, 清华大学 is output only as the single word 清华大学
Search engine mode: on the basis of precise mode, long words are segmented again, improving recall; suitable for search engine segmentation
There is another method, jieba.Tokenizer(dictionary=DEFAULT_DICT), which creates a custom tokenizer; this makes it possible to use different dictionaries at the same time.
jieba.dt is the default tokenizer, and all global segmentation functions are mappings of this tokenizer's methods.
Python Natural Language Processing Learning -- jieba Word Segmentation