Python Natural Language Processing Learning -- jieba Word Segmentation

jieba ("stuttering" in Chinese) is a Python library for Chinese word segmentation developed by Sun Junyi; the jieba project can be viewed on GitHub.

To use jieba for Chinese word segmentation, you first need to install it. The author gives the following installation methods:

1. Fully automatic installation: easy_install jieba, or pip install jieba / pip3 install jieba

2. Semi-automatic installation: first download the package from http://pypi.python.org/pypi/jieba/, decompress it, and run python setup.py install

3. Manual installation: place the jieba directory in the current directory or in the site-packages directory
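
Whichever method you use, a quick sanity check confirms the installation worked. This is a minimal sketch; the version string shown in the comment is only an example and depends on the release you installed.

    import jieba

    print(jieba.__version__)                  # prints the installed release, e.g. '0.42.1'
    print(jieba.lcut("我来到北京清华大学"))    # should print a list of segmented words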

The author describes the algorithms used:

1. Efficient word-graph scanning based on a prefix dictionary, generating a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form.

2. Dynamic programming to find the maximum-probability path, i.e. the best segmentation according to word frequency.

3. For unknown (out-of-vocabulary) words, an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm. A toy sketch of the DAG and dynamic-programming idea follows this list.
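
The following is a minimal sketch of the prefix-dictionary/DAG idea, not jieba's actual implementation: the toy dictionary, its frequencies, and the helper names build_dag and best_segmentation are invented purely for illustration.

    import math

    # Toy dictionary with made-up frequencies; jieba's real dictionary has
    # hundreds of thousands of entries with corpus-derived frequencies.
    toy_dict = {"我": 100, "来到": 40, "北京": 60,
                "清华": 10, "清华大学": 50, "华大": 2, "大学": 30}
    total = float(sum(toy_dict.values()))

    def build_dag(sentence):
        """For each start index, list every end index where sentence[start:end] is a dictionary word."""
        dag = {}
        n = len(sentence)
        for start in range(n):
            ends = [end for end in range(start + 1, n + 1)
                    if sentence[start:end] in toy_dict]
            dag[start] = ends or [start + 1]   # fall back to a single character
        return dag

    def best_segmentation(sentence):
        """Dynamic programming over the DAG: pick the maximum log-probability path."""
        n = len(sentence)
        dag = build_dag(sentence)
        best = [None] * (n + 1)   # best[i] = (score of best path from i to the end, chosen end index)
        best[n] = (0.0, n)
        for start in range(n - 1, -1, -1):
            best[start] = max(
                (math.log(toy_dict.get(sentence[start:end], 1) / total) + best[end][0], end)
                for end in dag[start]
            )
        words, i = [], 0
        while i < n:              # walk the chosen path to recover the words
            end = best[i][1]
            words.append(sentence[i:end])
            i = end
        return words

    print(best_segmentation("我来到北京清华大学"))   # expected: ['我', '来到', '北京', '清华大学']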

Main functions:

1. Word segmentation

   There are two main segmentation methods: jieba.cut and jieba.cut_for_search.

The jieba.cut method accepts three input parameters:

1. the string to be segmented;

2. the cut_all parameter, which controls whether full mode is used;

3. the HMM parameter, which controls whether the HMM model is used.

jieba.cut("我来到北京清华大学", cut_all=True)  # "I came to Beijing's Tsinghua University"

The jieba.cut_for_search method accepts two input parameters:

1. the string to be segmented;

2. whether to use the HMM model.

This method is suitable for the segmentation that search engines use to build inverted indexes; the granularity is relatively fine.

jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # "Xiao Ming got his master's degree from the Chinese Academy of Sciences and later studied at Kyoto University in Japan"

The structure returned by the jieba.cut and jieba.cut_for_search methods is an iterable generator, so a for loop can be used to obtain each word (as a unicode string).

You can also use the jieba.lcut and jieba.lcut_for_search methods to return a list directly.
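
For example, a small sketch using the same demo sentence:

    words = jieba.lcut("我来到北京清华大学")   # returns a plain list instead of a generator
    print(words)                               # e.g. ['我', '来到', '北京', '清华大学']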

The author notes that the string to be segmented can be a unicode string, or a UTF-8 or GBK encoded byte string.

Note: it is not recommended to pass a GBK byte string directly, because it may be incorrectly decoded as UTF-8.
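
A safer pattern is to decode explicitly before segmenting. This is a minimal sketch; the GBK bytes here are only simulated for illustration.

    raw = "我来到北京清华大学".encode("gbk")   # simulate GBK-encoded input bytes
    text = raw.decode("gbk")                   # decode to a unicode string first
    print("/".join(jieba.cut(text)))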

Here are the demo and running results given by the author:

    #!/usr/bin/env python
    # coding: utf-8
    import jieba

    if __name__ == '__main__':
        seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
        print("Full Mode: " + "/".join(seg_list))     # full mode

        seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
        print("Default Mode: " + "/".join(seg_list))  # exact mode

        seg_list = jieba.cut("他来到了网易杭研大厦")   # exact mode is the default
        print(", ".join(seg_list))

        seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
        print(", ".join(seg_list))

Output of the demo:
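
As reported in jieba's documentation, the output looks roughly like this (exact word boundaries can vary with the jieba version and dictionary; note that 杭研 is not in the dictionary but is still recognized by the HMM/Viterbi new-word detection):

    Full Mode: 我/来到/北京/清华/清华大学/华大/大学
    Default Mode: 我/来到/北京/清华大学
    他, 来到, 了, 网易, 杭研, 大厦
    小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造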

From this we can see:

  Full mode: scans out all the words in the sentence that can form dictionary words; it is very fast, but it cannot resolve ambiguity. The output contains every possible word combination; for example, 清华大学 (Tsinghua University) is split into 清华, 清华大学, 华大, and 大学.

  Default mode (exact mode): tries to cut the sentence into the most precise segmentation, which is suitable for text analysis; for example, 清华大学 is output only as the single word 清华大学.

  Search engine mode: on the basis of exact mode, long words are segmented again to improve recall; suitable for word segmentation in search engines.

There is also jieba.Tokenizer(dictionary=DEFAULT_DICT), which creates a new custom tokenizer and can be used to work with several different dictionaries at the same time.

jieba.dt is the default tokenizer; all of the global segmentation functions are mappings of this tokenizer's methods.
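
For instance, here is a sketch of two independent tokenizers used side by side. The file user_dict.txt is a hypothetical custom dictionary in jieba's dictionary format (one word per line, with optional frequency and part-of-speech tag).

    import jieba

    default_tok = jieba.dt                                    # the global default tokenizer
    custom_tok = jieba.Tokenizer(dictionary="user_dict.txt")  # hypothetical custom dictionary file

    sentence = "他来到了网易杭研大厦"
    print("/".join(default_tok.cut(sentence)))
    print("/".join(custom_tok.cut(sentence)))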
