Learning to use jieba Chinese word segmentation (Python)

Source: Internet
Author: User

Chinese word segmentation tool: jieba (结巴, literally "stutter")
GitHub address: https://github.com/fxsjy/jieba

Segmentation modes
    1. Exact mode (default): cuts the sentence as precisely as possible; suitable for text analysis.
    2. Full mode: scans out every fragment of the sentence that can form a word, but cannot resolve ambiguity.
    3. Search engine mode: building on exact mode, splits long words again to improve recall; suitable for segmenting text when building a search engine's inverted index, and the granularity is relatively fine.

      Note: jieba.cut and jieba.cut_for_search return an iterable generator, not a list.

Example code 1
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Function: jieba segmentation test, basic segmentation functions
Time: May 21, 2016 15:44:24
"""
import jieba

# Segmentation modes
# Sentence: "This is a book on information retrieval"
seg = jieba.cut("这是一本关于信息检索的书", cut_all=True)  # cut_all=True: full mode
print("Full mode: " + "/ ".join(seg))

seg = jieba.cut("这是一本关于信息检索的书", cut_all=False)  # cut_all=False: exact mode
print("Exact mode: " + "/ ".join(seg))

# Sentence: "He came to the NetEase Hangzhou Research building"
seg = jieba.cut("他来到了网易杭研大厦")  # default is exact mode
print(", ".join(seg))

# Sentence: "Xiaoming received his master's degree from the Institute of Computing
# Technology, Chinese Academy of Sciences, then studied at Kyoto University in Japan"
seg = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg))
Add a custom dictionary

Usage: jieba.load_userdict(file_name)
file_name is a file-like object or the path to a custom dictionary.
Dictionary format: one word per line; each line has up to three parts: the word, the word frequency (optional), and the part of speech (optional), separated by spaces and always in that order.
Dictionary example:

523i3300332000
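The line format described above can be sketched with a small parser. This only illustrates the format and is not jieba's own loader; the function name parse_userdict_line and the sample entries are hypothetical.

```python
def parse_userdict_line(line):
    """Split one dictionary line into (word, freq, pos).

    Illustrative only: freq and pos are None when they are omitted.
    """
    parts = line.strip().split(" ")
    word = parts[0]
    freq = None
    pos = None
    for field in parts[1:]:
        if field.isdigit():
            freq = int(field)  # a purely numeric field is the word frequency
        elif field:
            pos = field  # any other non-empty field is the part-of-speech tag
    return word, freq, pos


# Hypothetical entries: word only, word + frequency, word + frequency + POS tag.
sample_lines = [
    "信息检索",
    "云计算 5",
    "创新办 3 i",
]

entries = [parse_userdict_line(line) for line in sample_lines]
for entry in entries:
    print(entry)
```

Saving lines like these to a UTF-8 text file and passing its path to jieba.load_userdict is the usage the section describes.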
Example code 2
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Function: jieba segmentation test, adding a dictionary
Time: May 21, 2016 15:44:24
"""
import jieba

# Add a custom dictionary
jieba.load_userdict("userdic.txt")

# Sentence: "This is a book on information retrieval"
seg = jieba.cut("这是一本关于信息检索的书")
print("/ ".join(seg))
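POS tagging

Uses a tagging scheme compatible with ICTCLAS (NLPIR).

import jieba.posseg as pseg

# Sentence: "This is a book on information retrieval"
words = pseg.cut("这是一本关于信息检索的书")
for word, flag in words:
    print('%s %s' % (word, flag))

Note: the original version of this loop iterated with a single variable (for word in words) and its print failed under Anaconda Python for a reason unknown at the time. That form leaves flag undefined, so the loop above unpacks each item into word and flag.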

Other features
    1. Traditional Chinese segmentation
    2. Keyword extraction
    3. Parallel segmentation
    4. Returning the start and end positions of each word in the original text

References
Official documentation: https://github.com/fxsjy/jieba
Other: 1190000004061791
