Jieba ("Stutter") Chinese word segmentation

Last Update:2016-06-05 Source: Internet

Author: User

Reposted from: http://www.gowhich.com/blog/147?utm_source=tuicool&utm_medium=referral

Source code: https://github.com/fxsjy/jieba

Demo Address: http://jiebademo.ap01.aws.af.cm/

Features

1. Supports three segmentation modes:

a. Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis.
b. Full mode: scans out every substring of the sentence that can form a word; very fast, but cannot resolve ambiguity.
c. Search engine mode: builds on accurate mode and splits long words again to improve recall; suitable for search engine segmentation.
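Full mode can be illustrated with a minimal sketch. This is not jieba's code: the toy dictionary and the function name full_mode are assumptions made for illustration, and the sample sentence 我来到北京清华大学 means "I came to Tsinghua University in Beijing".

```python
# Hypothetical toy dictionary; jieba's real dictionary is far larger.
DICTIONARY = {"我", "来到", "北京", "清华", "清华大学", "华大", "大学"}


def full_mode(sentence, dictionary=DICTIONARY):
    """Return every dictionary word found in the sentence, left to right."""
    words = []
    covered_until = -1  # last index covered by an emitted multi-character word
    for start in range(len(sentence)):
        # End positions (exclusive) of multi-character words starting here.
        ends = [end for end in range(start + 2, len(sentence) + 1)
                if sentence[start:end] in dictionary]
        if ends:
            for end in ends:
                words.append(sentence[start:end])
            covered_until = max(covered_until, ends[-1] - 1)
        elif start > covered_until:
            # No longer word starts here and nothing covers this character.
            words.append(sentence[start])
    return words


print("/".join(full_mode("我来到北京清华大学")))
# 我/来到/北京/清华/清华大学/华大/大学
```

Overlapping words such as 清华, 清华大学 and 华大 are all emitted, which is why full mode is fast but leaves ambiguity unresolved.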

2. Supports Traditional Chinese segmentation.
3. Supports custom dictionaries.

Installation

1. Python 2.x

Fully automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: first download from http://pypi.python.org/pypi/jieba/, unpack it, and run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory
Then reference it with import jieba

2. Python 3.x

The master branch currently supports only Python 2.x.
A Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

    git clone https://github.com/fxsjy/jieba.git
    git checkout jieba3k
    python setup.py install

Algorithm

Efficient word-graph scanning based on a trie structure generates a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form.
Dynamic programming then finds the maximum-probability path, that is, the best segmentation based on word frequency.
For out-of-vocabulary words, an HMM based on the word-forming capability of Chinese characters is used, decoded with the Viterbi algorithm.
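The first two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not jieba's implementation: the word frequencies in FREQ are invented, a plain dict lookup stands in for the trie, and the HMM step for out-of-vocabulary words is omitted.

```python
import math

# Hypothetical toy word frequencies (assumption, for illustration only).
FREQ = {"我": 50, "来": 20, "到": 20, "来到": 30, "北": 5, "京": 5, "北京": 40,
        "清": 5, "华": 5, "清华": 10, "清华大学": 20, "华大": 2,
        "大": 10, "学": 10, "大学": 30}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the inclusive end indices of dictionary words."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]  # unknown character: keep it as one word
    return dag


def best_route(sentence, dag):
    """Right-to-left DP: route[i] = (best log-probability of sentence[i:], end)."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1)) - math.log(TOTAL)
             + route[end + 1][0], end)
            for end in dag[start])
    return route


def cut(sentence):
    """Follow the maximum-probability path through the DAG."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print("/".join(cut("我来到北京清华大学")))
# 我/来到/北京/清华大学
```

Working from the end of the sentence lets each position reuse the best score already computed for the text to its right, so every DAG edge is examined only once.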

Function 1): Word segmentation

The jieba.cut method accepts two input parameters: 1) the first is the string to segment; 2) the cut_all parameter controls whether full mode is used.
The jieba.cut_for_search method accepts one parameter, the string to segment. It is suited to segmenting text for a search engine's inverted index, and its granularity is relatively fine.
Note: the string to segment can be a GBK string, a UTF-8 string, or unicode.
Both jieba.cut and jieba.cut_for_search return an iterable generator. Use a for loop to get each word (unicode) produced by segmentation, or convert the result to a list with list(jieba.cut(...)).
Code example (word segmentation)

    #encoding=utf-8
    import jieba

    # The sample sentences are shown in English translation; jieba segments Chinese text.
    seg_list = jieba.cut("I came to Tsinghua University in Beijing", cut_all=True)
    print "Full mode:", "/".join(seg_list)  # full mode

    seg_list = jieba.cut("I came to Tsinghua University in Beijing", cut_all=False)
    print "Default mode:", "/".join(seg_list)  # accurate mode

    seg_list = jieba.cut("He came to NetEase Hangzhou Research Building")  # the default is accurate mode
    print ", ".join(seg_list)

    seg_list = jieba.cut_for_search("Xiao Ming graduated from the Institute of Chinese Academy of Sciences, after studying at Kyoto University in Japan")  # search engine mode
    print ", ".join(seg_list)

Output:
[Full mode]: I / came to / Beijing / Tsinghua / Tsinghua University / Huada / University
[Accurate mode]: I / came to / Beijing / Tsinghua University
[New word recognition]: He, came to, NetEase, Hang Research, Building (here "Hang Research" is not in the dictionary, but the Viterbi algorithm still identifies it)
[Search engine mode]: Xiao Ming, master's, graduated, from, China, science, academy, Academy of Sciences, Chinese Academy of Sciences, computing, Institute of Computing, after, in, Japan, Kyoto, university, Kyoto University of Japan, further study

Function 2): Add a custom dictionary

Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba can recognize new words, adding them yourself ensures higher accuracy.
Usage:

    jieba.load_userdict(file_name)  # file_name is the path to the custom dictionary

The dictionary format is the same as dict.txt: one word per line. Each line has three parts, the word, the word frequency, and the part-of-speech tag (which can be omitted), separated by spaces.
Example:
Custom dictionary:

    Cloud computing 5
    Li Xiaofu 2 nr
    Innovation Office 3 i
    easy_install 3 eng
    useful 300
    Han Yu moment opportune 3 nz
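The line format described above can be read with a few lines of code. This is only a sketch of the format, not jieba's loader: parse_userdict is a hypothetical helper, and the sample entries are Chinese words without internal spaces (云计算 "cloud computing", 李小福 "Li Xiaofu").

```python
def parse_userdict(lines):
    """Parse 'word frequency [tag]' lines into (word, frequency, tag) tuples."""
    entries = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        freq = int(parts[1])
        tag = parts[2] if len(parts) > 2 else None  # the tag may be omitted
        entries.append((word, freq, tag))
    return entries


sample = ["云计算 5", "李小福 2 nr", "easy_install 3 eng"]
for entry in parse_userdict(sample):
    print(entry)
```

An entry without a tag, such as the first sample line, comes back with None in the third position.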

Usage example:

    #encoding=utf-8
    import sys
    sys.path.append("../")
    import jieba
    jieba.load_userdict("userdict.txt")
    import jieba.posseg as pseg

    test_sent = "Li Xiaofu is an innovation director and an expert in cloud computing."
    test_sent += 'For example, I entered a title with "Han Yu moment opportune" and added the word to the custom dictionary as type N'
    words = jieba.cut(test_sent)
    for w in words:
        print w

    result = pseg.cut(test_sent)
    for w in result:
        print w.word, "/", w.flag, ", ",

    print "\n========"

    terms = jieba.cut('easy_install is great')
    for t in terms:
        print t
    print '-------------------------'
    terms = jieba.cut("python's regular expression is useful")
    for t in terms:
        print t

Before: Li Xiaofu / is / innovation / office / director / also / is / cloud / computing / aspect / expert /
After loading the custom dictionary: Li Xiaofu / is / Innovation Office / director / also / is / cloud computing / aspect / expert /
"Enhancing ambiguity correction through user-defined dictionaries" --- https://github.com/fxsjy/jieba/issues/14

Function 3): Keyword extraction

    jieba.analyse.extract_tags(sentence, topK)  # import jieba.analyse first

Description

sentence is the text from which to extract keywords.

topK is the number of keywords with the highest TF/IDF weights to return; the default value is 20.
Code example (keyword extraction)

    import sys
    sys.path.append('../')

    import jieba
    import jieba.analyse
    from optparse import OptionParser

    USAGE = "usage:    python extract_tags.py [file name] -k [top k]"

    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    opt, args = parser.parse_args()

    if len(args) < 1:
        print USAGE
        sys.exit(1)

    file_name = args[0]

    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)

    content = open(file_name, 'rb').read()

    tags = jieba.analyse.extract_tags(content, topK=topK)

    print ",".join(tags)
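The TF/IDF ranking behind keyword extraction can be sketched on an already segmented text. This is an illustration, not the code of jieba.analyse: the IDF values, the default_idf fallback, and the rule of skipping single-character words are assumptions.

```python
from collections import Counter


def extract_tags_sketch(words, idf, top_k=20, default_idf=1.0):
    """Rank words by term frequency times inverse document frequency."""
    # Assumption: single-character words are ignored as keyword candidates.
    counts = Counter(w for w in words if len(w) > 1)
    total = sum(counts.values())
    weights = {w: (c / total) * idf.get(w, default_idf)
               for w, c in counts.items()}
    # Highest weight first; ties are broken alphabetically for stable output.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:top_k]


# Hypothetical IDF table and segmented text.
IDF = {"云计算": 3.0, "专家": 2.0, "创新": 2.5}
segmented = ["云计算", "专家", "云计算", "是", "创新", "专家", "云计算"]
print(",".join(extract_tags_sketch(segmented, IDF, top_k=2)))
# 云计算,专家
```

A word scores highly when it is frequent in the text at hand (TF) but rare across a reference corpus (IDF), which is why common function words drop out of the result.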

Function 4): Part-of-speech tagging

Tags the part of speech of each word after segmentation, using a tagging scheme compatible with ICTCLAS.
Usage example

    >>> import jieba.posseg as pseg
    >>> words = pseg.cut("I love Beijing Tiananmen")
    >>> for w in words:
    ...     print w.word, w.flag
    ...
    I r
    love v
    Beijing ns
    Tiananmen ns

Function 5): Parallel segmentation

Principle: the target text is split by line, the lines are distributed to multiple Python processes to be segmented in parallel, and the results are then merged, which yields a considerable speedup.
It is based on Python's built-in multiprocessing module and currently does not support Windows.
Usage:

    jieba.enable_parallel(4)  # enable parallel segmentation; the argument is the number of parallel processes
    jieba.disable_parallel()  # disable parallel segmentation

Example:

    import sys, time
    sys.path.append("../../")
    import jieba
    jieba.enable_parallel(4)

    url = sys.argv[1]
    content = open(url, "rb").read()
    t1 = time.time()
    words = list(jieba.cut(content))

    t2 = time.time()
    tm_cost = t2 - t1

    log_f = open("1.log", "wb")
    for w in words:
        print >> log_f, w.encode("utf-8"), "/",

    print 'speed', len(content) / tm_cost, "bytes/second"

Experimental results: on a 4-core 3.4 GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached 1 MB/s, 3.3 times the speed of the single-process version.
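The split, distribute, and merge principle can be sketched with the multiprocessing module. This is a minimal illustration, not jieba's implementation: segment_line is a stand-in that splits on whitespace, and all function names are hypothetical.

```python
from multiprocessing import Pool


def segment_line(line):
    """Stand-in segmenter (assumption): split a line on whitespace."""
    return line.split()


def split_lines(text):
    """Split the text by line so that each line can be handled independently."""
    return text.splitlines()


def merge(parts):
    """Concatenate the per-line word lists, preserving line order."""
    return [word for part in parts for word in part]


def parallel_cut(text, processes=4):
    """Segment each line in a worker process and merge the results."""
    with Pool(processes) as pool:
        parts = pool.map(segment_line, split_lines(text))
    return merge(parts)
```

Call parallel_cut from under an if __name__ == "__main__": guard so that worker processes can import the module safely. Because Pool.map returns results in input order, the merged words keep the order of the original text.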

Other dictionaries

A dictionary file that uses less memory: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
A dictionary file with better support for Traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need, then either overwrite jieba/dict.txt or call jieba.set_dictionary('data/dict.txt.big')

Change to the module initialization mechanism: lazy loading (since version 0.28)

jieba uses lazy loading: "import jieba" does not immediately load the dictionary. The dictionary is loaded and the trie built only when first needed. You can also initialize jieba manually if you want to.

    import jieba
    jieba.initialize()  # manual initialization (optional)

Before version 0.28 the path of the main dictionary could not be specified. With lazy loading, you can change the main dictionary path:

    jieba.set_dictionary('data/dict.txt.big')
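The lazy-loading pattern can be sketched with a small class. This is an illustration of the idea only: LazySegmenter, its loader argument, and fake_loader are hypothetical and do not reflect jieba's internals.

```python
class LazySegmenter:
    """Load the dictionary on first use instead of at construction time."""

    def __init__(self, loader, dictionary_path="dict.txt"):
        self._loader = loader        # callable: path -> set of words
        self._path = dictionary_path
        self._words = None           # None means "not loaded yet"

    def set_dictionary(self, path):
        """Change the dictionary path; the next use triggers a reload."""
        self._path = path
        self._words = None

    def initialize(self):
        """Load the dictionary now if it has not been loaded (optional)."""
        if self._words is None:
            self._words = self._loader(self._path)

    def knows(self, word):
        self.initialize()            # lazy trigger on first real use
        return word in self._words


loaded_paths = []


def fake_loader(path):
    """Stand-in for reading a dictionary file; records each load."""
    loaded_paths.append(path)
    return {"北京", "大学"}


segmenter = LazySegmenter(fake_loader)
print(loaded_paths)                  # nothing has been loaded yet
segmenter.knows("北京")
segmenter.set_dictionary("data/dict.txt.big")
segmenter.knows("大学")
print(loaded_paths)                  # one load per dictionary path
```

Deferring the load is also what makes changing the dictionary path possible: the path only has to be final at the moment of first use.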

Example:

    #encoding=utf-8
    import sys
    sys.path.append("../")
    import jieba

    def cuttest(test_sent):
        result = jieba.cut(test_sent)
        print "  ".join(result)

    def testcase():
        cuttest("It's a pitch night. My name is Monkey king, I love Beijing, I love Python and C++.")
        cuttest("I don't like Japanese kimonos.")
        cuttest("Thunder Monkeys Return to Earth.")
        cuttest("Letter of the Virgin Officer every month through subordinate departments have to tell the 24-port switch and other technical device installation work")
        cuttest("I need low-rent housing")
        cuttest("Yonghe Clothing Jewelry Co., Ltd.")
        cuttest("I love Beijing Tiananmen")
        cuttest("abc")
        cuttest("Hidden Markov")
        cuttest("Thunder Monkey is a good site")

    if __name__ == "__main__":
        testcase()
        jieba.set_dictionary("foobar.txt")
        print "================================"
        testcase()

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
