Chinese word segmentation tool Jieba


Source download address: https://github.com/fxsjy/jieba

Demo Address: http://jiebademo.ap01.aws.af.cm/

Features

1. Supports three segmentation modes:

   a. Accurate mode: tries to segment the sentence as accurately as possible; suitable for text analysis.
   b. Full mode: scans out every fragment of the sentence that could form a word; very fast, but it cannot resolve ambiguity.
   c. Search engine mode: on top of accurate mode, long words are segmented again to improve recall; suitable for search engine segmentation.

2. Supports segmentation of Traditional Chinese text

3. Supports custom dictionaries

Installation

1. Installation under Python 2.x

Fully automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: download from http://pypi.python.org/pypi/jieba/ first, then unpack and run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory
Import it with import jieba

2. Installation under Python 3.x

The master branch currently supports only Python 2.x.
The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

git clone https://github.com/fxsjy/jieba.git
git checkout jieba3k
python setup.py install
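
To verify the installation, a quick smoke test should print a segmented sentence. This is a minimal sketch (the sample sentence is arbitrary); print(...) with a single argument works under both Python 2 and 3:

#encoding=utf-8
import jieba

# If jieba and its dictionary load correctly, this prints the sentence
# segmented in accurate mode.
print("/".join(jieba.cut("我来到北京清华大学")))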


Algorithm implementation:

Efficient word-graph scanning based on a trie structure generates a directed acyclic graph (DAG) of all possible word combinations formed by the Chinese characters in a sentence.
Dynamic programming is used to find the maximum-probability path through the DAG, yielding the segmentation with the highest probability based on word frequencies (a toy sketch of these two steps follows this list).
For unknown (out-of-vocabulary) words, an HMM model based on the word-forming ability of Chinese characters is used, decoded with the Viterbi algorithm.
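
A minimal, self-contained sketch of the DAG construction and maximum-probability-path steps described above. This is an illustration only, not jieba's actual code; the miniature dictionary and frequency counts are invented for the example:

# -*- coding: utf-8 -*-
import math

# Invented mini dictionary: word -> frequency count (illustration only;
# jieba uses its full dict.txt).
FREQ = {u"北": 5, u"京": 2, u"北京": 50, u"大": 10, u"学": 8,
        u"大学": 40, u"北京大学": 60}
TOTAL = float(sum(FREQ.values()))

def build_dag(sentence):
    # For each start index i, record every end index j such that
    # sentence[i:j] is a dictionary word; fall back to a single character.
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]
    return dag

def max_prob_path(sentence):
    # Dynamic programming from right to left: route[i] holds the best
    # (log probability, next split point) for the suffix starting at i.
    dag = build_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i])
    # Walk the best route to collect the words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print("/".join(max_prob_path(u"北京大学")))  # prints: 北京大学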


Functions

Function 1): Word segmentation

The jieba.cut method accepts two input parameters: 1) the first parameter is the string to be segmented; 2) the cut_all parameter controls whether full mode is used.
The jieba.cut_for_search method accepts one parameter: the string to be segmented. It is suitable for segmentation when building an inverted index for a search engine, and its granularity is relatively fine.
Note: the string to be segmented can be a GBK string, a UTF-8 string, or Unicode.
The structure returned by jieba.cut and jieba.cut_for_search is an iterable generator: you can loop over it to get each segmented word (Unicode), or convert it to a list with list(jieba.cut(...)).
Code example (word segmentation):

#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print "Full mode:", "/".join(seg_list)  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print "Default mode:", "/".join(seg_list)  # accurate mode

seg_list = jieba.cut("他来到了网易杭研大厦")  # the default is accurate mode
print ", ".join(seg_list)

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print ", ".join(seg_list)
Output:

Full mode: 我/来到/北京/清华/清华大学/华大/大学
Accurate mode: 我/来到/北京/清华大学
New word recognition: 他, 来到, 了, 网易, 杭研, 大厦 (here "杭研" is not in the dictionary, but it is still recognized by the Viterbi algorithm)
Search engine mode: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造


Function 2): Add a custom dictionary

Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba can recognize new words on its own, adding them yourself guarantees higher accuracy.
Usage:

jieba.load_userdict(file_name)  # file_name is the path to the custom dictionary
The dictionary format is the same as dict.txt: one word per line, with each line split into three space-separated parts: the word, the word frequency, and finally the part of speech (which can be omitted).
Example:
A custom dictionary:
云计算 5
李小福 2 nr
创新办 3 i
easy_install 3 eng
好用 300
韩玉赏鉴 3 nz
Usage example:
#encoding=utf-8
import sys
sys.path.append("../")
import jieba
jieba.load_userdict("userdict.txt")
import jieba.posseg as pseg

test_sent = "李小福是创新办主任也是云计算方面的专家;"
test_sent += "例如我输入一个带“韩玉赏鉴”的标题，在自定义词库中也增加了此词为N类"
words = jieba.cut(test_sent)
for w in words:
    print w

result = pseg.cut(test_sent)

for w in result:
    print w.word, "/", w.flag, ", ",

print "\n========"

terms = jieba.cut('easy_install is great')
for t in terms:
    print t
print '-------------------------'
terms = jieba.cut('python 的正则表达式是好用的')
for t in terms:
    print t
Before: 李小福 / 是 / 创新 / 办 / 主任 / 也 / 是 / 云 / 计算 / 方面 / 的 / 专家 /
After loading the custom dictionary: 李小福 / 是 / 创新办 / 主任 / 也 / 是 / 云计算 / 方面 / 的 / 专家 /
"Improve the ability to correct ambiguity with user-defined dictionaries" --- https://github.com/fxsjy/jieba/issues/14


Function 3): Keyword extraction
jieba.analyse.extract_tags(sentence, topK)  # requires import jieba.analyse first
Description:

sentence is the text to extract keywords from

topK is how many of the highest TF-IDF-weighted keywords to return; the default value is 20
Code example (keyword extraction):

import sys
sys.path.append('../')

import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage: python extract_tags.py [file name] -k [top k]"

parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()

if len(args) < 1:
    print USAGE
    sys.exit(1)

file_name = args[0]

if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

content = open(file_name, 'rb').read()

tags = jieba.analyse.extract_tags(content, topK=topK)

print ",".join(tags)


Function 4): POS tagging

Labels each word with its part of speech after segmentation, using tags compatible with ictclas.
Usage example:

>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")
>>> for w in words:
...     print w.word, w.flag
...
我 r
爱 v
北京 ns
天安门 ns


Function 5): Parallel segmentation

Principle: split the target text by lines, distribute the lines across multiple Python processes to segment in parallel, then merge the results, obtaining a significant speedup.
Based on Python's multiprocessing module; Windows is currently not supported.
Usage:

jieba.enable_parallel(4)  # enable parallel segmentation; the parameter is the number of parallel processes
jieba.disable_parallel()  # disable parallel segmentation
Example:
import sys, time
sys.path.append("../../")
import jieba
jieba.enable_parallel(4)

url = sys.argv[1]
content = open(url, "rb").read()
t1 = time.time()
words = list(jieba.cut(content))

t2 = time.time()
tm_cost = t2 - t1

log_f = open("1.log", "wb")
for w in words:
    print >> log_f, w.encode("utf-8"), "/",

print 'speed', len(content) / tm_cost, "bytes/second"
Experimental result: on a 4-core 3.4 GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.
Other dictionaries

A dictionary file that uses less memory: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
A dictionary file with better support for Traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need, then either overwrite jieba/dict.txt or call jieba.set_dictionary('data/dict.txt.big')
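
A minimal sketch of switching to a downloaded dictionary at startup (the path 'data/dict.txt.big' is a placeholder for wherever you saved the file):

#encoding=utf-8
import jieba

# Point jieba at the alternative dictionary before the first segmentation
# call; 'data/dict.txt.big' is a placeholder path for this sketch.
jieba.set_dictionary('data/dict.txt.big')

seg_list = jieba.cut("我来到北京清华大学")
print "/".join(seg_list)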
