Reposted from: http://www.gowhich.com/blog/147
Source download: https://github.com/fxsjy/jieba
Demo Address: http://jiebademo.ap01.aws.af.cm/
Features
1. Supports three segmentation modes:
a. Accurate mode: tries to segment the sentence as precisely as possible; suitable for text analysis.
b. Full mode: scans out every word in the sentence that can form a word; very fast, but cannot resolve ambiguity.
c. Search-engine mode: based on accurate mode, long words are segmented again to improve recall; suitable for building a search-engine index.
2. Supports traditional Chinese segmentation.
3. Supports custom dictionaries.
Installation
1. Python 2.x
Fully automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ first, then unpack and run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory
Then reference the package with import jieba
2. Python 3.x
The master branch currently supports only Python 2.x.
A Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k
git clone https://github.com/fxsjy/jieba.git
git checkout jieba3k
python setup.py install
Algorithm implementation:
Efficient word-graph scanning based on a trie structure generates a directed acyclic graph (DAG) of all the possible words that the Chinese characters in a sentence can form.
Dynamic programming is used to find the maximum-probability path, i.e. the segmentation with the greatest word-frequency-based probability (a toy sketch of these two steps follows below).
For out-of-vocabulary words, an HMM model based on the word-forming capability of Chinese characters is used, with the Viterbi algorithm for decoding.
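The following is a toy sketch of the first two steps only, not jieba's actual code: the miniature dictionary, its frequencies, and the input "abc" are all made up, candidate words are collected into a DAG by plain dictionary lookup instead of a real trie, and right-to-left dynamic programming picks the maximum-probability path. The HMM/Viterbi step for out-of-vocabulary words is omitted.

#encoding=utf-8
# Toy sketch of DAG construction + dynamic programming (hypothetical frequencies).
import math

FREQ = {"a": 50, "ab": 120, "b": 30, "bc": 80, "c": 40}   # made-up word frequencies
TOTAL = float(sum(FREQ.values()))

def get_dag(sentence):
    # For each start index i, record every end index j such that sentence[i:j]
    # is in the dictionary (falling back to the single character), giving a DAG.
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1) if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]
    return dag

def best_segmentation(sentence):
    # route[i] = (best log-probability of the suffix starting at i, chosen end index)
    dag = get_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max((math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + route[j][0], j)
                       for j in dag[i])
    # walk the chosen path to recover the words
    words, i = [], 0
    while i < n:
        words.append(sentence[i:route[i][1]])
        i = route[i][1]
    return words

print best_segmentation("abc")   # -> ['ab', 'c'] under these toy frequencies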
Function 1): Word segmentation
The jieba.cut method accepts two parameters: 1) the string to be segmented; 2) cut_all, which controls whether full mode is used.
The jieba.cut_for_search method accepts one parameter, the string to be segmented. It is suitable for segmentation when building an inverted index for a search engine, and its granularity is finer.
Note: the string to be segmented can be a GBK string, a UTF-8 string, or unicode.
jieba.cut and jieba.cut_for_search return a generator; you can iterate over it to get each word (unicode) produced by segmentation, or use list(jieba.cut(...)) to convert the result to a list.
Code example (word segmentation):
#encoding=utf-8
import jieba

seg_list = jieba.cut("I came to Tsinghua University in Beijing", cut_all=True)
print "Full mode:", "/".join(seg_list)   # full mode

seg_list = jieba.cut("I came to Tsinghua University in Beijing", cut_all=False)
print "Default mode:", "/".join(seg_list)   # accurate mode

seg_list = jieba.cut("He came to the NetEase Hangzhou Research building")   # accurate mode is the default
print ", ".join(seg_list)

seg_list = jieba.cut_for_search("Xiao Ming graduated with a master's degree from the Institute of Computing of the Chinese Academy of Sciences, and later studied at Kyoto University in Japan")   # search engine mode
print ", ".join(seg_list)
Output:
"Full mode": I/Come/BEIJING/Tsinghua/Tsinghua/Huada/University
"Precise mode": I/Come/Beijing/Tsinghua University
"New word recognition": He, came,, NetEase, Hang, building (here, "hang research" is not in the dictionary, but also by the Viterbi algorithm identified)
"Search engine mode": Xiao Ming, MA, graduated from, China, Science, College, Academy of Sciences, Chinese Academy of Sciences, calculation, calculation, after, in, Japan, Kyoto, University, Kyoto University, Japan, Advanced education
Function 2): Add a custom dictionary
Developers can specify their own custom dictionary to include words that are not in the jieba lexicon. Although jieba can recognize new words on its own, adding them yourself guarantees higher accuracy.
Usage:
jieba.load_userdict(file_name)   # file_name is the path to the custom dictionary file
The dictionary format is the same as that of dict.txt: one word per line; each line has three parts separated by spaces: the word, the word frequency, and finally the part of speech (which may be omitted).
Example:
Custom dictionary:
Cloud computing 5
Li Xiaofu 2 nr
Innovation Office 3 i
easy_install 3 eng
useful 300
Han Yu moment opportune 3 nz
Usage examples:
#encoding=utf-8
import sys
sys.path.append("../")
import jieba
jieba.load_userdict("userdict.txt")
import jieba.posseg as pseg

test_sent = "Li Xiaofu is the director of the Innovation Office and also an expert in cloud computing; "
test_sent += "for example, I entered a title containing 'Han Yu moment opportune', and this word has also been added to the custom dictionary as type nz"

words = jieba.cut(test_sent)
for w in words:
    print w

result = pseg.cut(test_sent)
for w in result:
    print w.word, "/", w.flag, ", ",
print "\n========"

terms = jieba.cut('easy_install is great')
for t in terms:
    print t
print '-------------------------'
terms = jieba.cut("python's regular expressions are useful")
for t in terms:
    print t
Before: Li Xiaofu / is / innovation / office / director / also / is / cloud / computing / aspect / of / expert /
After loading the custom dictionary: Li Xiaofu / is / Innovation Office / director / also / is / cloud computing / aspect / of / expert /
"Enhanced ambiguity correction capability through user-defined dictionaries"---HTTPS://GITHUB.COM/FXSJY/JIEBA/ISSUES/14
Function 3): Keyword extraction
jieba.analyse.extract_tags(sentence, topK)   # requires import jieba.analyse first
Description:
sentence is the text from which keywords are extracted
topK is the number of keywords with the highest TF/IDF weights to return; the default value is 20
Code example (keyword extraction):
import sys
sys.path.append('../')

import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage: python extract_tags.py [file name] -k [top k]"

parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()

if len(args) < 1:
    print USAGE
    sys.exit(1)

file_name = args[0]

if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

content = open(file_name, 'rb').read()

tags = jieba.analyse.extract_tags(content, topK=topK)

print ",".join(tags)
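Outside of a command-line script, extract_tags can also be called directly. A minimal sketch (the input file name here is made up):

#encoding=utf-8
import jieba.analyse

content = open("some_article.txt", "rb").read()   # hypothetical input file
for tag in jieba.analyse.extract_tags(content, topK=5):
    print tag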
Function 4): POS tagging
Labels the part of speech of each word after segmentation, using a tag set compatible with ictclas.
Usage examples
>>> import jieba.posseg as pseg
>>> words = pseg.cut("I love Beijing Tiananmen")
>>> for w in words:
...     print w.word, w.flag
...
I r
love v
Beijing ns
Tiananmen ns
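As a small follow-on sketch (not from the original post), the items returned by pseg.cut expose .word and .flag, so the results can, for example, be grouped by part-of-speech tag:

#encoding=utf-8
from collections import defaultdict
import jieba.posseg as pseg

# group the segmented words by their POS flag
by_flag = defaultdict(list)
for w in pseg.cut("I love Beijing Tiananmen"):
    by_flag[w.flag].append(w.word)

for flag in by_flag:
    print flag, ":", "/".join(by_flag[flag])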
Function 5): Parallel segmentation
Principle: split the target text by lines, distribute the lines to multiple Python processes for parallel segmentation, then merge the results, which yields a significant speed-up in segmentation.
Based on Python's multiprocessing module; Windows is currently not supported.
Usage:
jieba.enable_parallel(4)   # enable parallel segmentation; the argument is the number of parallel processes
jieba.disable_parallel()   # disable parallel segmentation
Example:
import urllib2
import sys, time
sys.path.append("../../")
import jieba

jieba.enable_parallel(4)

url = sys.argv[1]
content = open(url, "rb").read()

t1 = time.time()
words = list(jieba.cut(content))
t2 = time.time()
tm_cost = t2 - t1

log_f = open("1.log", "wb")
for w in words:
    print >> log_f, w.encode("utf-8"), "/",

print 'speed', len(content) / tm_cost, "bytes/second"
Experimental result: on a 4-core 3.4GHz Linux machine, accurate-mode segmentation of the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.
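A minimal timing sketch for making the same comparison on your own data (the corpus file name is hypothetical, and the measured speed-up will depend on the machine and the text):

import time
import jieba

content = open("corpus.txt", "rb").read()   # hypothetical corpus file

# single-process segmentation
t1 = time.time()
list(jieba.cut(content))
single_cost = time.time() - t1

# parallel segmentation with four worker processes
jieba.enable_parallel(4)
t1 = time.time()
list(jieba.cut(content))
parallel_cost = time.time() - t1
jieba.disable_parallel()

print "single: %.1fs, parallel: %.1fs, speedup: %.1fx" % (single_cost, parallel_cost, single_cost / parallel_cost)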
Other dictionaries
A dictionary file that uses less memory: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
A dictionary file with better support for traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need, then either overwrite jieba/dict.txt or use jieba.set_dictionary('data/dict.txt.big')
Change to the module initialization mechanism: lazy loading (from version 0.28 onwards)
jieba uses lazy loading: "import jieba" does not immediately load the dictionary; the dictionary is loaded and the trie is built only once it is needed. If you prefer, you can also initialize jieba manually:
import jieba
jieba.initialize()   # manual initialization (optional)
Before version 0.28 the path of the main dictionary could not be specified; with the lazy-loading mechanism it is now possible to change it:
jieba.set_dictionary('data/dict.txt.big')
Example:
#encoding=utf-8
import sys
sys.path.append("../")
import jieba

def cuttest(test_sent):
    result = jieba.cut(test_sent)
    print " ".join(result)

def testcase():
    cuttest("It's a pitch-black night. My name is Monkey King, I love Beijing, and I love Python and C++.")
    cuttest("I don't like Japanese kimonos.")
    cuttest("Thunder Monkey returns to Earth.")
    cuttest("Every month, the officer of the Ministry of Industry and Information office has to personally explain the installation of technical devices such as 24-port switches when passing through subordinate departments")
    cuttest("I need low-rent housing")
    cuttest("Yonghe Clothing and Jewelry Co., Ltd.")
    cuttest("I love Beijing Tiananmen")
    cuttest("abc")
    cuttest("Hidden Markov")
    cuttest("Thunder Monkey is a good site")

if __name__ == "__main__":
    testcase()
    jieba.set_dictionary("foobar.txt")
    print "================================"
    testcase()
jieba ("stutter"): Chinese word segmentation