[Repost] Python jieba word segmentation learning
Original: http://www.gowhich.com/blog/147  Topic: Chinese word segmentation, Python
Source code download address: https://github.com/fxsjy/jieba
Demo address: http://jiebademo.ap01.aws.af.cm/
Features:
1. Supports three word segmentation modes:
A. Precise mode: tries to cut the sentence into the most accurate segmentation; suitable for text analysis;
B. Full mode: scans out all the words in the sentence that can form dictionary words; very fast, but it cannot resolve ambiguity;
C. Search engine mode: based on the precise mode, long words are segmented again to improve recall; suitable for search engine word segmentation.
2. Supports traditional Chinese word segmentation
3. Supports custom dictionaries
Installation
1. Python 2.x
Fully automatic installation: easy_install jieba or pip install jieba
Semi-automatic installation: first download http://pypi.python.org/pypi/jieba/, decompress it, and run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory
Then reference it with import jieba
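After installation, a quick smoke test can be run (a minimal sketch; the sample sentence is just the usual demo sentence):

import jieba
# Should print the segmented words joined by "/".
print("/".join(jieba.cut(u"我来到北京清华大学")))   # "I came to Beijing Tsinghua University"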
2. Python 3.x
Currently, the master branch only supports Python 2.x.
The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k
git clone https://github.com/fxsjy/jieba.git
git checkout jieba3k
python setup.py install
Algorithm Implementation:
Efficient word-graph scanning based on a trie structure generates a directed acyclic graph (DAG) of all the possible word combinations formed by the Chinese characters in a sentence.
Dynamic programming is used to find the maximum-probability path, i.e. the segmentation with the largest combined word frequency.
For unregistered words (words not in the dictionary), an HMM model based on the ability of Chinese characters to form words is used, solved with the Viterbi algorithm.
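To make the word-graph and dynamic-programming steps concrete, here is a toy sketch (not jieba's actual code; the mini dictionary and its log-probabilities are invented for illustration) of picking the maximum-probability path through a word DAG:

# Toy illustration of the DAG + dynamic-programming step (not jieba's real implementation).
import math

def best_segmentation(sentence, word_logprob):
    n = len(sentence)
    # Build the DAG: dag[i] lists every end index j such that sentence[i:j] is a dictionary word;
    # a single character is always allowed as a fallback.
    dag = {}
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if sentence[i:j] in word_logprob]
        dag[i] = ends or [i + 1]
    # Dynamic programming from right to left: best[i] = (score, end) is the highest total
    # log-probability achievable for segmenting sentence[i:].
    unknown_lp = min(word_logprob.values()) - 1.0   # penalty for out-of-vocabulary single characters
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max((word_logprob.get(sentence[i:j], unknown_lp) + best[j][0], j)
                      for j in dag[i])
    # Follow the chosen path to recover the segmentation.
    words, i = [], 0
    while i < n:
        words.append(sentence[i:best[i][1]])
        i = best[i][1]
    return words

# Hypothetical toy dictionary (log-probabilities are made-up values).
toy_dict = {u"清华大学": math.log(0.02), u"清华": math.log(0.01), u"华大": math.log(0.005),
            u"大学": math.log(0.01), u"北京": math.log(0.03)}
print("/".join(best_segmentation(u"北京清华大学", toy_dict)))   # -> 北京/清华大学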
Function 1): Word Segmentation
The jieba.cut method accepts two input parameters: 1) the first parameter is the string to be segmented; 2) the cut_all parameter controls whether full mode is used.
The jieba.cut_for_search method accepts one parameter: the string to be segmented. This method is suitable for search engines that build an inverted index and need finer-grained segmentation.
Note: the string to be segmented can be a GBK string, a UTF-8 string, or a Unicode string.
The structure returned by jieba.cut and jieba.cut_for_search is an iterable generator. You can use a for loop to obtain each word (as Unicode) produced by the segmentation, or use list(jieba.cut(...)) to convert it to a list.
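For instance (a minimal sketch; the sentence is arbitrary), the returned generator is consumed as you iterate it, while list() materializes the whole result at once:

import jieba

seg = jieba.cut(u"我爱北京天安门")          # a generator; nothing is segmented until it is iterated
print("/".join(seg))                        # iterating here consumes the generator
words = list(jieba.cut(u"我爱北京天安门"))  # convert to a list when the result needs to be reused
print(len(words))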
Sample Code (Word Segmentation)
# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)    # "I came to Beijing Tsinghua University"
print "Full Mode:", "/".join(seg_list)    # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print "Default Mode:", "/".join(seg_list)  # precise mode

seg_list = jieba.cut("他来到了网易杭研大厦")   # "He came to the NetEase Hangyan Building"; the default is precise mode
print ", ".join(seg_list)

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print ", ".join(seg_list)

Output:
[Full mode]: I / came to / Beijing / Tsinghua University / Huada / University
[Precise mode]: I / came to / Beijing / Tsinghua University
[New word recognition]: He, came to, NetEase, Hangyan, Building (here, "Hangyan" is not in the dictionary, but it is still recognized by the Viterbi algorithm)
[Search engine mode]: Xiao Ming, master, graduated, from, China, science, academy, Chinese Academy of Sciences, Institute of Computing, later, in, Japan, Kyoto, University, Kyoto University in Japan, further study

Function 2): Adding a Custom Dictionary
Developers can specify their own custom dictionary to include words that are not in the jieba dictionary. Although jieba can recognize new words on its own, adding them yourself ensures a higher accuracy.
Usage:
jieba.load_userdict(file_name)  # file_name is the path of the custom dictionary
The dictionary format is the same as that of dict.txt: one word per line; each line has three parts separated by spaces: the word, the word frequency, and the part of speech (the last part can be omitted).
Example:
Custom dictionary:
云计算 5
李小福 2 nr
创新办 3 i
easy_install 3 eng
好用 300
韩玉赏鉴 3 nz

(That is: "cloud computing", "Li Xiaofu" [a person name, nr], "Innovation Office", "easy_install", "easy to use", and the proper noun "Han Yu rewards".)
Usage example:
# encoding=utf-8
import sys
sys.path.append("../")
import jieba
jieba.load_userdict("userdict.txt")
import jieba.posseg as pseg

# "Li Xiaofu is the director of the Innovation Office and also an expert in cloud computing;"
test_sent = "李小福是创新办主任也是云计算方面的专家;"
# "for example, I enter a title containing '韩玉赏鉴' ('Han Yu rewards'), which has also been added to the custom dictionary as an N-type word"
test_sent += "例如我输入一个带“韩玉赏鉴”的标题，在自定义词库中也增加了此词为N类型"

words = jieba.cut(test_sent)
for w in words:
    print w

result = pseg.cut(test_sent)
for w in result:
    print w.word, "/", w.flag, ", ",

print "\n======"

terms = jieba.cut('easy_install is great')
for t in terms:
    print t
print '-----------------------'
terms = jieba.cut('python 的正则表达式是好用的')   # "Python's regular expressions are easy to use"
for t in terms:
    print t

Before loading the custom dictionary: Li Xiaofu / is / innovation / office / director / also / is / cloud / computing / aspect / expert /
After loading the custom dictionary: Li Xiaofu / is / Innovation Office / director / also / is / cloud computing / aspect / expert /
"Using custom dictionaries to enhance ambiguity correction capabilities" --- https://github.com/fxsjy/jieba/issues/14function 3): keyword extraction
jieba.analyse.extract_tags(sentence, topK)  # import jieba.analyse first
Description
sentence is the text from which keywords are extracted
topK is the number of keywords with the highest TF-IDF weights to return. The default value is 20
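As a quick sketch before the full command-line script below (the sample text is arbitrary), extract_tags can be called directly on any string:

import jieba.analyse

text = u"我来到北京清华大学，清华大学是著名的大学"   # arbitrary sample text
tags = jieba.analyse.extract_tags(text, topK=5)       # the 5 keywords with the highest TF-IDF weight
print(",".join(tags))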
Code example (keyword extraction)
import sys
sys.path.append('../')

import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage: python extract_tags.py [file name] -k [top k]"

parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()

if len(args) < 1:
    print USAGE
    sys.exit(1)

file_name = args[0]

if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

content = open(file_name, 'rb').read()

tags = jieba.analyse.extract_tags(content, topK=topK)

print ",".join(tags)

Function 4): Part-of-Speech Tagging
Labels the part of speech of each word after sentence segmentation, using a tagging method compatible with ICTCLAS.
Usage example
>>> import jieba.posseg as pseg
>>> words = pseg.cut("我爱北京天安门")   # "I love Beijing Tiananmen"
>>> for w in words:
...     print w.word, w.flag
...
我 r
爱 v
北京 ns
天安门 ns

Function 5): Parallel Word Segmentation
Principle: the target text is split into lines, each chunk of lines is assigned to one of several Python processes for parallel segmentation, and the results are then merged, which yields a considerable speedup.
It is based on Python's multiprocessing module and currently does not support Windows.
Usage:
jieba.enable_parallel(4)   # enable parallel segmentation mode; the parameter is the number of parallel processes
jieba.disable_parallel()   # disable parallel segmentation mode
Example:
import urllib2
import sys, time
sys.path.append("../../")
import jieba

jieba.enable_parallel(4)

url = sys.argv[1]
content = open(url, "rb").read()
t1 = time.time()
words = list(jieba.cut(content))
t2 = time.time()
tm_cost = t2 - t1

log_f = open("1.log", "wb")
for w in words:
    print >> log_f, w.encode("utf-8"), "/",

print 'speed', len(content) / tm_cost, " bytes/second"

Experimental result: on a 4-core 3.4 GHz Linux machine, precise segmentation of the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.

Other Dictionaries
A dictionary file with lower memory usage: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
A dictionary file with better support for traditional Chinese word segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need, then either overwrite jieba/dict.txt or use jieba.set_dictionary('data/dict.txt.big')
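For example, a minimal sketch of switching to the larger dictionary at run time (assuming dict.txt.big has been downloaded into a local data/ directory):

import jieba

jieba.set_dictionary("data/dict.txt.big")     # hypothetical local path to the downloaded dictionary
# The new dictionary is actually loaded lazily, on the first call that needs it (see below).
print("/".join(jieba.cut(u"我爱北京天安门")))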
Module initialization Mechanism Change: lazy load (starting from 0.28)
jieba uses lazy loading: "import jieba" does not trigger dictionary loading immediately; the dictionary is loaded and the trie is built only when it is first needed. If you want to initialize jieba manually, you can also do it explicitly:
import jieba
jieba.initialize()   # manual initialization (optional)
In versions earlier than 0.28, the path of the main dictionary could not be specified. Thanks to the lazy-loading mechanism, it is now possible to change the path of the main dictionary:
jieba.set_dictionary('data/dict.txt.big')

Example:
# encoding=utf-8
import sys
sys.path.append("../")
import jieba

def cuttest(test_sent):
    result = jieba.cut(test_sent)
    print " ".join(result)

def testcase():
    cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空，我爱北京，我爱Python和C++。")  # "This is a pitch-black night. My name is Sun Wukong. I love Beijing. I love Python and C++."
    cuttest("我不喜欢日本和服。")        # "I don't like Japanese kimono."
    cuttest("雷猴回归人间。")            # "The Thunder Monkey returns to the human world."
    cuttest("永和服装饰品有限公司")      # "Yonghe Clothing & Accessories Co., Ltd."
    cuttest("我爱北京天安门")            # "I love Beijing Tiananmen"
    cuttest("abc")
    cuttest("隐马尔可夫")                # "hidden Markov"
    cuttest("雷猴是个好网站")            # "Ray Monkey is a good website"

if __name__ == "__main__":
    testcase()
    jieba.set_dictionary("foobar.txt")
    print "================================"
    testcase()