[Repost] Python jieba (Chinese word segmentation) learning

Source: Internet
Author: User

Original (Chinese word segmentation in Python): http://www.gowhich.com/blog/147

Source code download address: https://github.com/fxsjy/jieba

Demo address: http://jiebademo.ap01.aws.af.cm/

Features:

1. Supports three word segmentation modes:

A. Precise mode: cuts the sentence as accurately as possible; suitable for text analysis.
B. Full mode: scans out all the words in the sentence that can form words; very fast, but cannot resolve ambiguity.
C. Search-engine mode: on top of precise mode, segments long words again to improve recall; suitable for search-engine word segmentation.

2. Supports traditional Chinese word segmentation.
3. Supports custom dictionaries.

Installation:

1. Python 2.x

Fully automated installation: easy_install jieba or pip install jieba
Semi-automatic installation: download http://pypi.python.org/pypi/jieba/ first, decompress it, and run python setup.py install
Manual installation: place the jieba directory in the current directory or in the site-packages directory.
Then reference it with import jieba.

2. Python 3.x

Currently, the master branch only supports Python 2.x.
The Python 3.x branch is also basically usable: https://github.com/fxsjy/jieba/tree/jieba3k

git clone https://github.com/fxsjy/jieba.git
git checkout jieba3k
python setup.py install
Algorithm Implementation:

Efficient word-graph scanning based on a trie structure generates a directed acyclic graph (DAG) of all possible word combinations of the Chinese characters in a sentence.
Dynamic programming finds the maximum-probability path, i.e. the most likely segmentation based on word frequency.
For out-of-vocabulary words, an HMM model based on the character-forming ability of Chinese characters is used, together with the Viterbi algorithm.
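To make the maximum-probability-path idea concrete, here is a toy sketch (not jieba's actual code): build a DAG of dictionary words over the input, then use dynamic programming to pick the segmentation whose words have the highest total frequency. The FREQ table and the use of English tokens in place of Chinese characters are illustrative assumptions.

```python
# Toy sketch of jieba's DAG + dynamic-programming idea (illustrative only).
FREQ = {"I": 3, "love": 4, "Beijing": 5, "Tsing": 1, "hua": 1, "Tsinghua": 6}

def segment(tokens):
    """tokens: list of atomic units (characters, for real Chinese text)."""
    n = len(tokens)
    # best[i] = (score, start) for the best segmentation of tokens[:i]
    best = [(0.0, 0)] + [(float("-inf"), 0)] * n
    for end in range(1, n + 1):
        for start in range(end):
            word = "".join(tokens[start:end])
            if word in FREQ:  # edge in the DAG: tokens[start:end] is a word
                score = best[start][0] + FREQ[word]
                if score > best[end][0]:
                    best[end] = (score, start)
    # walk back along the chosen maximum-probability path
    out, i = [], n
    while i > 0:
        start = best[i][1]
        out.append("".join(tokens[start:i]))
        i = start
    return out[::-1]

print(segment(["I", "love", "Tsing", "hua"]))  # ['I', 'love', 'Tsinghua']
```

Because "Tsinghua" has a higher frequency than "Tsing" plus "hua", the longer word wins, which is exactly the frequency-based disambiguation described above.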

Function 1): Word Segmentation

The jieba.cut method accepts two input parameters: 1) the string to be segmented; 2) the cut_all parameter, which controls whether full mode is used.
The jieba.cut_for_search method accepts one parameter: the string to be segmented. Its segmentation is fine-grained, which suits search engines building an inverted index.
Note: the string to be segmented can be a GBK string, a UTF-8 string, or a unicode string.
The structures returned by jieba.cut and jieba.cut_for_search are iterable generators. You can use a for loop to obtain each word (unicode) produced by the segmentation, or use list(jieba.cut(...)) to convert the result to a list.
Sample Code (Word Segmentation)

# encoding=utf-8
import jieba

seg_list = jieba.cut("I came to Beijing Tsinghua University", cut_all=True)
print "Full Mode:", "/".join(seg_list)  # full mode

seg_list = jieba.cut("I came to Beijing Tsinghua University", cut_all=False)
print "Default Mode:", "/".join(seg_list)  # precise mode

seg_list = jieba.cut("He has come to the NetEase Hangyan building")  # precise mode by default
print ", ".join(seg_list)

seg_list = jieba.cut_for_search("James graduated with a master's degree from the Institute of Computing Technology, Chinese Academy of Sciences, and later studied at Kyoto University in Japan")  # search-engine mode
print ", ".join(seg_list)
Output:
[Full mode]: I / came to / Beijing / Tsinghua University / Tsinghua / Huada / University
[Precise mode]: I / came to / Beijing / Tsinghua University
[New word recognition]: He, has come to, the, NetEase, Hangyan, building (here "Hangyan" is not in the dictionary, but it is still recognized by the Viterbi algorithm)
[Search-engine mode]: James, master's, graduated from, China, Science, Academy, Chinese Academy of Sciences, Institute of Computing, later, in, Japan, Kyoto, University, Kyoto University Japan, further study

Function 2): adding a custom dictionary

Developers can specify their own custom dictionaries to include words that are not in the jieba dictionary. Although jieba has new-word recognition, adding words yourself ensures higher accuracy.
Usage:

jieba.load_userdict(file_name)  # file_name is the path of the custom dictionary
Like dict.txt, each word occupies one line. Each line has three parts separated by spaces: the word itself, the word frequency, and finally the part-of-speech tag (which can be omitted).
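A minimal sketch of parsing that line format, assuming exactly the three space-separated fields described above (parse_userdict_line is a hypothetical helper for illustration, not part of jieba's API):

```python
# Hypothetical parser for the custom-dictionary line format:
# "word frequency [pos-tag]", fields separated by spaces.
def parse_userdict_line(line):
    parts = line.strip().split()
    word = parts[0]
    freq = int(parts[1]) if len(parts) > 1 else None
    tag = parts[2] if len(parts) > 2 else None  # the tag may be omitted
    return word, freq, tag

print(parse_userdict_line("easy_install 3 eng"))  # ('easy_install', 3, 'eng')
print(parse_userdict_line("cloudcomputing 5"))    # ('cloudcomputing', 5, None)
```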
Example:
Custom dictionary:
Cloud computing 5
Li Xiaofu 2 nr
Innovation Office 3 i
easy_install 3 eng
easy to use 300
Han Yu rewards 3 nz
Usage example:
# encoding=utf-8
import sys
sys.path.append("../")
import jieba
jieba.load_userdict("userdict.txt")
import jieba.posseg as pseg

test_sent = "Li Xiaofu is the director of the Innovation Office and also an expert in cloud computing; "
test_sent += "for example, if I enter a title containing 'Han Yu rewards', this word is also added to the custom dictionary as type N"

words = jieba.cut(test_sent)
for w in words:
    print w

result = pseg.cut(test_sent)
for w in result:
    print w.word, "/", w.flag, ",",
print "\n========"

terms = jieba.cut('easy_install is great')
for t in terms:
    print t
print '-----------------------'
terms = jieba.cut('the python regular expression is easy to use')
for t in terms:
    print t
Before: Li Xiaofu / is / innovation / office / director / also / is / cloud / computing / aspect / expert /
After the custom dictionary is loaded: Li Xiaofu / is / Innovation Office / director / also / is / cloud computing / aspect / expert /
"Using custom dictionaries to enhance ambiguity-correction ability" --- https://github.com/fxsjy/jieba/issues/14

Function 3): keyword extraction

jieba.analyse.extract_tags(sentence, topK)  # import jieba.analyse first

Description

sentence is the text to extract keywords from.

topK is the number of keywords with the largest TF-IDF weights to return. The default value is 20.
Code example (keyword extraction)

import sys
sys.path.append('../')
import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage: python extract_tags.py [file name] -k [top k]"
parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
opt, args = parser.parse_args()

if len(args) < 1:
    print USAGE
    sys.exit(1)

file_name = args[0]
if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

content = open(file_name, 'rb').read()
tags = jieba.analyse.extract_tags(content, topK=topK)
print ",".join(tags)
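To make the TF-IDF ranking behind extract_tags concrete, here is a toy, self-contained sketch in pure Python. Note the real jieba.analyse uses a pre-built IDF table computed from a large corpus; computing IDF from the input documents, and the tiny corpus below, are assumptions for illustration.

```python
# Toy TF-IDF keyword ranking (illustrative; not jieba.analyse's implementation).
import math
from collections import Counter

def extract_tags(words, corpus, topK=3):
    """words: tokens of one document; corpus: list of token lists (for IDF)."""
    tf = Counter(words)
    n_docs = len(corpus)
    def idf(w):
        df = sum(1 for doc in corpus if w in doc)   # document frequency
        return math.log((n_docs + 1) / (df + 1))    # smoothed IDF
    weights = {w: (c / len(words)) * idf(w) for w, c in tf.items()}
    return sorted(weights, key=weights.get, reverse=True)[:topK]

docs = [["python", "word", "segmentation"],
        ["python", "keyword", "extraction"],
        ["keyword", "weight", "ranking"]]
print(extract_tags(docs[1], docs, topK=2))
```

Words that appear in fewer documents get a larger IDF, so "extraction" outranks "python" and "keyword" here, which appear in two documents each.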
Function 4): part-of-speech tagging

Labels each word after segmentation with its part of speech, using tags compatible with ICTCLAS.
Usage example

>>> import jieba.posseg as pseg
>>> words = pseg.cut("I love Tiananmen in Beijing")
>>> for w in words:
...     print w.word, w.flag
...
I r
love v
Beijing ns
Tiananmen ns
Function 5): Parallel Word Segmentation

Principle: the target text is split by lines, each line of text is assigned to one of several Python processes for parallel segmentation, and the results are then merged, yielding a considerable speedup.
Based on Python's multiprocessing module; Windows is not currently supported.
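The split-by-lines, segment-in-parallel, merge idea can be sketched without jieba at all. Here a dummy whitespace tokenizer stands in for list(jieba.cut(line)); everything except multiprocessing.Pool is an illustrative assumption.

```python
# Sketch of the parallel principle: split text into lines, tokenize each
# line in a process pool, then merge the per-line results in order.
from multiprocessing import Pool

def tokenize_line(line):
    return line.split()  # stand-in for list(jieba.cut(line))

if __name__ == "__main__":
    text = "parallel word segmentation\nsplits text by rows\nand merges results"
    with Pool(4) as pool:
        per_line = pool.map(tokenize_line, text.split("\n"))  # order preserved
    words = [w for line in per_line for w in line]            # merge step
    print(words)
```

Pool.map returns results in input order, so the merged word list matches what a single process would produce, only faster on multi-line input.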
Usage:

jieba.enable_parallel(4)  # enable parallel segmentation mode; the parameter is the number of parallel processes
jieba.disable_parallel()  # disable parallel segmentation mode
Example:
import urllib2
import sys, time
sys.path.append("../../")
import jieba

jieba.enable_parallel(4)

url = sys.argv[1]
content = open(url, "rb").read()
t1 = time.time()
words = list(jieba.cut(content))
t2 = time.time()
tm_cost = t2 - t1

log_f = open("1.log", "wb")
for w in words:
    print >> log_f, w.encode("utf-8"), "/",
print 'speed', len(content) / tm_cost, " bytes/second"
Experimental result: on a 4-core 3.4 GHz Linux machine, precise segmentation of the complete works of Jin Yong reached a speed of 1 MB/s, 3.3 times that of the single-process version.

Other dictionaries

A dictionary file with a smaller memory footprint: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
A dictionary file with better support for traditional Chinese segmentation: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
Download the dictionary you need and overwrite jieba/dict.txt, or use jieba.set_dictionary('data/dict.txt.big')

Module initialization mechanism change: lazy loading (since version 0.28)

jieba uses lazy loading: "import jieba" does not trigger dictionary loading immediately; the dictionary is loaded and the trie built only when necessary. If you want to initialize jieba manually, you can do so:

import jieba
jieba.initialize()  # manual initialization (optional)
In versions earlier than 0.28, the path of the main dictionary could not be specified. With the lazy-loading mechanism, you can now change the path of the main dictionary:
jieba.set_dictionary('data/dict.txt.big')
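The reason set_dictionary works before first use is the generic lazy-initialization pattern: the expensive resource is built on first real use, not at import time, so configuration can still change beforehand. A minimal sketch of that pattern (LazyDict is an illustrative class, not jieba's internals):

```python
# Generic lazy-loading pattern, as used by jieba for its dictionary trie.
class LazyDict:
    def __init__(self, path):
        self.path = path
        self._data = None          # nothing loaded yet

    def set_dictionary(self, path):
        self.path = path           # cheap: nothing has been loaded
        self._data = None

    def initialize(self):          # optional manual initialization
        if self._data is None:
            self._data = {"loaded_from": self.path}  # stand-in for trie building
        return self._data

    def lookup(self, word):
        data = self.initialize()   # first real use triggers loading
        return word in data

d = LazyDict("data/dict.txt")
d.set_dictionary("data/dict.txt.big")  # allowed: loading hasn't happened
d.lookup("loaded_from")                # this call finally builds the data
print(d._data)                         # {'loaded_from': 'data/dict.txt.big'}
```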
Example:
# encoding=utf-8
import sys
sys.path.append("../")
import jieba

def cuttest(test_sent):
    result = jieba.cut(test_sent)
    print " ".join(result)

def testcase():
    cuttest("This is a pitch-dark night. My name is Sun Wukong. I love Beijing. I love Python and C++.")
    cuttest("I don't like Japanese kimonos.")
    cuttest("The Thunder Monkey returns to the human world.")
    cuttest("Yonghe Clothing & Accessories Co., Ltd.")
    cuttest("I love Tiananmen in Beijing")
    cuttest("abc")
    cuttest("Hidden Markov")
    cuttest("Thunder Monkey is a good website")

if __name__ == "__main__":
    testcase()
    jieba.set_dictionary("foobar.txt")
    print "================================"
    testcase()

 
