Reprinted from: http://www.cnblogs.com/jiayongji/p/7119065.html
Jieba ("结巴", literally "stuttering") is a powerful Chinese word segmentation library.
Installing Jieba
pip install jieba
Simple usage
Jieba offers three segmentation modes: precise mode (the default), full mode, and search engine mode. Examples of each follow.
Precise mode
import jieba

s = '我想和女朋友一起去北京故宫博物院参观和闲逛。'
cut = jieba.cut(s)
print('【Output】')
print(cut)
print(','.join(cut))
【Output】
<generator object cut at 0x7f8dbc0efc30>
我,想,和,女朋友,一起,去,北京故宫博物院,参观,和,闲逛,。
Note that the result is a generator, not a list. This matters when segmenting large volumes of text: tokens are produced lazily rather than held in memory all at once.
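One consequence of this is that the generator is single-use: once `','.join(cut)` has consumed it, a second pass yields nothing. A minimal sketch of that behavior, using a hypothetical `fake_cut` as a stand-in for `jieba.cut` (so it runs without jieba installed):

```python
def fake_cut(text):
    # Stand-in for jieba.cut: lazily yields tokens (here, just split on spaces).
    for token in text.split(' '):
        yield token

cut = fake_cut('我 想 去 北京')
print(','.join(cut))  # first pass consumes the generator: 我,想,去,北京
print(','.join(cut))  # second pass prints an empty string: generator exhausted

# Materialize into a list when you need to iterate more than once.
tokens = list(fake_cut('我 想 去 北京'))
```

If you need a list directly, jieba also exposes `jieba.lcut`, which returns one.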
Full mode
print('【Output】')
print(','.join(jieba.cut(s, cut_all=True)))
【Output】
我,想,和,女朋友,朋友,一起,去,北京,北京故宫,北京故宫博物院,故宫,故宫博物院,博物,博物院,参观,和,闲逛,,
Full mode returns every word the dictionary can find in the text, including overlapping candidates.
Search engine mode
print('【Output】')
print(','.join(jieba.cut_for_search(s)))
【Output】
我,想,和,朋友,女朋友,一起,去,北京,故宫,博物,博物院,北京故宫博物院,参观,和,闲逛,。
Getting parts of speech
Every word has a part of speech: noun, verb, pronoun, and so on. Jieba can annotate each token with its part-of-speech tag via the jieba.posseg module. For example:
import jieba.posseg as psg

print('【Output】')
print([(x.word, x.flag) for x in psg.cut(s)])

【Output】
[('我', 'r'), ('想', 'v'), ('和', 'c'), ('女朋友', 'n'), ('一起', 'm'), ('去', 'v'), ('北京故宫博物院', 'ns'), ('参观', 'n'), ('和', 'c'), ('闲逛', 'v'), ('。', 'x')]
As you can see, each word comes back with its part-of-speech tag, which is useful for further processing of the segmentation result. For example, to keep only the nouns, filter on the flag:
print([(x.word, x.flag) for x in psg.cut(s) if x.flag.startswith('n')])
# Output:
# [('女朋友', 'n'), ('北京故宫博物院', 'ns'), ('参观', 'n')]
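Beyond filtering, you can group tokens by their coarse category (the first letter of the flag). A sketch over hard-coded (word, flag) pairs mirroring the output above, so it runs without jieba:

```python
from collections import defaultdict

# (word, flag) pairs as jieba.posseg would return them (hard-coded here).
pairs = [('我', 'r'), ('想', 'v'), ('和', 'c'), ('女朋友', 'n'), ('一起', 'm'),
         ('去', 'v'), ('北京故宫博物院', 'ns'), ('参观', 'n'), ('和', 'c'),
         ('闲逛', 'v'), ('。', 'x')]

by_pos = defaultdict(list)
for word, flag in pairs:
    by_pos[flag[0]].append(word)  # 'ns' and 'n' both land under 'n'

print(by_pos['n'])  # ['女朋友', '北京故宫博物院', '参观']
print(by_pos['v'])  # ['想', '去', '闲逛']
```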
For what each tag means and which tags jieba can produce, consult a part-of-speech reference; a copy found online is appended at the end of this article. For more detail, search for "jieba part-of-speech table" (结巴分词词性对照表).
Parallel segmentation
When the amount of text is very large, you can enable parallel segmentation to improve throughput. Jieba's parallel mode is built on Python's multiprocessing module; note that it is not supported on Windows.
Usage:
# Enable parallel segmentation; the argument is the number of worker processes
jieba.enable_parallel(5)
# Disable parallel segmentation
jieba.disable_parallel()
Example: enable parallel mode and segment the full text of the Three-Body trilogy (《三体》):
santi_text = open('./santi.txt').read()
print(len(santi_text))
2681968
The full trilogy is a sizable corpus: about 2.68 million characters.
jieba.enable_parallel(100)
# Keep only tokens of two or more characters
santi_words = [x for x in jieba.cut(santi_text) if len(x) >= 2]
jieba.disable_parallel()
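Conceptually, jieba's parallel mode splits the input on line breaks, segments the chunks in worker processes, and merges the results in order. The same chunk-and-merge pattern can be sketched with the standard library; here a thread-based pool and a whitespace tokenizer (`fake_cut`, a hypothetical stand-in for `jieba.cut`) keep the sketch self-contained:

```python
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool API

def fake_cut(chunk):
    # Stand-in tokenizer; in jieba's parallel mode each worker runs the real cut.
    return chunk.split()

def parallel_cut(text, workers=4):
    chunks = text.splitlines()  # split on line breaks so no word straddles a boundary
    with Pool(workers) as pool:
        per_chunk = pool.map(fake_cut, chunks)  # results come back in input order
    return [tok for tokens in per_chunk for tok in tokens]

print(parallel_cut('三体 舰队 出发\n地球 文明 危机'))
# ['三体', '舰队', '出发', '地球', '文明', '危机']
```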
Getting the top-N most frequent words
For example, to list the 20 most frequent words in the segmentation result, use collections.Counter:
from collections import Counter

c = Counter(santi_words).most_common(20)
print(c)
# Output:
# [('\r\n', 21805), ('一个', 3057), ('没有', 2128), ('他们', 1690), ('我们', 1550),
#  ('这个', 1357), ('自己', 1347), ('程心', 1320), ('现在', 1273), ('已经', 1259),
#  ('世界', 1243), ('罗辑', 1189), ('可能', 1177), ('什么', 1176), ('看到', 1114),
#  ('知道', 1094), ('地球', 951), ('人类', 935), ('太空', 930), ('三体', 883)]
Notice that the most frequent "word" is actually the newline sequence '\r\n', and the list is full of meaningless high-frequency words such as 一个 ("one"), 没有 ("no"), and 这个 ("this"). These can be filtered out using the part-of-speech tags described earlier; more on that later.
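Even before bringing in part-of-speech tags, a whitespace check plus a small stop-word list already cleans up the counts. A sketch over a small hand-made token list (a stand-in for santi_words):

```python
from collections import Counter

# Hand-made stand-in for santi_words.
words = ['\r\n', '一个', '地球', '没有', '人类', '地球', '\r\n', '三体', '地球', '一个']

stopwords = {'一个', '没有', '这个', '他们', '我们', '自己'}

filtered = [w for w in words
            if w.strip()               # drop pure-whitespace tokens such as '\r\n'
            and w not in stopwords]    # drop meaningless high-frequency words

print(Counter(filtered).most_common(1))  # [('地球', 3)]
```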
Using a user dictionary to improve segmentation accuracy
Segmentation result without a user dictionary:
txt = '欧阳建国是创新办主任也是欢聚时代公司云计算方面的专家'
print(','.join(jieba.cut(txt)))
欧阳,建国,是,创新,办,主任,也,是,欢聚,时代,公司,云,计算,方面,的,专家
Segmentation result with a user dictionary loaded:
jieba.load_userdict('user_dict.txt')
print(','.join(jieba.cut(txt)))
欧阳建国,是,创新办,主任,也,是,欢聚时代,公司,云计算,方面,的,专家
As you can see, the user dictionary improves segmentation accuracy considerably.
Note: the contents of user_dict.txt are as follows:
欧阳建国 5
创新办 5 i
欢聚时代 5
云计算 5
The user dictionary has one word per line, in the form:
word frequency part-of-speech
where the frequency is a number and the part of speech is a custom tag (both are optional). Note that the frequency and the separating spaces must be half-width (ASCII) characters.
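This format is easy to parse and validate programmatically. A sketch that reads the dictionary entries above from a string, tolerating the optional frequency and part-of-speech fields:

```python
user_dict = """\
欧阳建国 5
创新办 5 i
欢聚时代 5
云计算 5
"""

parsed = []
for line in user_dict.splitlines():
    fields = line.split()  # word, then optional frequency, then optional pos tag
    word = fields[0]
    freq = int(fields[1]) if len(fields) > 1 else None
    pos = fields[2] if len(fields) > 2 else None
    parsed.append((word, freq, pos))

print(parsed[1])  # ('创新办', 5, 'i')
```

A check like this before calling jieba.load_userdict can catch malformed lines (e.g. full-width digits or spaces) early.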
Appendix: jieba part-of-speech tag table (in alphabetical order)
Adjectives (1 first-level tag, 4 second-level tags)
a adjective
ad adverbial adjective (adjective used directly as an adverbial)
an nominal adjective (adjective used as a noun)
ag adjective morpheme
al adjectival idiom
Distinguishing words (1 first-level tag, 2 second-level tags)
b distinguishing word
bl distinguishing-word idiom
Conjunctions (1 first-level tag, 1 second-level tag)
c conjunction
cc coordinating conjunction
Adverbs (1 first-level tag)
d adverb
Interjections (1 first-level tag)
e interjection
Locality words (1 first-level tag)
f locality word (noun of locality)
Prefixes (1 first-level tag)
h prefix
Suffixes (1 first-level tag)
k suffix
Numerals (1 first-level tag, 1 second-level tag)
m numeral
mq numeral-classifier compound
Nouns (1 first-level tag, 7 second-level tags, 5 third-level tags)
Nouns are divided into the following subcategories:
n noun
nr personal name
nr1 Chinese surname
nr2 Chinese given name
nrj Japanese personal name
nrf transliterated personal name
ns place name
nsf transliterated place name
nt organization name
nz other proper noun
nl noun idiom
ng noun morpheme
Onomatopoeia (1 first-level tag)
o onomatopoeic word
Prepositions (1 first-level tag, 2 second-level tags)
p preposition
pba the preposition 把 (ba)
pbei the preposition 被 (bei)
Classifiers (1 first-level tag, 2 second-level tags)
q classifier (measure word)
qv verbal classifier
qt temporal classifier
Pronouns (1 first-level tag, 4 second-level tags, 6 third-level tags)
r pronoun
rr personal pronoun
rz demonstrative pronoun
rzt temporal demonstrative pronoun
rzs locative demonstrative pronoun
rzv predicative demonstrative pronoun
ry interrogative pronoun
ryt temporal interrogative pronoun
rys locative interrogative pronoun
ryv predicative interrogative pronoun
rg pronoun morpheme
Locative words (1 first-level tag)
s locative word (place word)
Time words (1 first-level tag, 1 second-level tag)
t time word
tg time-word morpheme
Particles (1 first-level tag, 15 second-level tags)
u particle
uzhe 着
ule 了 / 喽
uguo 过
ude1 的 / 底
ude2 地
ude3 得
usuo 所
udeng 等 / 等等 / 云云
uyy 一样 / 一般 / 似的 / 般
udh 的话
uls 来讲 / 来说 / 而言 / 说来
uzhi 之
ulian 连 (as in 连小学生, "even primary-school students")
Verbs (1 first-level tag, 9 second-level tags)
v verb
vd adverbial verb (verb used directly as an adverbial)
vn nominal verb (verb used as a noun)
vshi the verb 是 ("to be")
vyou the verb 有 ("to have")
vf directional verb
vx formal verb (pro-verb)
vi intransitive verb (inner verb)
vl verb idiom
vg verb morpheme
Punctuation (1 first-level tag, 16 second-level tags)
w punctuation
wkz left bracket, full-width: (〔[{《【〖〈 half-width: ( [ {
wky right bracket, full-width: )〕]}》】〗〉 half-width: ) ] }
wyz left quotation mark, full-width: “ ‘ 『
wyy right quotation mark, full-width: ” ’ 』
wj period, full-width: 。
ww question mark, full-width: ? half-width: ?
wt exclamation mark, full-width: ! half-width: !
wd comma, full-width: , half-width: ,
wf semicolon, full-width: ; half-width: ;
wn enumeration comma (顿号), full-width: 、
wm colon, full-width: : half-width: :
ws ellipsis, full-width: …… …
wp dash, full-width: —— half-width: ----
wb percent and per-mille signs, full-width: %‰ half-width: %
wh unit symbol, full-width: ¥$£°℃ half-width: $
Strings (1 first-level tag, 2 second-level tags)
x string
xx non-morpheme character
xu URL
Modal particles (1 first-level tag)
y modal particle (yg removed)
Status words (1 first-level tag)
z status word
Tags: Python, jieba word segmentation