Basic usage of the Python jieba word segmentation module
Jieba is a powerful Python library for Chinese word segmentation. This article briefly summarizes its basic usage.
Features
- Three word segmentation modes are supported:
  - Accurate mode, which is suitable for text analysis;
  - Full mode, which scans out every fragment of the sentence that can form a word; it is very fast, but it cannot resolve ambiguity;
  - Search engine mode, which further splits long words on top of accurate mode to improve recall; it is suitable for search engine segmentation.
- Supports traditional Chinese word segmentation
- Supports custom dictionaries
- MIT license
Install jieba
pip install jieba
Simple usage
Jieba offers three segmentation modes: accurate mode (the default), full mode, and search engine mode. Examples of each follow.
Accurate Mode
import jieba

s = u'I want to visit and stroll with my girlfriend at the Palace Museum in Beijing.'
cut = jieba.cut(s)
print '【Output】'
print cut
print ','.join(cut)

【Output】
<generator object cut at 0x7f8dbc0efc30>
I,want,and,my girlfriend,together,go,the Palace Museum in Beijing,visit,and,stroll,.
As you can see, jieba.cut returns a generator rather than a list, which matters when segmenting large amounts of text because the words are produced lazily.
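If a plain list is more convenient than a generator, jieba also provides jieba.lcut (and jieba.lcut_for_search), which return lists directly. A minimal sketch contrasting the two, reusing the sentence s from above:

import jieba

s = u'I want to visit and stroll with my girlfriend at the Palace Museum in Beijing.'

# Lazy iteration over the generator: words are produced one at a time,
# so a very large text never has to be materialized as a list.
for word in jieba.cut(s):
    print word

# Eager variant: jieba.lcut returns the whole result as an ordinary list.
words = jieba.lcut(s)
print len(words), words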
Full Mode
print '【Output】'
print ','.join(jieba.cut(s, cut_all=True))

【Output】
I,want,and,my girlfriend,friend,together,go,Beijing,the Palace Museum in Beijing,the Palace Museum,the Palace Museum,museum,visit,and,stroll,,
Full mode splits the sentence into every possible word it can find, including overlapping ones, which is why some fragments appear more than once.
Search engine Mode
print '【Output】'
print ','.join(jieba.cut_for_search(s))

【Output】
I,want,and,friend,my girlfriend,together,go,Beijing,the Palace Museum,museum,the Palace Museum in Beijing,visit,and,stroll,.
Get the part of speech
Each word has a part of speech, such as noun, verb, or pronoun, and the result of jieba segmentation can also include the part of speech of each word. The jieba.posseg module is used for this:
import jieba.posseg as psg

print '【Output】'
print [(x.word, x.flag) for x in psg.cut(s)]
# Output:
'''
[(u'I', u'r'), (u'want', u'v'), (u'and', u'c'), (u'girlfriend', u'n'), (u'together', u'm'), (u'go', u'v'), (u'the Palace Museum in Beijing', u'ns'), (u'visit', u'n'), (u'and', u'c'), (u'stroll', u'v'), (u'.', u'x')]
'''
We can see that the part of speech of each word is successfully obtained, which is helpful for further processing of the segmentation result. For example, to keep only the nouns in the result, you can filter as follows:
print [(x.word, x.flag) for x in psg.cut(s) if x.flag.startswith('n')]
# Output:
'''
[(u'girlfriend', u'n'), (u'the Palace Museum in Beijing', u'ns'), (u'visit', u'n')]
'''
As for what part of speech each tag letter stands for, you need to refer to a part-of-speech comparison table. One such table found on the internet is provided at the end of this article; to learn more about the classification, search online for "jieba word segmentation part-of-speech table".
Parallel Word Segmentation
When the amount of text is very large, enabling parallel word segmentation can improve efficiency. Jieba supports parallel segmentation based on Python's multiprocessing module; note, however, that it is not supported on Windows.
Usage:
# Enable parallel segmentation; the argument is the number of worker processes
jieba.enable_parallel(5)

# Disable parallel segmentation
jieba.disable_parallel()
Example: enable parallel mode and segment the full text of The Three-Body Problem trilogy.
santi_text = open('./santi.txt').read()
print len(santi_text)
2681968
As you can see, the full text of the trilogy is quite large: more than 2.6 million bytes long.
jieba.enable_parallel(100)
santi_words = [x for x in jieba.cut(santi_text) if len(x) >= 2]
jieba.disable_parallel()
Get the top N words
Taking the Three-Body full text above as an example, you can get the 20 most frequent words in the segmentation result as follows:
from collections import Counter

c = Counter(santi_words).most_common(20)
print c
# Output:
'''
[(u'\r\n', 21805), (u'one', 3057), (u'no', 2128), (u'they', 1690), (u'we', 1550), (u'this', 1357), (u'yourself', 1347), (u'Cheng Xin', 1320), (u'', 1273), (u'already', 1259), (u'world', 1243), (u'record', 1189), (u'may', 1177), (u'what', 1176), (u'see', 1114), (u'', 1094), (u'global', 951), (u'human', 935), (u'space', 930), (u'', 883)]
'''
In the result, '\r\n' is the most frequent "word", and there are also many words with little practical meaning, such as 'one', 'no', and 'this'. These can be filtered out based on part of speech, as described above.
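For example, here is a small sketch (reusing the santi.txt file from above) that combines jieba.posseg with Counter so that only nouns of at least two characters are counted, which drops '\r\n' and most of these function words:

import jieba.posseg as psg
from collections import Counter

santi_text = open('./santi.txt').read()

# Keep only words whose flag starts with 'n' (nouns, place names, etc.)
# and that are at least two characters long.
santi_nouns = [x.word for x in psg.cut(santi_text)
               if x.flag.startswith('n') and len(x.word) >= 2]

print Counter(santi_nouns).most_common(20)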
Improve word segmentation accuracy using user dictionaries
Segmentation result without the user dictionary:
txt = u'Ouyang Jianguo is the director of the Innovation Office and also an expert in cloud computing at the Gathering Age company.'
print ','.join(jieba.cut(txt))

Ouyang,Jianguo,is,innovation,office,director,also,is,gathering,age,company,cloud,computing,aspects,expert
Segmentation result with the user dictionary:
jieba.load_userdict('user_dict.txt')
print ','.join(jieba.cut(txt))

Ouyang Jianguo,is,Innovation Office,director,also,is,Gathering Age,company,cloud computing,aspects,expert
We can see that the segmentation accuracy is greatly improved after loading the user dictionary.
Note: The content of user_dict.txt is as follows:
Ouyang Jianguo 5
Innovation Office 5 i
Gathering age 5
Cloud computing 5
Each word in the user dictionary is in the following format:
word  frequency  part-of-speech
The frequency is a number and the part of speech is a custom tag. Note that the frequency and the spaces separating the fields must be half-width (ASCII) characters.
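Entries can also be added to (or removed from) the dictionary at runtime, instead of or in addition to loading a file. A small sketch using jieba.add_word and jieba.del_word, reusing the example words above:

import jieba

# Register user words programmatically; freq and tag are optional.
jieba.add_word(u'Innovation Office', freq=5, tag='i')
jieba.add_word(u'Gathering Age')

# A word can also be removed again if needed.
jieba.del_word(u'Innovation Office')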
Appendix: jieba part-of-speech comparison table (sorted by the first letter of the tag)
Adjectives (1 first-level, 4 second-level)
a  adjective
ad  adverbial adjective (adjective used directly as an adverb)
an  nominal adjective (adjective used as a noun)
ag  adjective morpheme
al  adjectival idiom
Distinguishing words (1 first-level, 2 second-level)
b  distinguishing word
bl  distinguishing-word idiom
Conjunctions (1 first-level, 1 second-level)
c  conjunction
cc  coordinating conjunction
Adverbs (1 first-level)
d  adverb
Interjections (1 first-level)
e  interjection
Locality words (1 first-level)
f  locality word (word of direction or position)
Prefixes (1 first-level)
h  prefix
Suffixes (1 first-level)
k  suffix
Numerals (1 first-level, 1 second-level)
m  numeral
mq  numeral-classifier compound
Nouns (1 first-level, 7 second-level, 5 third-level)
Nouns are divided into the following sub-categories:
n  noun
nr  personal name
nr1  Chinese surname
nr2  Chinese given name
nrj  Japanese personal name
nrf  transliterated personal name
ns  place name
nsf  transliterated place name
nt  organization name
nz  other proper noun
nl  noun idiom
ng  noun morpheme
Onomatopoeia (1 first-level)
o  onomatopoeic word
Prepositions (1 first-level, 2 second-level)
p  preposition
pba  the preposition 把 (ba)
pbei  the preposition 被 (bei)
Classifiers (1 first-level, 2 second-level)
q  classifier
qv  verbal classifier
qt  temporal classifier
Pronouns (1 first-level, 4 second-level, 6 third-level)
r  pronoun
rr  personal pronoun
rz  demonstrative pronoun
rzt  temporal demonstrative pronoun
rzs  locative demonstrative pronoun
rzv  predicative demonstrative pronoun
ry  interrogative pronoun
ryt  temporal interrogative pronoun
rys  locative interrogative pronoun
ryv  predicative interrogative pronoun
rg  pronoun morpheme
Locale words (1 first-level)
s  locale word (place word)
Time words (1 first-level, 1 second-level)
t  time word
tg  time-word morpheme
Auxiliary words (1 first-level, 15 second-level)
u  auxiliary word (particle)
uzhe  the particle 着 (zhe)
ule  the particles 了 / 喽 (le, lou)
uguo  the particle 过 (guo)
ude1  the particles 的 / 底 (de, di)
ude2  the particle 地 (de)
ude3  the particle 得 (de)
usuo  the particle 所 (suo)
udeng  等 / 等等 / 云云 (deng, dengdeng, yunyun, "and so on")
uyy  一样 / 一般 / 似的 / 般 (yiyang, yiban, shide, ban, "like, similar to")
udh  的话 (dehua, "if")
uls  来讲 / 来说 / 而言 / 说来 (laijiang, laishuo, eryan, shuolai, "as for, in terms of")
uzhi  the particle 之 (zhi)
ulian  the particle 连 (lian, as in "连小学生都会", "even primary school students can do it")
Verbs (1 first-level, 9 second-level)
v  verb
vd  adverbial verb (verb used directly as an adverb)
vn  nominal verb (verb used as a noun)
vshi  the verb 是 (shi, "to be")
vyou  the verb 有 (you, "to have")
vf  directional verb
vx  formal verb
vi  intransitive verb (inner verb)
vl  verbal idiom
vg  verb morpheme
Punctuation marks (1 first-level, 16 second-level)
w  punctuation mark
wkz  left bracket; full-width: （ 〔 ［ ｛ 《 【 〖 〈  half-width: ( [ { <
wky  right bracket; full-width: ） 〕 ］ ｝ 》 】 〗 〉  half-width: ) ] } >
wyz  left quotation mark; full-width: “ ‘ 『
wyy  right quotation mark; full-width: ” ’ 』
wj  full stop; full-width: 。
ww  question mark; full-width: ？  half-width: ?
wt  exclamation mark; full-width: ！  half-width: !
wd  comma; full-width: ，  half-width: ,
wf  semicolon; full-width: ；  half-width: ;
wn  enumeration comma; full-width: 、
wm  colon; full-width: ：  half-width: :
ws  ellipsis; full-width: …… …
wp  dash; full-width: ——  half-width: --- ----
wb  percent and per-mille signs; full-width: ％ ‰  half-width: %
wh  unit symbol; full-width: ￥ ＄ ￡ ° ℃  half-width: $
Strings (1 first-level, 2 second-level)
x  string
xx  non-morpheme character
xu  URL
Modal particles (1 first-level)
y  modal particle (the tag yg has been removed)
Status words (1 first-level)
z  status word
The above is all the content of this article. I hope it is helpful for your learning.