Basic usage of the Python jieba word segmentation module

jieba is a powerful Python library for Chinese word segmentation. This article briefly summarizes its basic usage.

Features

  1. Three word segmentation modes are supported:
    1. Accurate mode: tries to cut the sentence into the most precise segmentation; suitable for text analysis.
    2. Full mode: scans out every span in the sentence that can form a word. It is very fast, but it cannot resolve ambiguity.
    3. Search engine mode: builds on accurate mode by further splitting long words to improve recall; suitable for search engine indexing.
  2. Supports Traditional Chinese word segmentation
  3. Supports custom dictionaries
  4. MIT license
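
To make the difference between the three modes concrete, here is a toy sketch over a tiny hand-made dictionary. It is not jieba's implementation: TOY_DICT and all three helper functions are hypothetical, and greedy longest match merely stands in for accurate mode (jieba itself chooses among candidate words using word frequencies).

```python
# Toy illustration only; this is NOT how jieba is implemented.
# TOY_DICT and the three helpers below are hypothetical.
TOY_DICT = {"北京", "故宫", "博物院", "故宫博物院", "北京故宫博物院"}


def full_scan(text, words):
    """Full-mode style: return every dictionary word found at any position."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in words:
                found.append(text[start:end])
    return found


def longest_match(text, words):
    """One non-overlapping segmentation: greedy longest match from the left."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            # Fall back to a single character when nothing longer matches.
            if text[pos:end] in words or end == pos + 1:
                tokens.append(text[pos:end])
                pos = end
                break
    return tokens


def split_long_words(tokens, words, min_len=2):
    """Search-engine style: also emit shorter dictionary words inside each token."""
    out = []
    for tok in tokens:
        for n in range(min_len, len(tok)):
            for i in range(len(tok) - n + 1):
                if tok[i:i + n] in words:
                    out.append(tok[i:i + n])
        out.append(tok)
    return out


text = "北京故宫博物院"
print(full_scan(text, TOY_DICT))
# ['北京', '北京故宫博物院', '故宫', '故宫博物院', '博物院']
print(longest_match(text, TOY_DICT))
# ['北京故宫博物院']
print(split_long_words(longest_match(text, TOY_DICT), TOY_DICT))
# ['北京', '故宫', '博物院', '故宫博物院', '北京故宫博物院']
```

The full scan returns overlapping candidates without choosing between them, which is why full mode cannot resolve ambiguity. The last call keeps the long word and adds its shorter parts, which is what helps recall in a search index.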

Install jieba

pip install jieba

Simple usage

jieba offers three segmentation modes: accurate mode (the default), full mode, and search engine mode. Each is demonstrated below. The sample sentence is Chinese; it and the resulting tokens are shown here in English translation.

Accurate Mode

import jieba

s = u'I want to visit and stroll with my girlfriend at the Palace Museum in Beijing.'
cut = jieba.cut(s)  # accurate mode is the default
print('【Output】')
print(cut)            # the generator object itself
print(','.join(cut))  # iterating consumes the generator

【Output】
<generator object cut at 0x7f8dbc0efc30>
I,want,and,girlfriend,together,go,Beijing Palace Museum,visit,and,stroll,.

As the output shows, jieba.cut returns a generator rather than a list. This matters when segmenting large volumes of text, because tokens are produced lazily instead of being held in memory all at once.
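
One practical consequence of getting a generator back is that it can only be iterated once. The sketch below shows this with a hypothetical stand-in, fake_cut, so that it runs without jieba installed; the same behaviour applies to any Python generator. If a list is needed, wrapping the call in list() works, and jieba also provides jieba.lcut, which returns a list directly.

```python
def fake_cut(text):
    """Hypothetical stand-in for jieba.cut: yields tokens lazily."""
    for token in text.split():
        yield token


tokens = fake_cut("I want to visit the Palace Museum")
first = ",".join(tokens)    # consumes the generator
second = ",".join(tokens)   # nothing is left to iterate
print(repr(first))    # 'I,want,to,visit,the,Palace,Museum'
print(repr(second))   # ''

# Materialize the tokens once if they are needed more than once.
token_list = list(fake_cut("I want to visit the Palace Museum"))
print(len(token_list))        # 7
print(",".join(token_list) == first)   # True
```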

Full Mode

print('【Output】')
print(','.join(jieba.cut(s, cut_all=True)))

【Output】
I, want to, and, my girlfriend, and friends, go to Beijing, Beijing, the Palace Museum in Beijing, the Palace Museum, the Palace Museum, museum, visit, and, stroll,,

Full mode splits the text into as many words as possible, including overlapping ones.

Search engine Mode

print('【Output】')
print(','.join(jieba.cut_for_search(s)))

【Output】
I, want to, and, friends, girlfriends, go to Beijing, the Palace Museum, the museum, the Palace Museum in Beijing, visit, and, stroll,.

Get part of speech

Every word has a part of speech, such as noun, verb, or pronoun. jieba can return the part of speech of each word along with the segmentation result. The following example uses jieba.posseg:

import jieba.posseg as psg

print('【Output】')
print([(x.word, x.flag) for x in psg.cut(s)])

# Output:
'''
[(u'I', u'r'), (u'want', u'v'), (u'and', u'c'), (u'girlfriend', u'n'), (u'together', u'm'), (u'go', u'v'), (u'Beijing Palace Museum', u'ns'), (u'visit', u'n'), (u'and', u'c'), (u'stroll', u'v'), (u'.', u'x')]
'''

The part of speech of each word is now available, which is helpful for further processing of the segmentation result. For example, to keep only the nouns, filter on the flag as follows:

print([(x.word, x.flag) for x in psg.cut(s) if x.flag.startswith('n')])

# Output:
'''
[(u'girlfriend', u'n'), (u'Beijing Palace Museum', u'ns'), (u'visit', u'n')]
'''
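
The same filter generalizes to several tag families at once, because str.startswith accepts a tuple of prefixes. The sketch below works on plain (word, flag) pairs based on the tagged output shown earlier (English glosses), so it runs without jieba; keep_by_flag is a hypothetical helper name.

```python
# (word, flag) pairs based on the tagged output shown earlier (English glosses).
tagged = [("I", "r"), ("want", "v"), ("and", "c"), ("girlfriend", "n"),
          ("together", "m"), ("go", "v"), ("Beijing Palace Museum", "ns"),
          ("visit", "n"), ("and", "c"), ("stroll", "v"), (".", "x")]


def keep_by_flag(pairs, prefixes):
    """Keep the pairs whose flag starts with any of the given prefixes."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    return [(word, flag) for word, flag in pairs if flag.startswith(prefixes)]


nouns = keep_by_flag(tagged, ["n"])
nouns_and_verbs = keep_by_flag(tagged, ["n", "v"])
print(nouns)
# [('girlfriend', 'n'), ('Beijing Palace Museum', 'ns'), ('visit', 'n')]
print([word for word, _ in nouns_and_verbs])
# ['want', 'girlfriend', 'go', 'Beijing Palace Museum', 'visit', 'stroll']
```

Matching on a prefix rather than an exact tag is what lets 'n' also catch sub-categories such as 'ns'.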

To find out what each tag letter means, refer to a part-of-speech tag table. One such table, found online, is provided at the end of this article. To learn more about the classification, search the web for "jieba part-of-speech tag table".

Parallel Word Segmentation

When the text is very large, enabling parallel segmentation can improve throughput. jieba supports parallel segmentation, built on Python's standard multiprocessing module. Note, however, that it is not supported on Windows.

Usage:

# Enable parallel segmentation; the argument is the number of worker processes
jieba.enable_parallel(5)

# Disable parallel segmentation
jieba.disable_parallel()

Example: enable parallel mode to segment the full text of Three-Body (santi.txt)

santi_text = open('./santi.txt').read()
print(len(santi_text))

2681968

As the output shows, the complete Three-Body text is large: more than 2.6 million bytes.
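
What len reports depends on how the file is read. Under Python 2, open().read() returns a byte string, so the count is in bytes. In Python 3, text mode returns decoded characters, and for Chinese text in UTF-8 the two counts differ by roughly a factor of three. The following sketch illustrates this with a small temporary file; the sample content is made up.

```python
import os
import tempfile

# Made-up sample: two Chinese characters plus a Windows line ending, three times.
sample = "三体\r\n" * 3

with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(sample.encode("utf-8"))
    path = f.name

with open(path, "rb") as f:
    n_bytes = len(f.read())            # raw bytes on disk

# newline="" keeps "\r\n" as two characters instead of translating it to "\n".
with open(path, encoding="utf-8", newline="") as f:
    n_chars = len(f.read())            # decoded characters

os.remove(path)
print(n_bytes, n_chars)   # 24 12
```

Passing an explicit encoding when opening the file also avoids depending on the platform default.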

jieba.enable_parallel(100)
# Keep only tokens that are at least two characters long
santi_words = [x for x in jieba.cut(santi_text) if len(x) >= 2]
jieba.disable_parallel()
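
A request such as 100 processes usually exceeds the number of CPU cores, and extra workers beyond that rarely help. The sketch below is one way to pick a worker count and to stay serial on Windows; choose_workers is a hypothetical helper, not part of jieba, and the jieba calls appear only in comments so the block runs without the library.

```python
import multiprocessing
import os


def choose_workers(requested, os_name=os.name, cpu_total=None):
    """Return a worker count for jieba.enable_parallel, or 0 to stay serial.

    Hypothetical helper: caps the request at the CPU count and returns 0 on
    Windows (os.name == 'nt'), where parallel mode is not supported.
    """
    if os_name == "nt":
        return 0
    if cpu_total is None:
        cpu_total = multiprocessing.cpu_count()
    return max(1, min(requested, cpu_total))


workers = choose_workers(100)
print(workers)  # depends on the machine

# Intended use (requires jieba; not executed here):
#     if workers:
#         jieba.enable_parallel(workers)
#     words = list(jieba.cut(text))
#     if workers:
#         jieba.disable_parallel()
```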

Obtain Top n words

Continuing with the Three-Body text above, the 20 most frequent words in the segmentation result can be obtained as follows:

from collections import Counter

c = Counter(santi_words).most_common(20)
print(c)

# Output:
'''
[(u'\r\n', 21805), (u'A', 3057), (u'n', 2128), (u'theirs', 1690), (u'ours', 1550), (u'ours', 1357), (u'yourself', 1347), (u'Cheng Xin', 1320), (u'', 1273), (u'already', 1259), (u'World', 1243), (u'record', 1189), (u'May', 1177), (u'What', 1176), (u'see', 1114), (u'', 1094), (u'global', 951), (u'human', 935), (u'space', 930), (u'', 883)]
'''

In the result, '\r\n' is the most frequent token, and there are also function words that carry little meaning, such as "one", "none", and "this". Such tokens can be filtered out by part of speech, as described earlier.
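
Besides filtering by part of speech, a simpler alternative is to drop whitespace-only tokens and a stop-word set before counting. The sketch below assumes a small made-up token list in place of santi_words; top_n and STOP_WORDS are hypothetical names.

```python
from collections import Counter

# Made-up tokens standing in for santi_words.
words = ["\r\n", "Cheng Xin", "world", "\r\n", "already",
         "world", "Cheng Xin", "  ", "world", "\r\n"]

# Hypothetical stop-word set; a real one would be much larger.
STOP_WORDS = {"already"}


def top_n(tokens, n, stop_words=frozenset()):
    """Count tokens, skipping whitespace-only tokens and stop words."""
    kept = (t for t in tokens if t.strip() and t not in stop_words)
    return Counter(kept).most_common(n)


print(Counter(words).most_common(1))   # [('\r\n', 3)]
print(top_n(words, 2, STOP_WORDS))     # [('world', 3), ('Cheng Xin', 2)]
```

Without the filter, the line break wins; with it, only content words remain in the ranking.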

Improve word segmentation accuracy using user dictionaries

Segmentation result without a user dictionary:

txt = u'Ouyang Jianguo is the director of the Innovation Office and also an expert in cloud computing at the Gathering Age company.'
print(','.join(jieba.cut(txt)))

Ouyang,Jianguo,is,Innovation,Office,director,also,is,Gathering,Age,company,cloud,computing,aspect,expert

Segmentation result with the user dictionary:

jieba.load_userdict('user_dict.txt')
print(','.join(jieba.cut(txt)))

Ouyang Jianguo,is,Innovation Office,director,also,is,Gathering Age,company,cloud computing,aspect,expert

As the two results show, segmentation accuracy improves considerably once the user dictionary is loaded.

Note: the content of user_dict.txt is as follows (the entries are Chinese words, shown here in English translation):

Ouyang Jianguo 5
Innovation Office 5 i
Gathering Age 5
Cloud computing 5

Each line of the user dictionary has the following format:

word frequency part_of_speech

The frequency is a number and the part of speech is a custom tag; the tag may be omitted, as in most of the entries above. Note that the frequency digits and the separating spaces must be half-width (ASCII) characters.
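
Generating the dictionary file from code is one way to guarantee half-width digits and spaces. The sketch below writes and reads back a small file in UTF-8, which is the encoding jieba expects for dictionary files. The Chinese entries are assumed equivalents of the translated entries above, and format_entry and write_user_dict are hypothetical helpers, not part of jieba.

```python
import io
import os
import tempfile

# Assumed Chinese equivalents of the translated entries above: (word, freq, tag).
ENTRIES = [("欧阳建国", 5, None), ("创新办", 5, "i"), ("云计算", 5, None)]


def format_entry(word, freq=None, tag=None):
    """Build one line: word, then optional frequency, then optional tag."""
    parts = [word]
    if freq is not None:
        parts.append(str(int(freq)))   # str() always yields ASCII digits
    if tag:
        parts.append(tag)
    return " ".join(parts)             # ASCII (half-width) space separator


def write_user_dict(path, entries):
    """Write one entry per line, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            f.write(format_entry(word, freq, tag) + u"\n")


path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
write_user_dict(path, ENTRIES)

with io.open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)   # ['欧阳建国 5', '创新办 5 i', '云计算 5']

# The file can then be loaded with jieba.load_userdict(path).
```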

Appendix: jieba part-of-speech tag table (sorted by the first letter of the tag)

Adjectives (1 first-level tag, 4 second-level tags)

a adjective

ad adverbial adjective

an nominal adjective

ag adjectival morpheme

al adjectival idiom

Distinguishing words (1 first-level tag, 2 second-level tags)

b distinguishing word

bl distinguishing-word idiom

Conjunctions (1 first-level tag, 1 second-level tag)

c conjunction

cc coordinating conjunction

Adverbs (1 first-level tag)

d adverb

Interjections (1 first-level tag)

e interjection

Locality words (1 first-level tag)

f locality word

Prefixes (1 first-level tag)

h prefix

Suffixes (1 first-level tag)

k suffix

Numerals (1 first-level tag, 1 second-level tag)

m numeral

mq numeral-classifier compound

Nouns (1 first-level tag, 7 second-level tags, 5 third-level tags)

Nouns are divided into the following sub-categories:

n noun

nr personal name

nr1 Chinese surname

nr2 Chinese given name

nrj Japanese personal name

nrf transliterated personal name

ns place name

nsf transliterated place name

nt organization or group name

nz other proper noun

nl nominal idiom

ng nominal morpheme

Onomatopoeia (1 first-level tag)

o onomatopoeia

Prepositions (1 first-level tag, 2 second-level tags)

p preposition

pba preposition "把" (ba)

pbei preposition "被" (bei)

Measure words (1 first-level tag, 2 second-level tags)

q measure word

qv verbal measure word

qt temporal measure word

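Pronouns (1 first-level tag, 4 second-level tags, 6 third-level tags)

r pronoun

rr personal pronoun

rz demonstrative pronoun

rzt temporal demonstrative pronoun

rzs locative demonstrative pronoun

rzv predicative demonstrative pronoun

ry interrogative pronoun

ryt temporal interrogative pronoun

rys locative interrogative pronoun

ryv predicative interrogative pronoun

rg pronominal morpheme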

Place words (1 first-level tag)

s place word

Time words (1 first-level tag, 1 second-level tag)

t time word

tg time-word morpheme

Particles (1 first-level tag, 15 second-level tags)

u particle

uzhe 着

ule 了, 喽

uguo 过

ude1 的, 底

ude2 地

ude3 得

usuo 所

udeng 等, 等等, 云云

uyy 一样, 一般, 似的, 般

udh 的话

uls 来讲, 来说, 而言, 说来

uzhi 之

ulian 连 (as in "连小学生都会", "even primary school pupils can do it")

Verbs (1 first-level tag, 9 second-level tags)

v verb

vd adverbial verb

vn nominal verb

vshi verb "是" (to be)

vyou verb "有" (to have)

vf directional verb

vx formal verb

vi intransitive verb

vl verbal idiom

vg verbal morpheme

Punctuation (1 first-level tag, 16 second-level tags)

w punctuation mark

wkz left bracket, full-width: （ 〔 ［ ｛ 《 【 〖 〈 half-width: ( [ { <

wky right bracket, full-width: ） 〕 ］ ｝ 》 】 〗 〉 half-width: ) ] } >

wyz left quotation mark, full-width: “ ‘ 『

wyy right quotation mark, full-width: ” ’ 』

wj full stop, full-width: 。

ww question mark, full-width: ？ half-width: ?

wt exclamation mark, full-width: ！ half-width: !

wd comma, full-width: ， half-width: ,

wf semicolon, full-width: ； half-width: ;

wn enumeration comma, full-width: 、

wm colon, full-width: ： half-width: :

ws ellipsis, full-width: …… …

wp dash, full-width: －－ half-width: --- ----

wb percent and per-mille sign, full-width: ％ ‰ half-width: %

wh unit symbol, full-width: ￥ ＄ ￡ ° ℃ half-width: $

Strings (1 first-level tag, 2 second-level tags)

x string

xx non-morpheme character

xu URL

Modal particles (1 first-level tag)

y modal particle (yg removed)

State words (1 first-level tag)

z state word

That is all for this article. I hope it helps with your learning.
