Chinese word segmentation in Python with Jieba


Reprinted from: http://www.cnblogs.com/jiayongji/p/7119065.html

Jieba is a powerful Chinese word segmenter.

Installing Jieba

pip install jieba

Simple usage

Jieba supports three segmentation modes: precise mode (the default), full mode, and search engine mode. Examples of each follow:

Precise mode

import jieba

s = u'我想和女朋友一起去北京故宫博物院参观和闲逛。'
# With no extra arguments, cut() segments in precise mode
cut = jieba.cut(s)
print('【Output】')
print(cut)
print(','.join(cut))

【Output】
<generator object cut at 0x7f8dbc0efc30>
我,想,和,女朋友,一起,去,北京故宫博物院,参观,和,闲逛,。

As you can see, the segmenter returns a generator. This matters when segmenting large volumes of text, because the words are produced one at a time instead of all being held in memory at once.

Full mode

print('【Output】')
# cut_all=True switches to full mode
print(','.join(jieba.cut(s, cut_all=True)))

【Output】
我,想,和,女朋友,朋友,一起,去,北京,北京故宫,北京故宫博物院,故宫,故宫博物院,博物,博物院,参观,和,闲逛,,

Full mode splits the text into as many words as possible, so overlapping words such as 北京, 故宫 and 北京故宫博物院 all appear in the result.

Search engine mode

print('【Output】')
print(','.join(jieba.cut_for_search(s)))

【Output】
我,想,和,朋友,女朋友,一起,去,北京,故宫,博物,博物院,北京故宫博物院,参观,和,闲逛,。
Getting parts of speech

Every word has a part of speech, such as noun, verb or pronoun. Jieba can also return the part of speech of each word in the segmentation result through the jieba.posseg module, for example:

import jieba.posseg as psg

print('【Output】')
# Each item yielded by psg.cut has a .word and a .flag (part-of-speech tag)
print([(x.word, x.flag) for x in psg.cut(s)])

# Output:
# [(u'我', u'r'), (u'想', u'v'), (u'和', u'c'), (u'女朋友', u'n'), (u'一起', u'm'),
#  (u'去', u'v'), (u'北京故宫博物院', u'ns'), (u'参观', u'n'), (u'和', u'c'), (u'闲逛', u'v'),
#  (u'。', u'x')]

The parts of speech are returned along with the words, which is useful for further processing of the segmentation result. For example, to keep only the nouns in the result, filter on the tag:

# Noun tags all start with 'n' (n, ns, nr, ...)
print([(x.word, x.flag) for x in psg.cut(s) if x.flag.startswith('n')])

# Output:
# [(u'女朋友', u'n'), (u'北京故宫博物院', u'ns'), (u'参观', u'n')]

To find out what each tag letter means, and which tags can appear in Jieba's results, consult a part-of-speech tag table. A copy taken from the web is attached at the end of this article; for more detailed classification information, search online for "jieba part-of-speech tag table".

Parallel segmentation

When the amount of text is very large, parallel segmentation can be enabled to speed things up. Jieba supports parallel segmentation based on Python's built-in multiprocessing module, but note that it is not supported on Windows.

Usage:

# Enable parallel segmentation; the argument is the number of worker processes
jieba.enable_parallel(5)
# Disable parallel segmentation
jieba.disable_parallel()

Example: segmenting the complete text of The Three-Body Problem with parallel segmentation enabled

santi_text = open('./santi.txt').read()
print(len(santi_text))

2681968

The complete Three-Body text is fairly large: its length is more than 2.6 million bytes.
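The length above is given in bytes, which is what `len` reports for a file read as a byte string. For Chinese text the number of characters is considerably smaller. As a small illustration, assuming UTF-8 encoding (the article does not state the encoding of santi.txt), each Chinese character occupies three bytes:

```python
s = "我想和女朋友一起去北京故宫博物院参观和闲逛。"

char_count = len(s)                  # number of Unicode characters
byte_count = len(s.encode("utf-8"))  # number of bytes once encoded as UTF-8

print(char_count, byte_count)  # prints: 22 66
```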

jieba.enable_parallel(100)
# Keep only words that are at least two characters long
santi_words = [x for x in jieba.cut(santi_text) if len(x) >= 2]
jieba.disable_parallel()

Getting the top n most frequent words

For example, to get the 20 most frequent words in the segmentation result:

from collections import Counter

# most_common(20) returns the 20 (word, count) pairs with the highest counts
c = Counter(santi_words).most_common(20)
print(c)

# Output (words shown in English translation):
# [(u'\r\n', 21805), (u'one', 3057), (u'no', 2128), (u'they', 1690), (u'us', 1550),
#  (u'this', 1357), (u'own', 1347), (u'Cheng', 1320), (u'now', 1273), (u'already', 1259),
#  (u'world', 1243), (u'rom', 1189), (u'may', 1177), (u'what', 1176), (u'see', 1114),
#  (u'know', 1094), (u'earth', 951), (u'human', 935), (u'space', 930), (u'Three-Body', 883)]

As the result shows, '\r\n' turns out to be the most frequent "word" (it passes the length filter because it is two characters long), and there are also meaningless words we do not want, such as 'one', 'no' and 'this'. These can be filtered out by part of speech, as shown earlier; this will be covered in detail later.
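The article proposes filtering by part of speech. As a simpler alternative for illustration, unwanted words can also be removed with a stop-word set before counting. The sketch below uses only the standard library; `STOPWORDS`, `top_words` and the sample list are made up for the example and do not come from the Three-Body data.

```python
from collections import Counter

# Hypothetical stop-word set; a real one would be much longer.
STOPWORDS = frozenset({"一个", "没有", "这个"})


def top_words(words, n, stopwords=STOPWORDS):
    """Count words, skipping whitespace-only tokens and stop words."""
    kept = (w for w in words if w.strip() and w not in stopwords)
    return Counter(kept).most_common(n)


sample = ["三体", "\r\n", "一个", "地球", "三体", "\r\n", "没有", "地球", "三体", "\r\n"]
print(top_words(sample, 2))  # prints: [('三体', 3), ('地球', 2)]
```

The `w.strip()` test drops tokens such as '\r\n' that consist only of whitespace, which the length filter alone lets through.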

Using a user dictionary to improve segmentation accuracy

Segmentation result without a user dictionary:

txt = u'欧阳建国是创新办主任也是欢聚时代公司云计算方面的专家'
print(','.join(jieba.cut(txt)))

欧阳,建国,是,创新,办,主任,也,是,欢聚,时代,公司,云,计算,方面,的,专家

Segmentation result with a user dictionary:

# Load the custom words before segmenting
jieba.load_userdict('user_dict.txt')
print(','.join(jieba.cut(txt)))

欧阳建国,是,创新办,主任,也,是,欢聚时代,公司,云计算,方面,的,专家

As you can see, segmentation accuracy improves considerably when a user dictionary is used.

Note: the contents of user_dict.txt are as follows:

欧阳建国 5

创新办 5 i

欢聚时代 5

云计算 5

The user dictionary has one word per line, in the form:

word frequency part-of-speech

The frequency is a number and the part of speech is a custom tag. Note that the frequency digits and the separating spaces must be half-width (ASCII) characters.
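The line format can be made concrete with a small parser. This is only a sketch of the format as described here, not jieba's own parsing code; `parse_user_dict_line` is a hypothetical helper that treats the frequency and tag as optional fields.

```python
def parse_user_dict_line(line):
    """Parse 'word [frequency] [tag]' separated by half-width (ASCII) spaces."""
    word, *rest = line.strip().split(" ")
    freq, tag = None, None
    for field in rest:
        if not field:
            continue  # ignore empty fields produced by repeated spaces
        if field.isascii() and field.isdigit():
            freq = int(field)  # the frequency must use half-width digits
        else:
            tag = field
    return word, freq, tag


print(parse_user_dict_line("创新办 5 i"))
print(parse_user_dict_line("云计算 5"))
# A full-width space (U+3000) is not a separator, so nothing is split off here.
print(parse_user_dict_line("创新办\u30005"))
```

The last call shows why the half-width requirement matters: with a full-width space the whole line is taken as a single word and the frequency is lost.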

Appendix: Jieba part-of-speech tag table (in alphabetical order)

Adjectives (1 top-level class, 4 subclasses)

a adjective

ad adverbial adjective

an nominal adjective

ag adjectival morpheme

al adjectival idiom

Distinguishing words (1 top-level class, 2 subclasses)

b distinguishing word

bl distinguishing-word idiom

Conjunctions (1 top-level class, 1 subclass)

c conjunction

cc coordinating conjunction

Adverbs (1 top-level class)

d adverb

Interjections (1 top-level class)

e interjection

Locality words (1 top-level class)

f locality word

Prefixes (1 top-level class)

h prefix

Suffixes (1 top-level class)

k suffix

Numerals (1 top-level class, 1 subclass)

m numeral

mq numeral-measure compound

Nouns (1 top-level class, 7 subclasses, 5 third-level classes)

Nouns are divided into the following subcategories:

n noun

nr personal name

nr1 Chinese surname

nr2 Chinese given name

nrj Japanese personal name

nrf transliterated personal name

ns place name

nsf transliterated place name

nt organization or group name

nz other proper noun

nl nominal idiom

ng nominal morpheme

Onomatopoeia (1 top-level class)

o onomatopoeia

Prepositions (1 top-level class, 2 subclasses)

p preposition

pba preposition 把

pbei preposition 被

Measure words (1 top-level class, 2 subclasses)

q measure word

qv verbal measure word

qt temporal measure word

Pronouns (1 top-level class, 4 subclasses, 6 third-level classes)

r pronoun

rr personal pronoun

rz demonstrative pronoun

rzt temporal demonstrative pronoun

rzs locative demonstrative pronoun

rzv predicative demonstrative pronoun

ry interrogative pronoun

ryt temporal interrogative pronoun

rys locative interrogative pronoun

ryv predicative interrogative pronoun

rg pronominal morpheme

Place words (1 top-level class)

s place word

Time words (1 top-level class, 1 subclass)

t time word

tg time-word morpheme

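Particles (1 top-level class, 15 subclasses)

u particle

uzhe particle 着

ule particle 了, 喽

uguo particle 过

ude1 particle 的, 底

ude2 particle 地

ude3 particle 得

usuo particle 所

udeng particle 等, 等等, 云云 ("and so on")

uyy particle 一样, 一般, 似的, 般 ("like", "as")

udh particle 的话

uls particle 来讲, 来说, 而言, 说来 ("as far as ... is concerned")

uzhi particle 之

ulian particle 连 (as in "even elementary school students")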

Verbs (1 top-level class, 9 subclasses)

v verb

vd adverbial verb

vn nominal verb

vshi the verb 是 ("to be")

vyou the verb 有 ("to have")

vf directional verb

vx formal verb

vi intransitive verb

vl verbal idiom

vg verbal morpheme

Punctuation (1 top-level class, 16 subclasses)

w punctuation mark

wkz left bracket, full-width: ( 〔 [ { 《 【 〖 〈 half-width: ( [ { <

wky right bracket, full-width: ) 〕 ] } 》 】 〗 〉 half-width: ) ] } >

wyz left quotation mark, full-width: “ ‘ 『

wyy right quotation mark, full-width: ” ’ 』

wj full stop, full-width: 。

ww question mark, full-width: ? half-width: ?

wt exclamation mark, full-width: ! half-width: !

wd comma, full-width: , half-width: ,

wf semicolon, full-width: ; half-width: ;

wn enumeration comma, full-width: 、

wm colon, full-width: : half-width: :

ws ellipsis, full-width: …… …

wp dash, full-width: —— -- ——- half-width: --- ----

wb percent and per-mille signs, full-width: % ‰ half-width: %

wh unit symbol, full-width: ¥ $ £ ° ℃ half-width: $

Strings (1 top-level class, 2 subclasses)

x string

xx non-morpheme character

xu URL

Modal particles (1 top-level class)

y modal particle (the yg tag has been removed)

State words (1 top-level class)

z state word
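Since the tags are hierarchical (for example, nrf is a kind of nr, which is a kind of n), a tag that is missing from a lookup table can be resolved through its longest known prefix. The sketch below uses only the standard library; `TAG_NAMES` is a small excerpt of the table above and `describe_tag` is a hypothetical helper.

```python
# A small excerpt of the tag table; further rows can be added as needed.
TAG_NAMES = {
    "a": "adjective",
    "c": "conjunction",
    "m": "numeral",
    "n": "noun",
    "nr": "personal name",
    "ns": "place name",
    "r": "pronoun",
    "v": "verb",
    "vn": "nominal verb",
    "x": "string",
}


def describe_tag(tag):
    """Return the name of the longest known prefix of tag, or 'unknown'."""
    for end in range(len(tag), 0, -1):
        name = TAG_NAMES.get(tag[:end])
        if name is not None:
            return name
    return "unknown"


for tag in ["ns", "nrf", "vshi", "q"]:
    print(tag, "->", describe_tag(tag))
```

Here nrf falls back to nr and vshi falls back to v, while q is reported as unknown because no prefix of it is in the excerpt.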
