Notes on Natural language processing

Last Update:2017-10-16 Source: Internet

Author: User

1 Chinese natural language preprocessing

Experimental data preprocessing (this article uses the Python version of the jieba word segmenter, whose name literally means "stutter")
1. Run word segmentation and POS tagging on the crawled reviews (Mac-result.txt).
2. Remove stop words from the result using only a public stop-word list, with no manual screening (mac-result1.txt).
3. Filter by part of speech, with a custom set of tags to keep. For example:
Keep: nouns and noun phrases (both describe the subject of a review); adjectives, verbs, and verb phrases (which describe that subject); and other words that may carry content.
Remove: adverbs, punctuation, onomatopoeia, and other words without content, that is, the tags /x /zg /uj /ul /e /d /uz /y.
The result is Mac-result2.txt.
4. Normalize: merge runs of spaces, remove other whitespace characters, and put the document in the form "word, space, word, space, ...". The result is Mac-result3.txt.
5. Merge compound words. Segmentation is inaccurate for proper nouns and similar terms, so compound words are extracted and merged. The result is mac-result4.txt (the table of extracted compound words is fuheci.txt).
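Steps 2 to 5 can be sketched in plain Python as follows. This is a minimal illustration rather than the script used for the experiment: the function names, the sample tokens, and the greedy merging rule for compound words are all assumptions. The (word, tag) pairs are written by hand here; in practice they would come from the segmenter, for example from `jieba.posseg.cut`, which yields objects with `word` and `flag` attributes.

```python
import re

# Tags dropped in step 3, taken from the list above.
REMOVED_TAGS = {"x", "zg", "uj", "ul", "e", "d", "uz", "y"}

def filter_tokens(tagged, stopwords, removed_tags=REMOVED_TAGS):
    """Steps 2-3: drop stop words and words whose tag is in removed_tags."""
    return [w for w, t in tagged if w not in stopwords and t not in removed_tags]

def merge_compounds(words, compounds, max_len=3):
    """Step 5 (assumed rule): greedily join up to max_len adjacent words
    when their concatenation appears in the compound word table."""
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = "".join(words[i:i + n])
            if candidate in compounds:
                out.append(candidate)
                i += n
                break
        else:  # no compound starts at position i
            out.append(words[i])
            i += 1
    return out

def normalize(words):
    """Step 4: join words with single spaces and collapse stray whitespace."""
    joined = " ".join(w.strip() for w in words)
    return re.sub(r"\s+", " ", joined).strip()

# Hypothetical sample data standing in for one segmented review.
tagged = [("苹果", "n"), ("电脑", "n"), ("非常", "d"), ("好用", "a"),
          ("，", "x"), ("的", "uj"), ("屏幕", "n"), ("清晰", "a")]
stopwords = {"的"}
compounds = {"苹果电脑"}

kept = filter_tokens(tagged, stopwords)
merged = merge_compounds(kept, compounds)
result = normalize(merged)
print(result)
```

The order of the steps here follows the list above except that merging happens before the final join, because the merge works on a word list rather than on a space-separated string.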

2 The ICT (Institute of Computing Technology) Chinese part-of-speech tag set

Version 3.0 Authors: Liu Qun, Zhang Huaping, Zhang Hao

Contents of the ICT Chinese part-of-speech tag set

0. Description
1. Nouns (1 first-level class, 7 second-level classes, 5 third-level classes)
2. Time words (1 first-level class, 1 second-level class)
3. Place words (1 first-level class)
4. Locality words (1 first-level class)
5. Verbs (1 first-level class, 9 second-level classes)
6. Adjectives (1 first-level class, 4 second-level classes)
7. Distinguishing words (1 first-level class, 2 second-level classes)
8. State words (1 first-level class)
9. Pronouns (1 first-level class, 4 second-level classes, 6 third-level classes)
10. Numerals (1 first-level class, 1 second-level class)
11. Quantifiers (1 first-level class, 2 second-level classes)
12. Adverbs (1 first-level class)
13. Prepositions (1 first-level class, 2 second-level classes)
14. Conjunctions (1 first-level class, 1 second-level class)
15. Particles (1 first-level class, 15 second-level classes)
16. Interjections (1 first-level class)
17. Modal particles (1 first-level class)
18. Onomatopoeia (1 first-level class)
19. Prefixes (1 first-level class)
20. Suffixes (1 first-level class)
21. Strings (1 first-level class, 2 second-level classes)
22. Punctuation (1 first-level class, 16 second-level classes)

0. Description: The ICT Chinese part-of-speech tag set (99 tags in total: 22 first-level classes, 66 second-level classes, 11 third-level classes) is used mainly in the Chinese lexical analyzer, syntactic parser, and Chinese-English machine translation system developed at the Institute of Computing Technology, Chinese Academy of Sciences. The tag set draws mainly on the following POS tag sets:

1. The Peking University "People's Daily" corpus POS tag set;

2. The Peking University 2002 new-version POS tag set (draft);

3. The Tsinghua University Chinese Treebank POS tag set;

4. The Ministry of Education Institute of Applied Linguistics POS tag set (draft national recommended standard, 2002);

5. The University of Pennsylvania Chinese Treebank (Chinese Penn Treebank) POS tag set.

Because the ICT Chinese lexical analyzer is trained mainly on the Peking University "People's Daily" corpus, this tag set takes that corpus's POS tag set as its blueprint. It also draws on the grammatical information for Chinese words given in Peking University's "Chinese Grammar Information Dictionary". The following factors were taken into account when developing the tag set:

1. It should help improve the segmentation and tagging accuracy of the Chinese lexical analyzer;

2. It should help improve the accuracy of the Chinese syntactic parser;

3. It should make translation easier for the Chinese-English machine translation system;

4. It should be easy to convert from the Peking University "People's Daily" corpus POS tag set;

5. Words with different grammatical functions should be split into subclasses as finely as possible, as long as this does not make disambiguation in lexical and syntactic analysis too difficult.

Based on these considerations, we try to avoid tags that are error-prone during tagging, and we use tags that are not error-prone but clearly improve the accuracy of Chinese lexical and syntactic analysis. For example, among the verb subclasses we follow the University of Pennsylvania Chinese Treebank and give the verbs "是" (shi, "to be") and "有" (you, "to have") tags of their own instead of using a "copula" tag. The verb "是" has many syntactic functions, of which copula is only one, and distinguishing these functions is very difficult and would lower the accuracy of lexical analysis.

Among the noun subclasses, we distinguish "Chinese names", "Japanese names", and "transliterated names", not only because these three kinds of names need different parameters for training and recognition, but also because they are translated with different analysis algorithms in Chinese-English machine translation. As another example, we merge "numeral + 年 (year)" denoting a calendar year (such as "1995年") into a single time word, whereas "numeral + 年" denoting a number of years is tagged separately as a "numeral" and a "quantifier". We found experimentally that statistical methods can make this distinction with high accuracy at the lexical analysis stage, and the distinction is very important for subsequent syntactic analysis and machine translation.

Some parts of speech (particles and punctuation) are essentially closed sets, and the grammatical functions of the individual words in them differ greatly. In these cases we subdivide the subclasses as finely as possible. In addition, as in other POS tag sets, a subclass in our scheme only marks out special cases within its large class that need to be distinguished; the subclasses do not partition the large class exhaustively.
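The relation between large classes and subclasses can be used directly in code. In the tag lists in this document, every subclass tag begins with the letter of its first-level class (for example nr1 under n, vshi under v, rzt under r), so a fine-grained tag can be collapsed to its large class by taking the first letter. The sketch below relies on that observation; the function names and the sample tokens are illustrative assumptions and are not part of the tag set specification.

```python
from collections import Counter

def first_level(tag):
    """Return the first-level class of a tag: its first letter, lowercased."""
    tag = tag.strip().lower()
    if not tag or not tag[0].isalpha():
        raise ValueError(f"not a tag: {tag!r}")
    return tag[0]

def coarse_counts(tagged):
    """Count tokens per first-level class for (word, tag) pairs."""
    return Counter(first_level(t) for _, t in tagged)

# Hypothetical tagged tokens using fine-grained tags from this tag set.
tagged = [("王", "nr1"), ("小明", "nr2"), ("是", "vshi"),
          ("这时", "rzt"), ("。", "wj")]
counts = coarse_counts(tagged)
print(counts)
```

Collapsing to the first letter is useful when a downstream step, such as the part-of-speech filter in section 1, only cares about the large class.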

1. Nouns (1 first-level class, 7 second-level classes, 5 third-level classes). Nouns are divided into the following subclasses:

n noun
nr personal name
nr1 Chinese surname
nr2 Chinese given name
nrj Japanese personal name
nrf transliterated personal name
ns place name
nsf transliterated place name
nt organization or group name
nz other proper noun
nl nominal idiom
ng noun morpheme

2. Time words (1 first-level class, 1 second-level class)

t time word
tg time-word morpheme

3. Place words (1 first-level class)

s place word

4. Locality words (1 first-level class)

f locality word

5. Verbs (1 first-level class, 9 second-level classes)

v verb
vd adverbial verb
vn nominal verb
vshi the verb "是" (shi, "to be")
vyou the verb "有" (you, "to have")
vf directional verb
vx formal verb
vi intransitive verb
vl verbal idiom
vg verb morpheme

6. Adjectives (1 first-level class, 4 second-level classes)

a adjective
ad adverbial adjective
an nominal adjective
ag adjective morpheme
al adjectival idiom

7. Distinguishing words (1 first-level class, 2 second-level classes)

b distinguishing word
bl distinguishing-word idiom

8. State words (1 first-level class)

z state word

9. Pronouns (1 first-level class, 4 second-level classes, 6 third-level classes)

r pronoun
rr personal pronoun
rz demonstrative pronoun
rzt temporal demonstrative pronoun
rzs locative demonstrative pronoun
rzv predicative demonstrative pronoun
ry interrogative pronoun
ryt temporal interrogative pronoun
rys locative interrogative pronoun
ryv predicative interrogative pronoun
rg pronoun morpheme

10. Numerals (1 first-level class, 1 second-level class)

m numeral
mq numeral-quantifier compound

11. Quantifiers (1 first-level class, 2 second-level classes)

q quantifier
qv verbal quantifier
qt temporal quantifier

12. Adverbs (1 first-level class)

d adverb

13. Prepositions (1 first-level class, 2 second-level classes)

p preposition
pba the preposition "把" (ba)
pbei the preposition "被" (bei)

14. Conjunctions (1 first-level class, 1 second-level class)

c conjunction
cc coordinating conjunction

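15. Particles (1 first-level class, 15 second-level classes)

u particle
uzhe 着 (zhe)
ule 了 (le), 喽 (lou)
uguo 过 (guo)
ude1 的 (de), 底 (di)
ude2 地 (de)
ude3 得 (de)
usuo 所 (suo)
udeng 等 (deng), 等等 (dengdeng), 云云 (yunyun): "and so on"
uyy 一样 (yiyang), 一般 (yiban), 似的 (shide), 般 (ban): "like, as"
udh 的话 (dehua): "if"
uls 来讲 (laijiang), 来说 (laishuo), 而言 (eryan), 说来 (shuolai): "as far as ... is concerned"
uzhi 之 (zhi)
ulian 连 (lian), as in "连小学生都会" ("even elementary school students can")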

16. Interjections (1 first-level class)

e interjection

17. Modal particles (1 first-level class)

y modal particle (the tag yg has been dropped)

18. Onomatopoeia (1 first-level class)

o onomatopoeia

19. Prefixes (1 first-level class)

h prefix

20. Suffixes (1 first-level class)

k suffix

21. Strings (1 first-level class, 2 second-level classes)

x string
xx non-morpheme character
xu URL

22. Punctuation (1 first-level class, 16 second-level classes)

w punctuation
wkz left bracket, full-width: （ 〔 ［ ｛ 《 【 〖 〈 half-width: ( [ { <
wky right bracket, full-width: ） 〕 ］ ｝ 》 】 〗 〉 half-width: ) ] } >
wyz left quotation mark, full-width: “ ‘ 『
wyy right quotation mark, full-width: ” ’ 』
wj period, full-width: 。
ww question mark, full-width: ？ half-width: ?
wt exclamation mark, full-width: ！ half-width: !
wd comma, full-width: ， half-width: ,
wf semicolon, full-width: ； half-width: ;
wn enumeration comma, full-width: 、
wm colon, full-width: ： half-width: :
ws ellipsis, full-width: …… …
wp dash, full-width: －－ half-width: --- ----
wb percent and per-mille signs, full-width: ％ ‰ half-width: %
wh unit symbol, full-width: ￥ ＄ ￡ ° ℃ half-width: $
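Because punctuation is a closed set, its subclasses can be assigned with a plain lookup table. The sketch below covers only part of the list above (single-character marks; quotation marks, ellipses, and dashes are left out for brevity), and the names `PUNCT_TAGS` and `punct_tag` are assumptions made for illustration, not part of any analyzer's API.

```python
# Second-level punctuation tag -> characters it covers (partial table).
PUNCT_TAGS = {
    "wkz": "([{<",   # left brackets, half-width
    "wky": ")]}>",   # right brackets, half-width
    "wj": "。",       # period
    "ww": "？?",      # question mark, full-width and half-width
    "wt": "！!",      # exclamation mark
    "wd": "，,",      # comma
    "wf": "；;",      # semicolon
    "wn": "、",       # enumeration comma
    "wm": "：:",      # colon
    "wb": "％‰%",     # percent and per-mille signs
}

# Invert the table so each character maps to its tag.
CHAR_TO_TAG = {ch: tag for tag, chars in PUNCT_TAGS.items() for ch in chars}

def punct_tag(ch):
    """Return the second-level tag for a punctuation character,
    or None if the character is not in the table."""
    return CHAR_TO_TAG.get(ch)

print([punct_tag(c) for c in "你好，世界！"])
```

Characters that are not punctuation, or that the partial table does not cover, come back as None, so a caller can fall back to the general tag w or to another tagger.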

This article is an English version of an article originally published in Chinese on aliyun.com.
