Notes on Natural language processing

Source: Internet
Author: User

1 Chinese natural language preprocessing
    • Experimental data preprocessing (this article uses the Python version of the jieba word segmenter)
    • 1. Segment the crawled reviews into words and tag their parts of speech (Mac-result.txt)
    • 2. Remove stop words from the result using only a publicly available stop-word list, with no manual screening (mac-result1.txt)
    • 3. Filter by part of speech, choosing which parts of speech to keep. For example:
    • Kept: nouns and noun phrases (both describe the subject of a review)
    • adjectives, verbs, verb phrases (which describe that subject), and other words that may carry content meaning
    • Removed: adverbs, punctuation, onomatopoeia, and other words without content meaning, including the tags /x/zg/uj/ul/e/d/uz/y
    • The result is Mac-result2.txt
    • 4. Normalize: merge runs of spaces and remove other whitespace characters so that the document takes the form "word, space, word, space, ...". The result is Mac-result3.txt
    • 5. Merge compound words. Segmentation is not always accurate, for example around proper nouns, so compound words are extracted and re-joined, giving mac-result4.txt (the extracted compound-word table is fuheci.txt)
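
Steps 2 to 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the result files: the function names, the sample sentence, the stop-word set and the compound table are all hypothetical, and the input is a hand-tagged list of (word, tag) pairs. With jieba, such pairs would normally come from `jieba.posseg.cut(text)`, which yields objects carrying `.word` and `.flag` attributes.

```python
import re

# Tags dropped in step 3, copied from the removal list above.
REMOVED_FLAGS = {"x", "zg", "uj", "ul", "e", "d", "uz", "y"}


def remove_stopwords(pairs, stopwords):
    """Step 2: drop every (word, flag) pair whose word is in the stop list."""
    return [(w, f) for w, f in pairs if w not in stopwords]


def filter_by_pos(pairs, removed_flags=REMOVED_FLAGS):
    """Step 3: keep only words whose POS tag is not in the removal set."""
    return [(w, f) for w, f in pairs if f not in removed_flags]


def normalize(words):
    """Step 4: strip whitespace inside tokens, drop empties, join with one space."""
    cleaned = (re.sub(r"\s+", "", w) for w in words)
    return " ".join(w for w in cleaned if w)


def merge_compounds(words, compounds):
    """Step 5: greedily re-join adjacent tokens that spell a listed compound."""
    max_len = max((len(c) for c in compounds), default=0)
    merged, i = [], 0
    while i < len(words):
        best_end = i + 1          # default: keep the token on its own
        candidate = words[i]
        j = i + 1
        while j < len(words) and len(candidate) < max_len:
            candidate += words[j]
            j += 1
            if candidate in compounds:
                best_end = j      # remember the longest match so far
        merged.append("".join(words[i:best_end]))
        i = best_end
    return merged


def preprocess(pairs, stopwords, compounds):
    """Run steps 2 to 5 on one tagged sentence and return the final line."""
    pairs = remove_stopwords(pairs, stopwords)
    pairs = filter_by_pos(pairs)
    line = normalize(w for w, _ in pairs)
    return " ".join(merge_compounds(line.split(), compounds))


# Hypothetical hand-tagged sentence (tags follow the jieba-style flags above).
SAMPLE_PAIRS = [
    ("苹果", "n"), ("电脑", "n"), ("的", "uj"), ("屏幕", "n"),
    ("非常", "d"), ("清晰", "a"), ("。", "x"),
]

if __name__ == "__main__":
    print(preprocess(SAMPLE_PAIRS, {"的"}, {"苹果电脑"}))  # prints: 苹果电脑 屏幕 清晰
```

Keeping each step as its own function mirrors the way the notes save one intermediate file per step, so any stage can be inspected or rerun on its own.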
2 ICT Chinese part-of-speech tag set

Version 3.0. Authors: Liu Qun, Zhang Huaping, Zhang Hao

ICT Chinese part-of-speech tag set

    • 0. Description
    • 1. Nouns (1 first-level class, 7 second-level classes, 5 third-level classes)
    • 2. Time words (1 first-level class, 1 second-level class)
    • 3. Locative words (1 first-level class)
    • 4. Nouns of locality (1 first-level class)
    • 5. Verbs (1 first-level class, 9 second-level classes)
    • 6. Adjectives (1 first-level class, 4 second-level classes)
    • 7. Distinguishing words (1 first-level class, 2 second-level classes)
    • 8. State words (1 first-level class)
    • 9. Pronouns (1 first-level class, 4 second-level classes, 6 third-level classes)
    • 10. Numerals (1 first-level class, 1 second-level class)
    • 11. Quantifiers (1 first-level class, 2 second-level classes)
    • 12. Adverbs (1 first-level class)
    • 13. Prepositions (1 first-level class, 2 second-level classes)
    • 14. Conjunctions (1 first-level class, 1 second-level class)
    • 15. Particles (1 first-level class, 15 second-level classes)
    • 16. Interjections (1 first-level class)
    • 17. Modal particles (1 first-level class)
    • 18. Onomatopoeia (1 first-level class)
    • 19. Prefixes (1 first-level class)
    • 20. Suffixes (1 first-level class)
    • 21. Strings (1 first-level class, 2 second-level classes)
    • 22. Punctuation (1 first-level class, 16 second-level classes)

0. Description: The ICT Chinese part-of-speech tag set (99 tags in total: 22 first-level classes, 66 second-level classes, and 11 third-level classes) is mainly used in the Chinese lexical analyzer, syntactic parser, and Chinese-English machine translation system developed at the Institute of Computing Technology, Chinese Academy of Sciences. The tag set draws mainly on the following part-of-speech tag sets:

1. The Peking University "People's Daily" corpus part-of-speech tag set;

2. The Peking University 2002 new-version part-of-speech tag set (draft);

3. The Tsinghua University Chinese Treebank part-of-speech tag set;

4. The part-of-speech tag set of the Institute of Applied Linguistics, Ministry of Education (draft national recommended standard, 2002);

5. The part-of-speech tag set of the University of Pennsylvania Chinese Treebank (Chinese Penn Treebank), United States;


Because the ICT Chinese lexical analyzer is trained mainly on the Peking University "People's Daily" corpus, this tag set takes the part-of-speech tag set of that corpus as its blueprint, and also refers to the grammatical information on Chinese words given in Peking University's "Chinese Grammar Information Dictionary". The following factors were taken into account while developing the tag set:

1. It should help improve the segmentation and tagging accuracy of the Chinese lexical analyzer;

2. It should help improve the accuracy of the Chinese syntactic parser;

3. It should help the Chinese-English machine translation system translate;

4. It should be easy to convert from the Peking University "People's Daily" corpus part-of-speech tag set;

5. Words with different grammatical functions should be divided into subclasses as finely as possible, provided that doing so does not make ambiguity resolution in lexical and syntactic analysis harder.

Based on these considerations, we try to avoid tags that are error-prone during tagging, and instead adopt tags that are hard to get wrong yet clearly improve the accuracy of Chinese lexical and syntactic analysis. For example, among the verb subclasses we follow the University of Pennsylvania Chinese Treebank and give the Chinese verbs "shi" (是, "to be") and "you" (有, "to have") tags of their own, rather than using a "copular verb" tag. The same verb "shi" has many syntactic functions, of which acting as a copula is only one, and telling these functions apart is very difficult, so a "copular verb" tag would lower the accuracy of lexical analysis.


Among the noun subclasses, we distinguish "Chinese personal names", "Japanese personal names" and "transliterated personal names", not only because these three kinds of names must be trained and recognized with different parameters, but also because Chinese-English machine translation has to translate them with different analysis algorithms. As another example, we merge "numeral + 'year'" denoting a calendar year (such as "1995") into a single time word, whereas "numeral + 'year'" denoting a number of years is tagged as a "numeral" followed by a "quantifier". We found experimentally that statistical methods can make this distinction with high accuracy during lexical analysis, and the distinction is very important for subsequent syntactic analysis and machine translation.
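
To make the two labelings in the "year" example concrete, here is a toy sketch. The passage says the authors draw this distinction with statistical methods; the rule below (a four-digit numeral is treated as a calendar year) is a deliberately crude stand-in and an assumption of this sketch, as are the function and pattern names. The tags `t`, `m` and `q` are the time-word, numeral and quantifier tags from the tag set.

```python
import re

# Hypothetical pattern: an all-digit numeral immediately followed by 年 ("year").
YEAR_TOKEN = re.compile(r"^(\d+)年$")


def tag_year_expression(token):
    """Toy rule, not the authors' statistical model.

    A four-digit numeral + 年 is kept whole as a time word (t);
    any other numeral + 年 is split into numeral (m) + quantifier (q).
    """
    match = YEAR_TOKEN.match(token)
    if match is None:
        raise ValueError(f"not a numeral + 年 expression: {token!r}")
    digits = match.group(1)
    if len(digits) == 4:
        return [(token, "t")]
    return [(digits, "m"), ("年", "q")]


if __name__ == "__main__":
    print(tag_year_expression("1995年"))  # [('1995年', 't')]
    print(tag_year_expression("5年"))     # [('5', 'm'), ('年', 'q')]
```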

Some parts of speech (particles and punctuation) are essentially closed sets, and the grammatical functions of the individual words in them differ greatly; in these cases we subdivide into subclasses as finely as possible. In addition, as in other part-of-speech tag sets, the subclasses in our system only single out certain necessary special cases within a major class; the division into subclasses is not exhaustive.
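
Because the subclasses are not exhaustive, downstream code often only needs the first-level class. In the tag listing in this document, every subclass tag begins with the letter of its first-level class, so collapsing a tag can be sketched as below. The dictionary, the English class names and the function name are this sketch's own choices, not part of the tag set.

```python
# The 22 first-level classes, keyed by the first letter of the tag.
FIRST_LEVEL = {
    "n": "noun", "t": "time word", "s": "locative word",
    "f": "noun of locality", "v": "verb", "a": "adjective",
    "b": "distinguishing word", "z": "state word", "r": "pronoun",
    "m": "numeral", "q": "quantifier", "d": "adverb",
    "p": "preposition", "c": "conjunction", "u": "particle",
    "e": "interjection", "y": "modal particle", "o": "onomatopoeia",
    "h": "prefix", "k": "suffix", "x": "string", "w": "punctuation",
}


def first_level_class(tag):
    """Map any tag (e.g. 'nr1', 'vshi', 'wkz') to its first-level class name."""
    key = tag[:1].lower()
    if key not in FIRST_LEVEL:
        raise KeyError(f"unknown tag: {tag!r}")
    return FIRST_LEVEL[key]


if __name__ == "__main__":
    for tag in ("nr1", "vshi", "ude1", "wkz"):
        print(tag, "->", first_level_class(tag))
```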


1. Nouns (1 first-level class, 7 second-level classes, 5 third-level classes). Nouns are divided into the following subclasses:

    • n noun
    • nr personal name
    • nr1 Chinese surname
    • nr2 Chinese given name
    • nrj Japanese personal name
    • nrf transliterated personal name
    • ns place name
    • nsf transliterated place name
    • nt organization or group name
    • nz other proper noun
    • nl nominal idiom
    • ng noun morpheme
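
The noun tags form a small tree, which can be written down as parent links. The nesting used here (nr1, nr2, nrj and nrf under nr; nsf under ns; everything else directly under n) is inferred from the tag names and is an assumption of this sketch, as are the names `NOUN_PARENT`, `ancestors` and `level`.

```python
# Parent of each noun subclass tag; "n" is the first-level class (no parent).
NOUN_PARENT = {
    "nr": "n", "ns": "n", "nt": "n", "nz": "n", "nl": "n", "ng": "n",
    "nr1": "nr", "nr2": "nr", "nrj": "nr", "nrf": "nr",
    "nsf": "ns",
}
NOUN_TAGS = {"n"} | set(NOUN_PARENT)


def ancestors(tag):
    """Return the chain of parent tags, nearest first (e.g. nr1 -> [nr, n])."""
    current = tag.lower()
    if current not in NOUN_TAGS:
        raise KeyError(f"not a noun tag: {tag!r}")
    chain = []
    while current in NOUN_PARENT:
        current = NOUN_PARENT[current]
        chain.append(current)
    return chain


def level(tag):
    """1 for the first-level class, 2 for second-level, 3 for third-level."""
    return len(ancestors(tag)) + 1


if __name__ == "__main__":
    print(ancestors("nr1"))  # ['nr', 'n']
    print(level("nsf"))      # 3
```

A tagger that cannot tell nr1 from nr2 reliably can back off to `ancestors(tag)[0]` without losing the fact that the token is a personal name.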

2. Time words (1 first-level class, 1 second-level class)

    • t time word
    • tg time-word morpheme

3. Locative words (1 first-level class)

    • s locative word

4. Nouns of locality (1 first-level class)

    • f noun of locality

5. Verbs (1 first-level class, 9 second-level classes)

    • v verb
    • vd adverbial verb
    • vn nominal verb
    • vshi the verb "shi" (是, "to be")
    • vyou the verb "you" (有, "to have")
    • vf directional verb
    • vx formal (dummy) verb
    • vi intransitive verb (inner verb)
    • vl verbal idiom
    • vg verb morpheme

6. Adjectives (1 first-level class, 4 second-level classes)

    • a adjective
    • ad adverbial adjective
    • an nominal adjective
    • ag adjective morpheme
    • al adjectival idiom

7. Distinguishing words (1 first-level class, 2 second-level classes)

    • b distinguishing word
    • bl distinguishing-word idiom

8. State words (1 first-level class)

    • z state word

9. Pronouns (1 first-level class, 4 second-level classes, 6 third-level classes)

    • r pronoun
    • rr personal pronoun
    • rz demonstrative pronoun
    • rzt temporal demonstrative pronoun
    • rzs locative demonstrative pronoun
    • rzv predicative demonstrative pronoun
    • ry interrogative pronoun
    • ryt temporal interrogative pronoun
    • rys locative interrogative pronoun
    • ryv predicative interrogative pronoun
    • rg pronoun morpheme

10. Numerals (1 first-level class, 1 second-level class)

    • m numeral
    • mq numeral-quantifier compound

11. Quantifiers (1 first-level class, 2 second-level classes)

    • q quantifier
    • qv verbal quantifier
    • qt temporal quantifier

12. Adverbs (1 first-level class)

    • d adverb

13. Prepositions (1 first-level class, 2 second-level classes)

    • p preposition
    • pba the preposition "ba" (把)
    • pbei the preposition "bei" (被)

14. Conjunctions (1 first-level class, 1 second-level class)

    • c conjunction
    • cc coordinating conjunction

15. Particles (1 first-level class, 15 second-level classes)

    • u particle
    • uzhe 着
    • ule 了, 喽
    • uguo 过
    • ude1 的, 底
    • ude2 地
    • ude3 得
    • usuo 所
    • udeng 等, 等等, 云云 ("and so on")
    • uyy 一样, 一般, 似的, 般 ("like, as")
    • udh 的话 ("if")
    • uls 来讲, 来说, 而言, 说来 ("as far as ... is concerned")
    • uzhi 之
    • ulian 连 (as in "even elementary school students ...")

16. Interjections (1 first-level class)

    • e interjection

17. Modal particles (1 first-level class)

    • y modal particle (yg deleted)

18. Onomatopoeia (1 first-level class)

    • o onomatopoeia

19. Prefixes (1 first-level class)

    • h prefix

20. Suffixes (1 first-level class)

    • k suffix

21. Strings (1 first-level class, 2 second-level classes)

    • x string
    • xx non-morpheme character
    • xu URL

22. Punctuation (1 first-level class, 16 second-level classes)

    • w punctuation
    • wkz left bracket, full-width: （ 〔 ［ ｛ 《 【 〖 〈 half-width: ( [ { <
    • wky right bracket, full-width: ） 〕 ］ ｝ 》 】 〗 〉 half-width: ) ] } >
    • wyz left quotation mark, full-width: “ ‘ 『
    • wyy right quotation mark, full-width: ” ’ 』
    • wj period, full-width: 。
    • ww question mark, full-width: ？ half-width: ?
    • wt exclamation mark, full-width: ！ half-width: !
    • wd comma, full-width: ， half-width: ,
    • wf semicolon, full-width: ； half-width: ;
    • wn enumeration comma, full-width: 、
    • wm colon, full-width: ： half-width: :
    • ws ellipsis, full-width: …… …
    • wp dash, full-width: －－ half-width: --- ----
    • wb percent and per-mille signs, full-width: ％ ‰ half-width: %
    • wh unit symbols, full-width: ￥ ＄ ￡ ° ℃ half-width: $
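
Since punctuation is a closed set, its subclass tags can be assigned by plain lookup. The sketch below covers only the sentence-level marks whose full-width and half-width forms are unambiguous; the table and function names are hypothetical, and brackets, quotation marks, dashes and unit symbols are left out on purpose.

```python
# Character -> tag, with full-width and half-width forms written out explicitly.
PUNCT_TAGS = {
    "。": "wj",               # period
    "？": "ww", "?": "ww",    # question mark
    "！": "wt", "!": "wt",    # exclamation mark
    "，": "wd", ",": "wd",    # comma
    "；": "wf", ";": "wf",    # semicolon
    "、": "wn",               # enumeration comma
    "：": "wm", ":": "wm",    # colon
}


def punctuation_tag(char):
    """Return the subclass tag for a listed mark, or None if it is not listed."""
    return PUNCT_TAGS.get(char)


if __name__ == "__main__":
    for ch in "，、。?":
        print(repr(ch), punctuation_tag(ch))
```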
