ICT Chinese Part-of-Speech Tag Set

Source: Internet
Author: User

ICT Chinese Part-of-Speech Tag Set, Version 3.0
Compiled by: Liu Qun, Zhang Huaping, Zhang Hao

0. Description
1. Nouns (1 first-level class, 7 second-level classes, 5 third-level classes)
2. Time words (1 first-level class, 1 second-level class)
3. Place words (1 first-level class)
4. Locality words (1 first-level class)
5. Verbs (1 first-level class, 9 second-level classes)
6. Adjectives (1 first-level class, 4 second-level classes)
7. Distinguishing words (1 first-level class, 2 second-level classes)
8. State words (1 first-level class)
9. Pronouns (1 first-level class, 4 second-level classes, 6 third-level classes)
10. Numerals (1 first-level class, 1 second-level class)
11. Measure words (1 first-level class, 2 second-level classes)
12. Adverbs (1 first-level class)
13. Prepositions (1 first-level class, 2 second-level classes)
14. Conjunctions (1 first-level class, 1 second-level class)
15. Particles (1 first-level class, 15 second-level classes)
16. Interjections (1 first-level class)
17. Modal particles (1 first-level class)
18. Onomatopoeia (1 first-level class)
19. Prefixes (1 first-level class)
20. Suffixes (1 first-level class)
21. Strings (1 first-level class, 2 second-level classes)
22. Punctuation (1 first-level class, 16 second-level classes)

0. Description

The ICT Chinese part-of-speech tag set (99 tags in total: 22 first-level classes, 66 second-level classes, 11 third-level classes) is used mainly in the Chinese lexical analyzer, the syntactic parser, and the Chinese-English machine translation system developed at the Institute of Computing Technology, Chinese Academy of Sciences. It draws mainly on the following part-of-speech tag sets:

1. the tag set of the Peking University People's Daily corpus;
2. the Peking University 2002 new-edition tag set (draft);
3. the tag set of the Tsinghua University Chinese Treebank;
4. the tag set of the Institute of Applied Linguistics, Ministry of Education (draft national recommended standard, 2002 edition);
5. the tag set of the University of Pennsylvania Chinese Treebank (Chinese Penn Treebank).

Because the lexical analyzer's parameters are trained mainly on the Peking University People's Daily corpus, this tag set takes the tag set of that corpus as its blueprint, and also draws on the grammatical information for Chinese words given in Peking University's Chinese Grammatical Information Dictionary.

The following factors were considered in designing the tag set:

1. It should help improve the segmentation and tagging accuracy of the Chinese lexical analyzer.
2. It should help improve the accuracy of the Chinese syntactic parser.
3. It should help the Chinese-English machine translation system translate.
4. It should be easy to convert from the tag set of the Peking University People's Daily corpus.
5. Words with different grammatical functions should be split into subclasses as finely as possible, provided the split does not make ambiguities hard to resolve in lexical and syntactic analysis.

With these considerations in mind, we try to avoid tags that are error-prone during tagging, and adopt tags that are not error-prone yet clearly improve the accuracy of Chinese lexical and syntactic analysis. For example, among the verb subclasses we follow the University of Pennsylvania Chinese Treebank and tag the Chinese verbs 是 ("to be") and 有 ("to have") separately, rather than using a "copula" tag. The verb 是 has many syntactic functions, of which copula is only one, and these functions are very hard to tell apart; trying to distinguish them would lower the accuracy of lexical analysis.
Among the noun subclasses, we distinguish "Han Chinese personal names", "Japanese personal names" and "transliterated personal names", not only because these three kinds of names have to be recognized with differently trained parameters, but also because Chinese-English machine translation has to translate them with different algorithms. As another example, we merge "numeral + 年" denoting a calendar year (such as "1995年") into a single time word, whereas "numeral + 年" denoting a number of years is tagged as a numeral followed by a measure word. We found experimentally that statistical methods can make this distinction with fairly high accuracy during lexical analysis, and the distinction is very important for subsequent syntactic analysis and machine translation.

Some parts of speech (particles and punctuation) are essentially closed sets, and the grammatical functions of the individual words in them differ greatly. In such cases we subdivide into subclasses as far as possible.

In addition, as in other part-of-speech tag sets, a subclass in our scheme is only a special case within its class that needs separate treatment; the subclasses are not meant to partition the class exhaustively.

1. Nouns (1 first-level class, 7 second-level classes, 5 third-level classes)

Nouns are divided into the following subclasses:

n      noun
nr     personal name
nr1    Chinese surname
nr2    Chinese given name
nrj    Japanese personal name
nrf    transliterated personal name
ns     place name
nsf    transliterated place name
nt     organization or group name
nz     other proper noun
nl     nominal idiom
ng     nominal morpheme

2. Time words (1 first-level class, 1 second-level class)

t      time word
tg     time-word morpheme

3. Place words (1 first-level class)

s      place word

4. Locality words (1 first-level class)

f      locality word

5. Verbs (1 first-level class, 9 second-level classes)

v      verb
vd     adverbial verb
vn     nominal verb
vshi   the verb 是 ("to be")
vyou   the verb 有 ("to have")
vf     directional verb
vx     formal verb
vi     intransitive verb
vl     verbal idiom
vg     verbal morpheme

6. Adjectives (1 first-level class, 4 second-level classes)

a      adjective
ad     adverbial adjective
an     nominal adjective
ag     adjectival morpheme
al     adjectival idiom

7. Distinguishing words (1 first-level class, 2 second-level classes)

b      distinguishing word
bl     distinguishing-word idiom

8. State words (1 first-level class)

z      state word

9. Pronouns (1 first-level class, 4 second-level classes, 6 third-level classes)

r      pronoun
rr     personal pronoun
rz     demonstrative pronoun
rzt    temporal demonstrative pronoun
rzs    locative demonstrative pronoun
rzv    predicative demonstrative pronoun
ry     interrogative pronoun
ryt    temporal interrogative pronoun
rys    locative interrogative pronoun
ryv    predicative interrogative pronoun
rg     pronominal morpheme

10. Numerals (1 first-level class, 1 second-level class)

m      numeral
mq     numeral-measure compound

11. Measure words (1 first-level class, 2 second-level classes)

q      measure word
qv     verbal measure word

12. Adverbs (1 first-level class)

d      adverb

13. Prepositions (1 first-level class, 2 second-level classes)

p      preposition
pba    the preposition 把
pbei   the preposition 被

14. Conjunctions (1 first-level class, 1 second-level class)

c      conjunction
cc     coordinating conjunction

15. Particles (1 first-level class, 15 second-level classes)

u      particle
uzhe   着
ule    了 喽
uguo   过
ude1   的 底
ude2   地
ude3   得
usuo   所
udeng  等 等等 云云
uyy    一样 一般 似的 般
udh    的话
uls    来讲 来说 而言 说来
uzhi   之
ulian  连 (as in 连小学生都会, "even elementary school students can")

16. Interjections (1 first-level class)

e      interjection

17. Modal particles (1 first-level class)

y      modal particle (yg deleted)

18. Onomatopoeia (1 first-level class)

o      onomatopoeia

19. Prefixes (1 first-level class)

h      prefix

20. Suffixes (1 first-level class)

k      suffix

21. Strings (1 first-level class, 2 second-level classes)

x      string
xx     non-morpheme character
xu     URL

22. Punctuation (1 first-level class, 16 second-level classes)

w      punctuation
wkz    left bracket; full-width: （ 〔 ［ ｛ 《 【 〖 〈  half-width: ( [ { <
wky    right bracket; full-width: ） 〕 ］ ｝ 》 】 〗 〉  half-width: ) ] } >
wyz    left quotation mark; full-width: “ ‘ 『
wyy    right quotation mark; full-width: ” ’ 』
wj     full stop; full-width: 。
ww     question mark; full-width: ？  half-width: ?
wt     exclamation mark; full-width: ！  half-width: !
wd     comma; full-width: ，  half-width: ,
wf     semicolon; full-width: ；  half-width: ;
wn     enumeration comma; full-width: 、
wm     colon; full-width: ：  half-width: :
ws     ellipsis; full-width: …… …
wp     dash; full-width: —— －－ ——－  half-width: --- ----
wb     percent and per-mille signs; full-width: ％ ‰  half-width: %
wh     unit symbol; full-width: ￥ ＄ ￡ ° ℃  half-width: $
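
The three-level hierarchy in the listing above can be sketched in code, for instance to collapse fine-grained tags to coarser ones before comparing taggers. The sketch below is illustrative only and is not part of the tag set specification. The names FIRST_LEVEL, THIRD_LEVEL_PARENTS, first_level and ancestors are hypothetical, and the English glosses are this page's renderings. It relies on an observation about the listing, not a rule stated by the authors: every tag begins with the one-letter tag of its first-level class.

```python
# Illustrative sketch only; helper names are hypothetical, tag strings
# come from the listing above.

# The 22 first-level classes, keyed by their one-letter tag.
FIRST_LEVEL = {
    "n": "noun", "t": "time word", "s": "place word", "f": "locality word",
    "v": "verb", "a": "adjective", "b": "distinguishing word",
    "z": "state word", "r": "pronoun", "m": "numeral", "q": "measure word",
    "d": "adverb", "p": "preposition", "c": "conjunction", "u": "particle",
    "e": "interjection", "y": "modal particle", "o": "onomatopoeia",
    "h": "prefix", "k": "suffix", "x": "string", "w": "punctuation",
}

# Third-level tags and the second-level tag each one refines.
THIRD_LEVEL_PARENTS = {
    "nr1": "nr", "nr2": "nr", "nrj": "nr", "nrf": "nr",
    "nsf": "ns",
    "rzt": "rz", "rzs": "rz", "rzv": "rz",
    "ryt": "ry", "rys": "ry", "ryv": "ry",
}


def first_level(tag: str) -> str:
    """Return the one-letter first-level class of a tag.

    Assumes (from the listing, not from a stated rule) that the first
    character of every tag names its first-level class.
    """
    tag = tag.lower()  # tolerate upper-case spellings such as "NRJ"
    if not tag or tag[0] not in FIRST_LEVEL:
        raise ValueError(f"not a tag from this tag set: {tag!r}")
    return tag[0]


def ancestors(tag: str) -> list[str]:
    """Return the chain from a tag up to its first-level class."""
    top = first_level(tag)  # also validates the tag
    tag = tag.lower()
    chain = [tag]
    if tag in THIRD_LEVEL_PARENTS:
        chain.append(THIRD_LEVEL_PARENTS[tag])
    if chain[-1] != top:
        chain.append(top)
    return chain


if __name__ == "__main__":
    for t in ("nr1", "vshi", "rzt", "w"):
        print(t, "->", " > ".join(ancestors(t)), f"({FIRST_LEVEL[first_level(t)]})")
```

Collapsing to the first-level class is a lossy operation: vshi and vyou both become v, which discards exactly the distinction the description argues for, so it is better suited to coarse evaluation than to feeding a parser.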

  
