Natural Language Processing, Lecture 2: Word Counting


Today's main content: 1. Corpora and their properties; 2. Zipf's law; 3. Annotated corpus examples; 4. A word segmentation algorithm.

One, Corpora and their properties:

a) What is a corpus (plural corpora)?
i. A corpus is a collection of naturally occurring language texts, stored in machine-readable form.
ii. A balanced corpus tries to be representative of a language or of some other domain.

b) Translator's note: the characteristics of, and differences between, parallel corpora and balanced corpora.
i. A parallel corpus is usually composed of bilingual or multilingual texts in correspondence, often translations of one another, e.g. the Babel English-Chinese Parallel Corpus. Parallel corpora are mainly used in contrastive and translation studies.
ii. A balanced corpus is one whose samples are balanced and representative, so that it can support general conclusions about the properties of a language, e.g. the Lancaster Corpus of Mandarin Chinese and the Academia Sinica Balanced Corpus of Modern Chinese.

c) Word counts:
i. What are the most common words in a text?
ii. How many words are there in a text?
iii. How are words distributed in a large corpus?

d) Take Mark Twain's The Adventures of Tom Sawyer as an example:

    Word   Freq.   Use
    the    3332    determiner (article)
    and    2972    conjunction
    a      1775    determiner
    to     1725    preposition, infinitive marker
    of     1440    preposition
    is     1161    auxiliary verb
    it     1027    pronoun
    in      906    preposition
    that    877    complementizer
    Tom     678    proper name

i. Some observations:
1. Function words account for most of the top of the list.
2. Corpus-dependent content words also rank high, e.g. "Tom".
ii. Food for thought: is it possible to build a truly "representative" corpus of English?

e) How many words are in this sentence: "They picnicked by the pool, then lay back on the grass and looked at the stars."
i. Type: the number of distinct words in a corpus (the vocabulary size).
ii. Token: the total number of running words in a corpus.
iii. Note: the definitions above follow the textbook Natural Language Processing.
iv. For The Adventures of Tom Sawyer:
1. Word types: 8,018
2. Word tokens: 71,370
3. Average frequency: about 9 (tokens divided by types)

f) Frequencies of frequencies, i.e. how many word types occur with a given frequency:

    Frequency   Frequency of frequency
    1           3993
    2           1292
    3            664
    4            410
    5            243
    6            199
    7            172
    8            131
    9             82
    10            91
    11-50        540
    51-100        99

Most words in a corpus appear only once!
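To make the type/token distinction and the frequencies-of-frequencies table concrete, here is a minimal Python sketch (not from the original lecture) that computes all three statistics for a plain-text file; the filename tom_sawyer.txt and the deliberately crude tokenizer are illustrative assumptions, and exact counts will vary with tokenization choices.

    # Count word tokens, word types, and frequencies of frequencies.
    # Assumes a plain-text file "tom_sawyer.txt" (hypothetical path).
    import re
    from collections import Counter

    with open("tom_sawyer.txt", encoding="utf-8") as f:
        text = f.read().lower()

    # Kucera-and-Francis-style words: runs of alphanumerics, optionally
    # with internal hyphens or apostrophes.
    tokens = re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text)

    freq = Counter(tokens)                 # word -> frequency
    n_tokens = sum(freq.values())          # word tokens
    n_types = len(freq)                    # word types
    print(f"tokens: {n_tokens}, types: {n_types}, "
          f"average frequency: {n_tokens / n_types:.1f}")

    # Frequencies of frequencies: how many types occur exactly k times?
    freq_of_freq = Counter(freq.values())
    for k in sorted(freq_of_freq)[:10]:
        print(k, freq_of_freq[k])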
Two, Zipf's Law:

a) In any natural language, the frequency of the nth most common word is approximately inversely proportional to n.

b) That is, Zipf's law states the relationship between frequency f and rank r as: f ∝ 1/r.

c) Equivalently, there is a constant k such that: f · r = k.

d) Zipf's law in Tom Sawyer:

    Word    Freq. (f)   Rank (r)   f · r
    the        3332         1       3332
    and        2972         2       5944
    a          1775         3       5325
    he          877        10       8770
    but         410        20       8200
    be          294        30       8820
    there       222        40       8880
    one         172        50       8600
    about       158        60       9480
    never       124        80       9920
    Oh          116        90      10440

e) Translator's note: a supplementary explanation of Zipf's law, from Wikipedia.
i. Fundamentally, Zipf's law says that in a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table. The most frequent word therefore occurs about twice as often as the second most frequent word, which in turn occurs about twice as often as the fourth most frequent, and so on. The law serves as a reference point for anything related to power-law probability distributions. It was published by George Kingsley Zipf, a linguist at Harvard University.
ii. For example, in the Brown Corpus, "the" is the most common word, accounting for about 7% of all word occurrences (69,971 occurrences in slightly over 1 million words). True to Zipf's law, the second-ranked word "of" accounts for about 3.5% of the corpus (36,411 occurrences), followed by "and" (28,852 occurrences). Only 135 word types are needed to account for half of the Brown Corpus.
iii. Zipf's law is an empirical law, not a theoretical one. Zipf-like distributions are observed in many phenomena, and the cause of Zipf distributions in real life is a matter of dispute. The law is easy to visualize in a scatter plot with coordinates log(rank) and log(frequency); the word "the", for example, appears at the point x = log(1), y = log(69,971). If all the points lie close to a single straight line, the data follow Zipf's law. The simplest case of Zipf's law is a "1/f function": given a set of frequencies sorted from most to least common, the second most common frequency occurs 1/2 as often as the most common one, the third most common 1/3 as often, and the nth most common 1/n as often. This cannot hold exactly, since counts must be integers and a word cannot occur 2.5 times; but over a wide range, and to a fair approximation, many natural phenomena obey Zipf's law.
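The f · r ≈ k pattern in the table above is easy to check empirically. The sketch below (again an illustration, not part of the lecture) ranks words by frequency and prints f, r, their product, and the log-log coordinates mentioned in the translator's note; it reuses the tokens list from the previous sketch.

    # Check Zipf's law: for words ranked by frequency, f * r should be
    # roughly constant, and log(f) vs log(r) roughly a straight line.
    # Assumes the corpus has at least max(ranks) distinct word types.
    import math
    from collections import Counter

    def zipf_table(freq, ranks=(1, 2, 3, 10, 20, 30, 40, 50, 60, 80, 90)):
        ordered = freq.most_common()          # [(word, f)], highest f first
        for r in ranks:
            word, f = ordered[r - 1]
            print(f"{word:>10}  f={f:6d}  r={r:3d}  f*r={f * r:6d}  "
                  f"(log r, log f) = ({math.log(r):.2f}, {math.log(f):.2f})")

    # Usage: zipf_table(Counter(tokens))   # tokens from the previous sketch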
f) Zipf's law and the Principle of Least Effort:
i. Human behavior and the principle of least effort:
1. "... Zipf argues that he found a unifying principle, the Principle of Least Effort, which underlies essentially the entire human condition (the book even includes some questionable remarks on human sexuality!). The principle argues that people would act so as to minimize their probable average rate of work." (Manning & Schütze, p. 23)
ii. Translator's note: "Zipf and the Principle of Least Effort" by Professor Jiang Wangqi of Peking University is well worth reading; some excerpts follow:
1. The Principle of Least Effort, also known as the Economy Principle, can be summed up as: obtain the greatest benefit at the lowest cost. It is a fundamental principle guiding human behavior. In modern scholarship, the first to state this principle explicitly was the American scholar George Kingsley Zipf.
2. George Kingsley Zipf was born in January 1902 to a family of German descent (his grandfather emigrated to the United States in the middle of the 19th century). In 1924 he graduated with honors from Harvard College. In 1925 he studied in Bonn and Berlin, Germany. In 1929 he completed "Relative Frequency as a Determinant of Phonetic Change" and received a PhD in comparative philology from Harvard, after which he began teaching German there. In 1931 he married Joyce Waters Brown. In 1932 he published Selected Studies of the Principle of Relative Frequency in Language. In 1935 he published The Psycho-Biology of Language: An Introduction to Dynamic Philology. In 1939 he was appointed university lecturer. In 1949 he published Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. He died of cancer in September 1950.
3. In the 1949 book, Zipf presented a basic principle for guiding human behavior: the principle of least effort. In the preface, Zipf argues that if we look at human behavior purely as a natural phenomenon, studying it the way we study the social behavior of bees or the nesting habits of birds, it may be possible to reveal the underlying principles behind it. This is the background of his principle of least effort. When Zipf found Zipf-like regularities in many seemingly unrelated phenomena, he began to think about their cause; this was the direct trigger for proposing the principle. Before the formal argument, Zipf first clarified the literal meaning of "least effort". First, it is an average quantity: a person experiences many things in a lifetime, and the effort spent on one thing affects, and is affected by, the effort spent on other things. Second, it is a probability: a person can rarely be certain in advance that some method will cost him less effort; he can only make a rough estimate. Because the study of words is the key to understanding speech, and speech in turn is the key to understanding the whole of human ecology, Zipf's concrete argument begins with the economy of words. Word economy, he argues, can be discussed from two angles: the speaker's and the listener's. From the speaker's point of view, it would be most economical to express all meanings with a single word: the speaker would not need to spend effort mastering more words, nor to consider how to choose the right word from many. Such a "single-word vocabulary" is like an all-in-one woodworking tool, saw, drill, and hammer combined, serving many purposes. From the listener's point of view, however, a single-word vocabulary would be the most laborious: it would be almost impossible to decide what the word means on any particular occasion. For the hearer, the most economical arrangement is for each word to have exactly one meaning, with form and meaning in one-to-one correspondence. These two economies conflict with each other. Zipf calls them two opposing forces in the stream of speech: the force of unification and the force of diversification. He believes real economy is achieved only when the two forces reach a compromise and strike a balance. The facts bear this out. In Zipf's argument, if only the force of unification operated, any discourse would contain exactly one word type, occurring with a frequency of 100%; if only the force of diversification operated, each word would occur roughly once, and the number of word types would be determined by the length of the text. This means that the number of word types and their frequencies are the two parameters that measure the degree of lexical balance.
g) Other laws: word sense distributions; phoneme distributions; word co-occurrence patterns.

h) Examples of collections approximately obeying Zipf's law:
i. Frequency of accesses to web pages;
ii. Sizes of settlements;
iii. Income distribution amongst individuals;
iv. Magnitudes of earthquakes;
v. Notes in musical performances.

Three, Corpus-related topics:

a) The data sparsity problem:
i. How many times does "kick" occur in 1 million words? —
ii. How many times does "kick a ball" occur in 1 million words? — 0
iii. How many times does "kick" occur on the Web? — 6 million
iv. How many times does "kick a ball" occur on the Web? — 8,000
v. You can never have too much data!

b) Very, very large data:
i. Brill & Banko 2001: on a confusion-set disambiguation task, simply increasing the training data size gives far better results than the best system trained on the standard training corpus.
1. Task: disambiguating word pairs such as {too, to}.
2. Training sizes: from 1 million to 1 billion words.
3. Learning algorithms compared: Winnow, perceptron, decision tree.
ii. Lapata & Keller 2002, 2003: the Web can be used as a very, very large corpus.
1. The counts can be noisy, but for some tasks this is not an issue.

c) The Brown Corpus:
i. A famous early corpus, made by Nelson Francis and Henry Kucera at Brown University in the 1960s.
1. A balanced corpus of written American English, covering newspapers, novels, non-fiction, and academic genres.
2. 1 million words, 500 written texts.
3. Do you think this is a large corpus?
ii. Translator's note: a more detailed introduction to the Brown Corpus:
1. In the 1960s, Francis and Kucera built the world's first standard corpus, the Brown Corpus, at Brown University in the United States, collecting samples according to systematic principles.
2. Its main goal was the study of contemporary American English.
3. The texts were collected on a synchronic principle: only ordinary written prose published by Americans in the year 1961.
4. The corpus totals 1 million words, divided into 15 genres with 500 samples in all, each sample no shorter than 2,000 words.
5. The TAGGIT system: 81 part-of-speech tags, with an accuracy of 77%.
6. The genre categories run from A to R, 15 in total; A-J are informative genres and K-R imaginative genres, e.g. A Press: reportage; B Press: editorials; ...
7. Samples were obtained by random sampling: first, texts were selected at random within each genre according to the number of samples assigned to it; then fragments of no less than 2,000 words were extracted at random from the selected texts, keeping the final sentence complete.
8. Versions: A, B, C, Bergen I, Bergen II, Brown MARC.
9. From its overall size to the distribution and sampling of its texts, the Brown Corpus was carefully designed, and it is generally regarded as a balanced corpus that reflects the general character of the language.
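The Brown Corpus itself is easy to explore, since it ships with NLTK. A minimal sketch, assuming nltk is installed and the brown data package has been downloaded once via nltk.download("brown"):

    # Browse the Brown Corpus with NLTK (assumes: pip install nltk and a
    # one-time nltk.download("brown")).
    from nltk.corpus import brown

    print(brown.categories())                   # the 15 genre categories
    print(len(brown.words()))                   # roughly 1.16 million tokens
    print(brown.words(categories="news")[:10])  # first words of the press genre
    print(brown.tagged_words()[:5])             # (word, POS tag) pairs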
d) Some recent corpora:

    Corpus                    Size          Domain     Language
    NA News Corpus            — million     newswire   American English
    British National Corpus   100 million   balanced   British English
    EU Proceedings            — million     legal      language pairs
    Penn Treebank             2 million     newswire   American English
    Broadcast News            —             spoken     7 languages
    Switchboard               2.4 million   spoken     American English

ii. For more corpora, check the Linguistic Data Consortium: http://www.ldc.upenn.edu/

e) Corpus content:
i. Genre: newswire, novels, broadcast, spontaneous conversations.
ii. Media: text, audio, video.
iii. Annotations: tokenization, syntactic trees, semantic senses, translations.

f) An annotation example: part-of-speech (POS) tagging.
i. POS tags encode simple grammatical functions.
ii. Several tag sets are in use:
1. Penn tag set (45 tags)
2. Brown tag set (87 tags)
3. CLAWS2 tag set (166 tags)
iii. Example:

    Category               Example          CLAWS c5   Brown   Penn
    adjective              happy, bad       AJ0        JJ      JJ
    adverb                 often, badly     AV0        RB      RB
    noun singular          table, rose      NN1        NN      NN
    noun plural            tables, roses    NN2        NNS     NNS
    noun proper singular   Boston, Leslie   NP0        NP      NNP

g) Issues in annotation:
i. Different annotation schemes for the same task are common.
ii. In some cases there is a direct mapping between schemes; in other cases they do not exhibit any regular relation.
iii. The choice of annotation is motivated by linguistic, computational, and/or task requirements.

Four, Tokenization and word segmentation:

a) Tokenization:
i. Goal: divide the text into a sequence of words.
ii. A word is a string of contiguous alphanumeric characters with a space on either side; it may include hyphens and apostrophes, but no other punctuation marks (Kucera and Francis).
iii. Is tokenization easy?

b) What is a word?
i. English:
1. "Wash." vs. "wash"
2. "won't", "John's"
3. "pro-Arab", "the idea of a child-as-required-yuppie-possession must be motivating them", "85-year-old grandmother"
ii. East Asian languages:
1. No spaces between words.
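A minimal sketch of a tokenizer implementing the Kucera-and-Francis-style definition in a) above (the regular expression is my own illustration, not from the lecture); the hard cases in b) show where such a rule breaks down:

    # Naive English tokenizer: a word is a maximal run of alphanumeric
    # characters, allowing internal hyphens and apostrophes, but no other
    # punctuation (after Kucera and Francis).
    import re

    WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

    def tokenize(text):
        return WORD.findall(text)

    print(tokenize("They picnicked by the pool, then lay back on the grass."))
    print(tokenize("won't, John's, pro-Arab, an 85-year-old grandmother"))
    # Failure mode from b): the period of the abbreviation "Wash." is
    # dropped, so it can no longer be told apart from the verb "wash"
    # once case is ignored.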
c) Word segmentation:
i. Rule-based approach: morphological analysis based on a dictionary and grammatical knowledge.
ii. Corpus-based approach: learn the segmentation from corpora.
iii. Issues to consider: coverage, ambiguity, accuracy.

d) Motivation for statistical segmentation:
i. The unknown-words problem: the presence of domain terms and proper names.
ii. Grammatical constraints may not be sufficient; for example, alternative segmentations of noun phrases may all be grammatical.
iii. Example 1:
1. Segmentation: sha-choh/ken/gyoh-mu/bu-choh
2. Translation: "president/and/business/general manager"
iv. Example 2:
1. Segmentation: sha-choh/ken-gyoh/mu/bu-choh
2. Translation: "president/subsidiary business/Tsutomu [a name]/general manager"

e) A segmentation algorithm:
i. Core idea: for each candidate boundary, compare the frequency of the n-gram sequences adjacent to the boundary with the frequency of the n-gram sequences that straddle it.
ii. Note: because formulas cannot be typeset here, please see lec02.pdf for the precise algorithm; a rough sketch of the idea follows below.
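As a stand-in for the omitted formula, here is an illustrative Python sketch in the spirit of Ando & Lee's mostly-unsupervised segmentation of Japanese (which the experimental setup below appears to follow); it is a reconstruction under my own simplifying assumptions, not the exact algorithm of lec02.pdf:

    # Boundary-comparison segmentation sketch (an approximation of the idea
    # in e) above, not the exact lec02.pdf algorithm). A gap between two
    # characters is a likely word boundary when the character n-grams lying
    # entirely on one side of it are more frequent than those straddling it.
    from collections import Counter

    def segment(text, corpus, n=2, threshold=0.5):
        # Character n-gram counts from a large raw (unsegmented) corpus.
        counts = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))

        def boundary_score(k):
            left = text[k - n:k]                  # n-gram ending at the gap
            right = text[k:k + n]                 # n-gram starting at the gap
            straddling = [text[k - j:k - j + n] for j in range(1, n)]
            votes = [counts[s] > counts[t]
                     for s in (left, right) for t in straddling]
            return sum(votes) / len(votes)        # fraction of "boundary" votes

        words, start = [], 0
        for k in range(n, len(text) - n + 1):
            if boundary_score(k) >= threshold:
                words.append(text[start:k])
                start = k
        words.append(text[start:])
        return words

With a sufficiently large raw corpus, frequent n-grams tend to stop at true word boundaries, so non-straddling n-grams out-count straddling ones exactly where words end.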
f) Experimental framework:
i. Corpus: 150 megabytes of 1993 Nikkei newswire.
ii. Manual annotations: 50 sequences for the development set (parameter tuning) and 50 sequences for the test set.
iii. Baseline algorithms: the Chasen and Juman morphological analyzers.

g) Evaluation measures:
i. TP (true positive): a positive instance predicted as positive by the model.
ii. FP (false positive): a negative instance predicted as positive by the model.
iii. TN (true negative): a negative instance predicted as negative by the model.
iv. FN (false negative): a positive instance predicted as negative by the model.
v. Precision, the proportion of selected items that the system got right: P = TP / (TP + FP).
vi. Recall, the proportion of target items that the system selected: R = TP / (TP + FN).
vii. F-measure: F = 2PR / (P + R).
viii. Word precision (P) is the percentage of proposed brackets that match word-level brackets in the annotation.
ix. Word recall (R) is the percentage of word-level brackets in the annotation that are proposed by the algorithm.
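These definitions translate directly into code; a minimal sketch, where the boundary sets in the example are made-up numbers:

    # Precision, recall, and F-measure over word-boundary positions.
    def prf(gold, pred):
        tp = len(gold & pred)                 # boundaries correctly proposed
        fp = len(pred - gold)                 # proposed but wrong
        fn = len(gold - pred)                 # missed
        p = tp / (tp + fp) if pred else 0.0
        r = tp / (tp + fn) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Hypothetical example: gold vs. predicted boundary positions.
    print(prf({2, 5, 9, 14}, {2, 5, 8, 14}))  # -> (0.75, 0.75, 0.75)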
Five, Conclusions:
a) Corpora are widely used in text processing.
b) The corpora used are either annotated or raw.
c) Zipf's law has a deep connection to natural language.
d) Sparsity is a major problem for corpus processing methods.
