From rules to statistics, and the statistical language model


Source: http://blog.csdn.net/u012637501

I. Natural language processing: from rules to statistics

1. Rule-based natural language processing

In the 1960s, the question facing scientists was how to make machines understand natural language. The common view at the time was that two things had to be done first: parse the sentence syntactically and acquire its semantics. Western linguists had already produced highly formal summaries of their languages; grammatical rules, parts of speech, and word formation are especially important for learning Western languages, and such rules are easy to describe with computer algorithms. This convinced nearly all researchers of the period that rule-based methods were the best way to handle natural language.

However, it turned out that a parser based solely on grammatical rules cannot handle even slightly complex sentences, for two main reasons. First, to cover (correctly describe) even 20% of real sentences, the number of grammar rules (not counting part-of-speech tagging rules) would have to be at least in the tens of thousands. Second, even if one could write a set of rules covering all natural language phenomena, parsing with them on a computer would still be very hard, because real natural language follows a context-sensitive grammar, whereas programming languages are deliberately designed as context-free grammars precisely to make machine decoding easy. By the 1970s, rule-based sentence analysis had run into an even bigger problem: the ambiguity of natural language is difficult to describe with rules and depends heavily on context and common sense. Rule-based natural language processing thus came to a dead end.

2. Statistics-based natural language processing

Around 1970, Frederick Jelinek and his group at IBM's Watson laboratory applied statistical methods to natural language processing (specifically, speech recognition), giving the field new life and laying the groundwork for today's extraordinary achievements. Using statistics-based methods, IBM raised speech recognition accuracy from 70% to 90%, while the scale of recognition grew from a few hundred words to tens of thousands, making it possible for speech recognition to move from the laboratory to real applications. Kai-Fu Lee, then a doctoral student at Carnegie Mellon University, was among the first to shift from rule-based natural language methods to statistical ones. In the 1970s, the core of the statistical approach was a communication-system model combined with the hidden Markov model; the input and output of such a system are both one-dimensional symbol sequences that preserve their original order. Over the past 25 years, with growing computing power and ever larger bodies of statistical data, statistical models have made the processing of complex sentences a reality.
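The communication-system view mentioned above can be written out explicitly. The formulation below is the standard noisy-channel decoding equation; the symbols o (observed signal) and s (candidate sentence) are our own notation, not spelled out in the original post:

```latex
% Noisy-channel view of speech recognition: find the sentence s*
% that maximizes the posterior probability given the observation o.
\[
  s^{*} = \arg\max_{s} P(s \mid o)
        = \arg\max_{s} \frac{P(o \mid s)\,P(s)}{P(o)}
        = \arg\max_{s} P(o \mid s)\,P(s)
\]
% P(o|s) is the channel (acoustic) model -- in the 1970s a hidden
% Markov model -- and P(s) is exactly the statistical language model
% discussed in the next section.
```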
II. The statistical language model

1. Describing the regularities of language mathematically

Natural language is a context-sensitive way of expressing and transmitting information. If we want a computer to process natural language, the key problem is to build a mathematical model of this context-sensitive property; such a model is called a statistical language model. Jelinek described it as follows: whether a sentence is reasonable depends on how likely it is to occur, and that likelihood is measured by probability. The mathematical description is as follows. Suppose S denotes a meaningful sentence, composed of a sequence of words w1, w2, ..., wn, where n is the length of the sentence. We want to know the likelihood that S appears in text, mathematically the probability P(S). We obviously cannot compute P(S) by counting every sentence humans have ever uttered; instead we estimate it with a model. Since S = w1, w2, ..., wn,

P(S) = P(w1, w2, ..., wn).

By the chain rule of conditional probability, the probability of S is the product of the conditional probabilities of each word given the words before it:

P(w1, w2, ..., wn) = P(w1) · P(w2|w1) · P(w3|w1,w2) · ... · P(wn|w1,w2,...,wn-1),

where P(w1) is the probability that the first word w1 appears, P(w2|w1) is the probability of the second word given that the first is known, and so on.

2. The Markov assumption

(1) The Markov assumption. The model above is very hard to compute; in particular, the last factor P(wn|w1,...,wn-1) conditions on so many possible histories that it cannot be estimated. Around the turn of the 20th century, the Russian mathematician Andrey Markov offered a lazy but effective way out: whenever this situation arises, assume that the probability of any word wi depends only on the immediately preceding word wi-1. This is the famous Markov assumption. The model then becomes

P(w1, w2, ..., wn) = P(w1) · P(w2|w1) · P(w3|w2) · ... · P(wi|wi-1) · ... · P(wn|wn-1).

(2) The bigram model. The bigram model is the statistical language model in which the probability of any word wi depends only on the preceding word wi-1:

P(w1, w2, ..., wn) = P(w1) · P(w2|w1) · P(w3|w2) · ... · P(wi|wi-1) · ... · P(wn|wn-1).

The conditional probability P(wi|wi-1) is estimated from its definition:

P(wi|wi-1) = P(wi-1, wi) / P(wi-1) ≈ #(wi-1, wi) / #(wi-1).

Note: P(wi-1, wi) is the joint probability of the adjacent pair and P(wi-1) is a marginal probability. In a corpus of size #, they can be approximated by the relative frequencies #(wi-1, wi)/# and #(wi-1)/#, where #(wi-1, wi) is the number of times the pair wi-1 wi occurs consecutively in the corpus and #(wi-1) is the number of times wi-1 occurs by itself; the corpus size # cancels in the ratio.

(3) The N-gram model. In the N-gram model of a statistical language, each word is determined by the N-1 words preceding it:

P(wi|w1, w2, ..., wi-1) ≈ P(wi|wi-N+1, ..., wi-1).

This assumption is also called the (N-1)-order Markov assumption, and the corresponding language model is called the N-gram model.
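To make the bigram estimate concrete, here is a minimal sketch in Python. The toy corpus, the whitespace tokenization, and the function names are illustrative assumptions, not from the original post:

```python
from collections import Counter

def train_bigram(corpus_sentences):
    """Count unigrams and adjacent word pairs over a tokenized corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        words = sentence.split()  # assumes whitespace-tokenized text
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams, total_words):
    """P(S) = P(w1) * product of P(wi | wi-1), under the Markov assumption."""
    words = sentence.split()
    # P(w1): relative frequency of the first word, #(w1) / corpus size
    prob = unigrams[words[0]] / total_words
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        # P(wi | wi-1) = #(wi-1, wi) / #(wi-1)
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

corpus = ["the cat sat", "the cat ran", "the dog sat"]
uni, bi = train_bigram(corpus)
total = sum(uni.values())
print(sentence_probability("the cat sat", uni, bi, total))  # 3/9 * 2/3 * 1/2
```

Note that a product of many small probabilities underflows floating point quickly, so real implementations sum log-probabilities instead of multiplying raw ones.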
When N=2 this is exactly the bigram model above; the N=1 unigram model is in fact a context-independent model. What is used most in practice is the trigram model (N=3); higher orders are rarely used. The reason is that the size of an N-gram model (its space complexity) is nearly an exponential function of N, namely O(|V|^N), where |V| is the vocabulary size of the language, usually between tens of thousands and hundreds of thousands of words; and the speed of using an N-gram model (its time complexity) is also nearly exponential, O(|V|^(N-1)). On the other hand, while N is still small, increasing it (say from 1 to 2, or from 2 to 3) improves the model markedly; but going from 3 to 4 brings little further improvement, while the resource cost grows very quickly.

3. Training a statistical language model

(1) Model training. To use a language model we need to know all the conditional probabilities in it, which we call the parameters of the model. The process of obtaining these parameters by counting over a corpus is called model training. For example, in the bigram model P(w1, w2, ..., wn) = P(w1) · P(w2|w1) · P(w3|w2) · ... · P(wn|wn-1), the conditional probabilities P(w1), P(w2|w1), P(w3|w2), ..., P(wn|wn-1) are the model's parameters.

(2) Factors affecting the training result. The amount of data in the corpus matters. When the corpus used to train the language model is too small, the estimated conditional probabilities of some words deviate badly, for example P(wi|wi-1) = #(wi-1, wi)/#(wi-1) = 0/1, which produces the zero-probability problem. Other things being equal, the larger the training corpus, the better the model.

(3) Selection of the corpus. The training data, that is, the corpus, is the key to model training. First, the corpus must be large enough; second, the domain of the training corpus must not be divorced from the domain in which the model is applied. For example, if a language model is built for web search, its training data should be messy web-page text and the search strings users actually type, rather than polished, normative press releases, even though the former are mixed with noise and errors. Of course, once the training data match the application and are plentiful, the noise level of the corpus does affect the quality of the model, so the training data should be preprocessed before training; when the cost is low, it is worth filtering the training data.
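The zero-probability problem described in (2) is easy to reproduce with the counting estimator above. The sketch below shows it, and applies add-one (Laplace) smoothing as one common remedy; the smoothing step is an illustrative addition and is not discussed in the original post:

```python
from collections import Counter

def bigram_prob(prev, cur, unigrams, bigrams, vocab_size, smooth=False):
    """P(cur | prev) from raw counts, optionally with add-one (Laplace) smoothing."""
    pair, seen = bigrams[(prev, cur)], unigrams[prev]
    if smooth:
        # Add-one smoothing: pretend every possible bigram occurred once more,
        # so unseen pairs get a small nonzero probability.
        return (pair + 1) / (seen + vocab_size)
    return pair / seen if seen else 0.0

corpus = ["the cat sat", "the cat ran", "the dog sat"]
uni, bi = Counter(), Counter()
for s in corpus:
    w = s.split()
    uni.update(w)
    bi.update(zip(w, w[1:]))

# "dog ran" never occurs in this tiny corpus, so the raw estimate is 0,
# which would zero out the probability of any sentence containing it.
print(bigram_prob("dog", "ran", uni, bi, len(uni)))               # 0.0
print(bigram_prob("dog", "ran", uni, bi, len(uni), smooth=True))  # 1/6, small but nonzero
```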
