The basic concepts of language models


Original address: http://blog.csdn.net/mspinyin/article/details/6137815#t12

Natural language processing is currently a very active research direction, driven largely by the growth of the Internet. The Internet is flooded with information, most of it text, and processing this information is inseparable from natural language processing technology. So what exactly are natural language and natural language processing?

Basic tasks of natural language processing

Natural language is simply human language, and natural language processing (NLP) is the processing of human language, mainly by computer. Natural language processing is an interdisciplinary field spanning computer science and linguistics; common research tasks include:

· Word segmentation (word breaking, WB)

· Information extraction (IE): named entity recognition and relation extraction (NER & RE)

· Part-of-speech tagging (POS)

· Coreference resolution

· Syntactic analysis (parsing)

· Word sense disambiguation (WSD)

· Speech recognition

· Speech synthesis (text-to-speech, TTS)

· Machine translation (MT)

· Automatic summarization

· Question answering systems (QA)

· Natural language understanding (NLU)

· Optical character recognition (OCR)

· Information retrieval (IR)

Early natural language processing systems were mainly based on hand-written rules, which are time-consuming to write and can hardly cover the full variety of linguistic phenomena. In the late 1980s, machine learning algorithms were introduced into natural language processing, thanks to steadily increasing computing power. Research shifted to statistical models, which automatically learn model parameters from a large-scale training corpus; compared with the earlier rule-based approach, this method is more robust.

Statistical language models

Statistical language models arose in this context. They are widely used in all kinds of natural language processing problems, such as speech recognition, machine translation, word segmentation, POS tagging, and so on. Simply put, a language model is a model for computing the probability of a sentence, i.e. P(S).

Using a language model, you can determine which word sequence is more likely, or given several words, to predict the next most likely word. For example, the input pinyin string for Nixianzaiganshenme, the corresponding output can have a variety of forms, such as what you do now , you XI ' an , and so on, So which is the right conversion result, using the language model, we know that the former probability is greater than the latter, so it is more reasonable to convert the former in most cases. Another example of machine translation, given a Chinese sentence for Li Ming is watching TV at home , can be translated as Li Ming is watching TV at home, Li Ming at home iswatching tv, and so on, also according to the language model, we know the probability of the former is greater than the latter, so translation into the former more reasonable.

So how do we calculate the probability of a sentence? Given a sentence (word sequence)

S = w_1 w_2 \ldots w_m,

its probability can be expressed as:

P(S) = P(w_1 w_2 \ldots w_m) = P(w_1) \, P(w_2 \mid w_1) \cdots P(w_m \mid w_1 \ldots w_{m-1})    (1)
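For example, for the sentence "Li Ming is watching TV", formula (1) expands by the chain rule as:

P(\text{Li Ming is watching TV}) = P(\text{Li}) \, P(\text{Ming} \mid \text{Li}) \, P(\text{is} \mid \text{Li Ming}) \, P(\text{watching} \mid \text{Li Ming is}) \, P(\text{TV} \mid \text{Li Ming is watching})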

Because the formula above has too many parameters, an approximate calculation method is required. Common methods include the n-gram model, decision trees, the maximum entropy model, the maximum entropy Markov model, conditional random fields, neural networks, and so on.

N-gram language models

The concept of the n-gram model

The n-gram model, also known as the (n-1)-order Markov model, makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words. Formula (1) can therefore be approximated as:

P(S) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1} \ldots w_{i-1})    (2)

When n is 1, 2, or 3, the n-gram model is called a unigram, bigram, or trigram language model, respectively. The parameters of the n-gram model are the conditional probabilities P(w_i \mid w_{i-n+1} \ldots w_{i-1}).

Assuming a vocabulary size of 100,000, the number of parameters of the n-gram model is 100,000^n.
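Concretely, that works out to 100,000 parameters for a unigram model (n = 1), 100,000^2 = 10^10 for a bigram model, and 100,000^3 = 10^15 for a trigram model, which is why the full table of probabilities can never be stored explicitly for large n.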

The larger n is, the more accurate but also the more complex the model, and the greater the amount of computation required. The most commonly used is the bigram, followed by the unigram and trigram; n ≥ 4 is rarely used.

Parameter estimation for the n-gram model

Parameter estimation for the model is also called model training. Maximum likelihood estimation (MLE) is used to estimate the parameters of the model:

P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_i)}{C(w_{i-n+1} \ldots w_{i-1})}    (3)

Here C(x) denotes the number of times x occurs in the training corpus; the larger the training corpus, the more reliable the parameter estimates. However, even when the training data is very large, say several gigabytes, many linguistic phenomena will still never appear in it, leaving many parameters (the probabilities of certain n-grams) at 0. For example, Brown et al. at IBM trained a trigram model on 366M of English text, and 14.7% of the trigrams and 2.2% of the bigrams in the test corpus had never appeared in training. In another reported experiment, a bigram model trained on 5 million words of People's Daily text and tested on 1.5 million words of People's Daily text found that 23.12% of the test bigrams never appeared in training.
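As a rough illustration (a minimal sketch, not code from the original article), here is how the bigram counts in formula (3) could be collected in Python; the toy corpus and the function name are made up for the example:

    from collections import defaultdict

    def train_bigram_mle(sentences):
        # Estimate P(w_i | w_{i-1}) by maximum likelihood, formula (3).
        unigram_counts = defaultdict(int)   # C(w_{i-1})
        bigram_counts = defaultdict(int)    # C(w_{i-1} w_i)
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            for prev, cur in zip(words, words[1:]):
                unigram_counts[prev] += 1
                bigram_counts[(prev, cur)] += 1
        return {bg: c / unigram_counts[bg[0]] for bg, c in bigram_counts.items()}

    # Toy corpus; a real training corpus would be millions of words.
    probs = train_bigram_mle(["li ming is watching tv", "li ming is at home"])
    print(probs[("li", "ming")])   # 1.0, since "ming" always follows "li" here

Bigrams never seen in training are simply absent from the resulting dictionary, which is exactly the zero-probability problem discussed next.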

This problem is known as data sparseness, and it can be alleviated by data smoothing techniques.

Data smoothing for the n-gram model

Data smoothing assigns non-zero probabilities to n-grams whose frequency is 0. Typical smoothing algorithms include additive smoothing, Good-Turing smoothing, Katz smoothing, interpolation smoothing, and so on.

· Additive smoothing

The basic idea is to avoid the zero-probability problem by adding a constant δ (0 < δ ≤ 1) to the count of every n-gram:

P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{\delta + C(w_{i-n+1} \ldots w_i)}{\delta |V| + C(w_{i-n+1} \ldots w_{i-1})}    (4)

where |V| is the vocabulary size.
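A minimal sketch of formula (4) for the bigram case, reusing the count dictionaries from the example above (the function name is made up):

    def additive_prob(bigram_counts, unigram_counts, vocab_size, prev, cur, delta=1.0):
        # Add-delta smoothing, formula (4): every bigram count is inflated by
        # delta, so even unseen bigrams receive a small non-zero probability.
        numerator = delta + bigram_counts.get((prev, cur), 0)
        denominator = delta * vocab_size + unigram_counts.get(prev, 0)
        return numerator / denominator

With δ = 1 this is the classical Laplace (add-one) smoothing.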

· Good-Turing smoothing

The idea is to smooth the raw frequencies using frequency-of-frequency information, replacing each count c with an adjusted count c*:

c^* = (c + 1) \frac{N(c+1)}{N(c)}    (5)

where N(c) denotes the number of n-grams that occur exactly c times in the training corpus.
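A minimal sketch of formula (5) in Python (the function name is made up; a production implementation such as Simple Good-Turing would additionally smooth the N(c) values themselves and reallocate the freed probability mass to unseen n-grams):

    from collections import Counter

    def good_turing_counts(ngram_counts):
        # Replace each raw count c with c* = (c + 1) * N(c+1) / N(c), formula (5).
        freq_of_freq = Counter(ngram_counts.values())   # N(c)
        adjusted = {}
        for ngram, c in ngram_counts.items():
            if freq_of_freq.get(c + 1, 0) > 0:
                adjusted[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
            else:
                adjusted[ngram] = c   # no N(c+1) evidence: keep the raw count
        return adjusted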

· Linear interpolation smoothing

This data smoothing technique interpolates the higher-order n-gram model linearly with lower-order n-gram models. The lower-order models can often provide useful information when there is not enough data to estimate the probabilities of the higher-order model. For a trigram model, for example:

P_{\text{interp}}(w_i \mid w_{i-2} w_{i-1}) = \lambda_3 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i)    (6)

where the weights satisfy λ1 + λ2 + λ3 = 1 and can be estimated with the EM algorithm.
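A minimal sketch of formula (6), with the weights fixed by hand rather than estimated by EM (the function name and weight values are made up for the example):

    def interpolated_prob(trigram_p, bigram_p, unigram_p, lambdas=(0.6, 0.3, 0.1)):
        # Linear interpolation, formula (6): mix the trigram, bigram and
        # unigram estimates; the weights must sum to 1.
        l3, l2, l1 = lambdas
        return l3 * trigram_p + l2 * bigram_p + l1 * unigram_p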

· Katz smoothing

Also known as back-off smoothing. Its basic idea: when an n-gram's count is large enough, its probability is estimated by maximum likelihood; when its count is small but non-zero, the count is smoothed by the Good-Turing estimate, discounting part of the probability mass and reserving it for unseen n-grams; and when the n-gram's count is 0, the model backs off to the lower-order model. For a bigram model, for example:

P_{\text{katz}}(w_i \mid w_{i-1}) =
\begin{cases}
C(w_{i-1} w_i) / C(w_{i-1}) & \text{if } C(w_{i-1} w_i) > k \\
d \cdot C(w_{i-1} w_i) / C(w_{i-1}) & \text{if } 0 < C(w_{i-1} w_i) \le k \\
\alpha(w_{i-1}) \, P(w_i) & \text{if } C(w_{i-1} w_i) = 0
\end{cases}    (7)

where d is the Good-Turing discount factor and k is a count threshold. The back-off weight α(w_{i-1}) is determined by the normalization constraint on the model's probabilities, namely \sum_{w_i} P_{\text{katz}}(w_i \mid w_{i-1}) = 1.
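A minimal back-off sketch in the spirit of formula (7), simplified by using a single fixed absolute discount d in place of the Good-Turing discount and threshold (all names are made up for the example):

    def backoff_prob(bigram_counts, unigram_counts, unigram_p, prev, cur, d=0.5):
        # Seen bigrams keep a discounted MLE estimate; the probability mass
        # freed by the discount is redistributed to unseen bigrams through
        # the unigram model, scaled so the distribution still sums to 1.
        c_prev = unigram_counts.get(prev, 0)
        c_bigram = bigram_counts.get((prev, cur), 0)
        if c_bigram > 0:
            return (c_bigram - d) / c_prev
        seen = [w for (p, w) in bigram_counts if p == prev]
        freed_mass = d * len(seen) / c_prev if c_prev else 1.0
        unseen_mass = 1.0 - sum(unigram_p.get(w, 0.0) for w in seen)
        alpha = freed_mass / unseen_mass if unseen_mass > 0 else 0.0
        return alpha * unigram_p.get(cur, 0.0)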

Decoding algorithms for the n-gram model

Why does an n-gram model need a decoding algorithm? Consider the pinyin-to-character conversion problem again: the input pinyin nixianzaiganshenme can correspond to many conversion results. (The original article illustrates this with a figure showing part of the candidate word lattice.) The candidate words form a complex lattice structure, any path through it from start to end is a possible conversion result, and choosing the most suitable result among the many candidates requires a decoding algorithm.

The most commonly used decoding algorithm is the Viterbi algorithm, which uses dynamic programming to find the best path efficiently. The algorithm is not described in detail here.
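Still, a minimal sketch may help make the idea concrete (the lattice representation and function names are made up for the example; a bigram model is assumed):

    def viterbi(lattice, bigram_logp):
        # lattice: a list of positions, each a list of candidate words.
        # bigram_logp(prev, cur): returns log P(cur | prev).
        # best[word] = (log-probability of the best path ending in word, that path)
        best = {w: (bigram_logp("<s>", w), [w]) for w in lattice[0]}
        for candidates in lattice[1:]:
            new_best = {}
            for cur in candidates:
                score, path = max(
                    (best[prev][0] + bigram_logp(prev, cur), best[prev][1])
                    for prev in best
                )
                new_best[cur] = (score, path + [cur])
            best = new_best
        return max(best.values())[1]   # the path with the highest total score

Working in log space turns the product of formula (2) into a sum and avoids floating-point underflow on long sentences; keeping only the best path per candidate word at each position is what makes the dynamic programming efficient.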

Applications of the n-gram model

N-gram language models are applied very widely; the earliest applications were speech recognition, machine translation, and so on. Professor Wang Xiaolong of Harbin Institute of Technology was the first to apply them to the pinyin-to-character conversion problem, proposing a "sentence-level pinyin input method"; the technology was later transferred to Microsoft and became Microsoft Pinyin Input Method. Starting with Windows 95, the input method was installed automatically, and the latest version of Microsoft Pinyin has been integrated into subsequent releases of Windows and the Office suite. Years later, the rising stars among input methods (such as Sogou and Google) also adopted n-gram technology.
