Language Models in Natural Language Processing (NLP)


After a few days of getting familiar with NLP, let's talk about language models. The content below follows a PPT (see the reference at the end).

I. The statistical language model

1. What is a statistical language model?

A language model is usually constructed as a probability distribution P(s) over strings s, where P(s) reflects the probability that s appears as a sentence.

The probability here refers to how likely this particular combination of words is to appear in the training corpus; it has nothing to do with whether the sentence is grammatically correct. Assuming the training corpus is drawn from human language, this probability can be regarded as the probability that the sentence is something a person would actually say.

2. How to build a statistical language model?

For a sentence s consisting of T words w_1, w_2, ..., w_T in order, P(s) is the joint probability of the whole word sequence. Using Bayes' formula, it can be decomposed by the chain rule as follows:

P(s) = P(w_1, w_2, ..., w_T) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) · ... · P(w_T | w_1, ..., w_{T-1})
As can be seen from the above, a statistical language model can be expressed as: given the preceding words, compute the conditional probability of the next word.
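To make the decomposition concrete, here is a minimal sketch in Python; the cond_prob(word, history) function is a hypothetical placeholder for P(word | history), since how to estimate it is exactly the question discussed next.

```python
# Minimal sketch of the chain-rule decomposition of P(s).
# cond_prob(word, history) is a hypothetical placeholder returning P(word | history).
def sentence_probability(words, cond_prob):
    prob = 1.0
    history = []
    for w in words:
        prob *= cond_prob(w, tuple(history))  # P(w_t | w_1 ... w_{t-1})
        history.append(w)
    return prob
```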

In asking for P(s) we have in fact already built a model, whose parameters are the conditional probabilities P(· | ·); once these parameters have been estimated, it is easy to obtain the probability of any string s.

3. Solving the problem

Suppose the string s is "I want to drink some water"; then according to the model established above:

P(s) = P(I) · P(want | I) · P(to | I want) · P(drink | I want to) · P(some | I want to drink) · P(water | I want to drink some)
The problem boils down to how to estimate each probability above. A more intuitive way, for example, is to count how often "I want to" and "I want to drink" appear in the corpus and then divide:

P(drink | I want to) = count(I want to drink) / count(I want to)
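As an illustration, here is a small sketch of this counting approach on a toy corpus; the corpus and the count_occurrences helper are made up for illustration and are not part of the original text.

```python
# Toy illustration of estimating P(drink | I want to) by counting and dividing.
corpus = [
    "I want to drink some water",
    "I want to eat",
    "I want to drink tea",
]

def count_occurrences(phrase, sentences):
    """Count word-bounded occurrences of a phrase across the sentences."""
    return sum(f" {s} ".count(f" {phrase} ") for s in sentences)

numerator = count_occurrences("I want to drink", corpus)   # 2
denominator = count_occurrences("I want to", corpus)       # 3
print(numerator / denominator)                             # P(drink | I want to) ≈ 0.667
```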


This looks workable, but in fact there are two problems:

(1) Number of free parameters

Assuming all the words in the string come from a dictionary of size V, we need to estimate every conditional probability above; each word position can take V different values, so the number of free parameters in this model is actually V^6, where 6 is the length of the string.
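A quick back-of-the-envelope calculation; the dictionary size of 50,000 is an assumed example value, not a figure from the original text.

```python
# Rough count of free parameters when every conditional probability is stored directly.
V = 50_000               # assumed dictionary size (example value)
T = 6                    # length of the example sentence "I want to drink some water"
print(f"{V ** T:.2e}")   # about 1.56e+28 parameters
```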

As can be seen, the number of free parameters grows exponentially with the length of the string, which makes it almost impossible to estimate these parameters correctly.

(2) Data sparsity

From the above, each word position can take V values, which in principle yields an enormous number of word combinations, yet the actual training corpus contains only a tiny fraction of them. Under maximum likelihood estimation, the resulting probability is therefore very likely to be 0.

4. How to solve these problems?

Having raised these two problems with the traditional statistical language model, this article introduces two methods for addressing them: the N-gram language model and the neural probabilistic language model.

II. The N-gram language model

1. What is an N-gram language model?

To solve the problem of having too many free parameters, the Markov assumption is introduced: the probability of a word appearing depends only on the finite n words that precede it. A statistical language model based on this assumption is called an N-gram language model.

2. How to determine the value of n?

Normally the value of n cannot be too large, otherwise the problem of too many free parameters persists:

(1) When n = 1, the appearance of a word is independent of the words around it. This is called a unigram, i.e., a one-gram language model; the number of free parameters is on the order of the dictionary size V.

(2) When n = 2, the appearance of a word depends only on the one word before it. This is called a bigram, i.e., a two-gram language model, also known as a first-order Markov chain; the number of free parameters is on the order of V^2.

(3) When n = 3, the appearance of a word depends only on the two words before it. This is called a trigram, i.e., a three-gram language model, also known as a second-order Markov chain; the number of free parameters is on the order of V^3.

In general, only the values above are used, because, as can be seen, the number of free parameters is on the order of V raised to the power n.

In terms of model quality, a larger n theoretically gives better results, but the improvement shrinks as n grows. There is also a trade-off between reliability and discrimination: more parameters give better discrimination, but each individual parameter is then supported by fewer training instances, which lowers reliability.

3. Modeling and solving

Solving the N-gram language model is consistent with the traditional statistical language model: we still estimate each conditional probability value, simply by counting how often each n-gram appears in the corpus and then normalizing.
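A minimal sketch of this count-and-normalize (maximum likelihood) procedure for a bigram (n = 2) model; the tiny tokenized corpus is made up for illustration.

```python
from collections import Counter

# Count-and-normalize (maximum likelihood) estimation for a bigram model.
corpus = [["I", "want", "to", "drink", "some", "water"],
          ["I", "want", "to", "eat"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev_word, word):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("want", "to"))    # 2/2 = 1.0
print(bigram_prob("to", "drink"))   # 1/2 = 0.5
```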

4. Smoothing

We raised two problems with the traditional statistical language model: the number of free parameters and data sparsity. The N-gram model above only solves the first problem; smoothing is meant to solve the second.

If a phrase never appears in the training corpus, its count is 0, but we cannot conclude that its probability of occurring is 0: we simply cannot guarantee that the training corpus is complete. So what is the solution? If, by default, we pretend that every phrase appears one extra time, i.e., add 1 to every count regardless of how often the phrase actually occurs, the zero-probability problem disappears.

The method above is add-1 smoothing, also known as Laplace smoothing (a small sketch follows the list below). There are many other smoothing methods, which are only listed here without further introduction:

(1) Additive smoothing

(2) Good-Turing smoothing

(3) Katz smoothing
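A minimal sketch of add-1 (Laplace) smoothing on top of bigram counts; the counts below are illustrative assumptions, and V is the dictionary size.

```python
from collections import Counter

# Add-1 (Laplace) smoothing: every bigram gets one extra count, and the
# denominator grows by V so the probabilities still sum to 1.
unigram_counts = Counter({"I": 2, "want": 2, "to": 2, "drink": 1})
bigram_counts = Counter({("I", "want"): 2, ("want", "to"): 2, ("to", "drink"): 1})
V = len(unigram_counts)  # dictionary size

def smoothed_bigram_prob(prev_word, word):
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(smoothed_bigram_prob("to", "drink"))  # (1 + 1) / (2 + 4) ≈ 0.333
print(smoothed_bigram_prob("to", "want"))   # unseen bigram: (0 + 1) / (2 + 4) ≈ 0.167, not 0
```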

III. The neural probabilistic language model

1. Prerequisite knowledge

In the N-gram language model, the conditional probabilities are computed simply by counting word frequencies and normalizing.

In machine learning, the common practice is to model the problem first, then construct an objective function for it, optimize that objective to obtain a set of optimal parameters, and finally use the model defined by those parameters for prediction.

For the language model above, the objective function is taken to be the maximum log-likelihood:

L = Σ_{w ∈ Corpus} log p(w | Context(w))
Context(w) denotes the context of the word w; under the N-gram approach, it is the n-1 words preceding w. The objective function is then maximized. As can be seen above, the probability is actually a function of w and Context(w):

p(w | Context(w)) = F(w, Context(w); θ)
where θ is a set of parameters to be determined. Computing all the conditional probabilities is thereby converted into optimizing the objective function and solving for θ. By choosing an appropriate model F, the number of parameters in θ can be far smaller than in the N-gram model.
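To illustrate the objective, here is a small sketch that evaluates the log-likelihood Σ log p(w | Context(w)) over a corpus. The prob_fn(word, context, theta) argument is a placeholder for whatever model F one chooses; it and theta are assumptions for illustration, not part of the original text.

```python
import math

# Log-likelihood objective: sum of log p(w | Context(w)) over every position in the corpus.
# prob_fn(word, context, theta) stands in for the model F(w, Context(w); theta).
def log_likelihood(corpus, n, prob_fn, theta):
    total = 0.0
    for sentence in corpus:
        for t, word in enumerate(sentence):
            context = tuple(sentence[max(0, t - (n - 1)):t])  # the n-1 preceding words
            total += math.log(prob_fn(word, context, theta))
    return total

# Training then means choosing theta so that log_likelihood(...) is maximized.
```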

2. What is a neural probabilistic language model?

Bengio et al. published "A Neural Probabilistic Language Model" in 2003, which describes this method in detail.

The basic idea follows the prerequisite knowledge above: since this is a neural probabilistic language model, the implementation naturally involves a neural network, whose structure diagram is as follows:


It consists of four layers: an input layer, a projection layer, a hidden layer, and an output layer.

3. The computation flow

(1) Input layer

The input is the context of the word w; following the N-gram approach, this is the n-1 words preceding w. Each word is fed into the neural network as a one-hot vector of length V.

(2) Projection layer

The projection layer contains a look-up table C, represented as a V × m matrix of free parameters, where V is the dictionary size and m is a user-chosen parameter, typically on the order of 10^2.

Each row of C is a word vector, which can be understood as another, distributed representation of the corresponding word. Each one-hot vector is transformed into a word vector via table C.

The n-1 word vectors are concatenated into a single column vector of length (n-1)m and passed to the next layer, as sketched below.
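A minimal sketch of the projection step with NumPy, assuming a randomly initialized look-up table C of shape (V, m); the vocabulary size, vector size, and word indices are illustrative.

```python
import numpy as np

# Projection layer: each context word indexes a row of the look-up table C,
# and the n-1 word vectors are concatenated into one vector of length (n-1)*m.
V, m, n = 10, 4, 3                          # assumed vocabulary size, vector size, n-gram order
C = np.random.randn(V, m)                   # the V x m free-parameter matrix (word-vector table)

context_ids = [2, 7]                        # indices of the n-1 context words (one-hot positions)
word_vectors = [C[i] for i in context_ids]  # row lookup == multiplying a one-hot vector by C
x = np.concatenate(word_vectors)            # shape ((n-1)*m,), the input to the hidden layer
print(x.shape)                              # (8,)
```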

(3) Hidden layer and output layer

The column vector is then transformed, roughly as follows:

z = tanh(W·x + p),  y = U·z + q
Here tanh is the activation function, and y is the unnormalized log-probability vector; softmax is then applied to normalize it, giving the final probability output.
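Continuing the sketch above, the hidden and output layers can be written as follows; the parameter names W, p, U, q follow the article, while the hidden-layer size and the explicit softmax are illustrative assumptions.

```python
import numpy as np

# Hidden layer: z = tanh(W x + p); output layer: y = U z + q, then softmax over the V words.
V, m, n, h = 10, 4, 3, 5                       # h = assumed hidden-layer size
x = np.random.randn((n - 1) * m)               # projected context vector from the previous sketch
W, p = np.random.randn(h, (n - 1) * m), np.random.randn(h)
U, q = np.random.randn(V, h), np.random.randn(V)

z = np.tanh(W @ x + p)                         # hidden layer
y = U @ z + q                                  # unnormalized log-probabilities (scores)
probs = np.exp(y - y.max()) / np.exp(y - y.max()).sum()   # softmax normalization
print(probs.sum())                             # ~1.0: a proper distribution over the dictionary
```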

In the prerequisite knowledge we mentioned the parameter set θ; in this neural network, the actual parameters are the following two groups:

Word vectors: v(w) for each word w in the dictionary, plus the padding vectors used when the context is shorter than n-1 words

Neural network parameters: W, p, U, q

4. Closing remarks

For the traditional statistical language model we raised two problems: the number of free parameters and data sparsity.

Here the parameter set θ replaces the exponentially many free parameters, which addresses the first problem. As for data sparsity, the softmax normalization at the end produces a smooth probability distribution with no zero entries, so the second problem is solved as well.
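As a small check of that last point, the softmax of any finite score vector is strictly positive, so no word ever receives a probability of exactly 0; the scores below are arbitrary illustrative values.

```python
import numpy as np

# Softmax output is strictly positive for any finite input scores,
# so even unseen word/context combinations receive a non-zero probability.
scores = np.array([5.0, 0.0, -8.0, -20.0])
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs)            # every entry is > 0, even for the very low score -20
print(probs.min() > 0)  # True
```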

Reference: PPT from Little Elephant Academy, teacher Sching.
