The neural probabilistic language model write-up consists of three parts: the problem, the model and training criterion, and the experimental results.
1. Language Model Problems
The language model problem: given a dictionary of V words, make a binary decision for a string of words, judging whether it conforms to the expression habits of the language. That is, the output value is 0 or 1.
The probabilistic language model relaxes this restriction so that the value ranges between 0 and 1, and the probabilities of all strings sum to 1. Wikipedia describes a probabilistic language model as a probability distribution over word sequences, assigning a probability to each string. Note, however, that computing this distribution directly is unrealistic: the number of possible strings is theoretically infinite, and estimating the distribution directly leads to the curse of dimensionality.
To solve this problem, we first introduce the chain rule, viewing the probability of the i-th word in the string as determined by the preceding i-1 words:

P(w_1, w_2, ..., w_T) = ∏_{i=1}^{T} P(w_i | w_1, ..., w_{i-1})
However, this formula is still too complicated, so it is simplified further: the probability of the i-th word is assumed to depend only on the preceding n-1 words (the n-gram, or Markov, assumption). The formula then becomes:

P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
The modeling task is now straightforward: compute this conditional probability, i.e., for a given context of n-1 preceding words, compute the probability that each word in the dictionary appears next.
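A minimal sketch (not from the original post) of how the n-gram assumption turns a sentence probability into a product of conditional probabilities; the probability table and its values are made up purely for illustration:

```python
from typing import Dict, Tuple

def sentence_prob(words, cond_prob: Dict[Tuple[str, ...], float], n: int = 3) -> float:
    """P(w_1..w_T) ~= product over i of P(w_i | previous n-1 words)."""
    prob = 1.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - (n - 1)):i])  # at most n-1 preceding words
        prob *= cond_prob.get(context + (w,), 1e-8)    # tiny floor for unseen n-grams
    return prob

# Toy usage: these conditional probabilities are invented for the example.
table = {("the",): 0.1, ("the", "cat"): 0.02, ("the", "cat", "sat"): 0.3}
print(sentence_prob(["the", "cat", "sat"], table, n=3))   # 0.1 * 0.02 * 0.3
```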
2. Model and Criterion
2.1 Data: a batch of labeled training samples.
2.2 Model
Figure 1. Model Diagram
Picture from: http://licstar.net/archives/328
Modeling steps
2.2.1 Mapping: map each input word to an m-dimensional word vector through the mapping table (the table look-up step in Figure 1). The table to be looked up is not given in advance; it is a by-product of model training (the same kind of word vectors that word2vec later produces).
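A small sketch of this look-up table; the array sizes, indices, and the name C are illustrative, not taken from the post:

```python
import numpy as np

V, m = 10000, 30                          # vocabulary size and word-vector dimension
rng = np.random.default_rng(0)
C = rng.normal(scale=0.01, size=(V, m))   # the mapping table: one m-dim row per word;
                                          # it is a trainable parameter learned with the model

word_ids = [17, 42, 7, 123]               # indices of the n-1 = 4 context words
word_vectors = C[word_ids]                # table look-up: shape (n-1, m)
print(word_vectors.shape)                 # (4, 30)
```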
2.2.2 Concatenation: concatenate the n-1 m-dimensional vectors from 2.2.1 head to tail into a single (n-1)*m-dimensional vector. That is, at the input to the middle tanh layer, the word vectors C(w) are merged into one long vector of dimension (n-1)*m. [Mikolov's RNNLM changes this step: it looks not only at the previous n-1 words but at all the words before the current word.]
2.2.3 Nonlinear transformation: apply a nonlinear transformation at the middle tanh layer. This requires a transformation matrix H and a bias vector d: the input to this layer is the (n-1)*m-dimensional vector x, and the output is hidden = tanh(Hx + d). After this transformation, the (n-1)*m-dimensional vector becomes an h-dimensional vector.
2.2.4 Output: process the output at the last (softmax) layer. This requires another transformation matrix W and a bias vector b: y = W·hidden + b, followed by a softmax. Note that the output is a V-dimensional vector, matching the size of the dictionary from section 1; the real number in each dimension is the probability of outputting the corresponding word.
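Putting steps 2.2.1 to 2.2.4 together, here is a minimal forward-pass sketch, assuming the simplified form hidden = tanh(Hx + d), y = W·hidden + b (Bengio's original model also allows optional direct input-to-output connections, omitted here); all sizes and names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

V, m, n, h = 10000, 30, 5, 100             # vocabulary, vector dim, context size, hidden dim
rng = np.random.default_rng(0)
C = rng.normal(scale=0.01, size=(V, m))            # word-vector table (2.2.1)
H = rng.normal(scale=0.01, size=(h, (n - 1) * m))  # hidden-layer matrix
d = np.zeros(h)                                    # hidden-layer bias
W = rng.normal(scale=0.01, size=(V, h))            # output-layer matrix
b = np.zeros(V)                                    # output-layer bias

context = [17, 42, 7, 123]                 # the previous n-1 = 4 word indices
x = C[context].reshape(-1)                 # 2.2.2: concatenate into an (n-1)*m vector
hidden = np.tanh(H @ x + d)                # 2.2.3: nonlinear transformation to h dims
p = softmax(W @ hidden + b)                # 2.2.4: V-dim probability distribution
print(p.shape, p.sum())                    # (10000,) and approximately 1.0
```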
2.3 Criterion
For all training samples, the criterion is to minimize the average negative log-likelihood of the true next word plus a regularization term. Gradient descent can be used for the optimization.
Hyperparameters that must be set manually: the context size n of the model, the word-vector dimension m, and the hidden-layer dimension h.
Parameters to be optimized: the transformation matrices H and W, the bias vectors d and b, and the word-vector table used in the look-up, which is where the word vectors we finally obtain come from.
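A rough sketch of this criterion for a single sample, assuming an L2 regularizer (the weight `reg` and the list of regularized matrices are illustrative); in training, the loss is averaged over all (context, next-word) samples and H, d, W, b and the word-vector table are updated by (stochastic) gradient descent:

```python
import numpy as np

def nll_loss(p, target_id, weight_matrices, reg=1e-5):
    """p: V-dim softmax output for one context; target_id: index of the true next word."""
    data_term = -np.log(p[target_id] + 1e-12)                      # negative log-likelihood
    reg_term = reg * sum((w ** 2).sum() for w in weight_matrices)  # e.g. over H and W
    return data_term + reg_term

# Toy usage with placeholder values; the update rule is
#   theta <- theta - learning_rate * dLoss/dtheta
p = np.full(10000, 1.0 / 10000)                          # uniform placeholder softmax output
weights = [np.zeros((100, 120)), np.zeros((10000, 100))]
print(nll_loss(p, target_id=42, weight_matrices=weights))
```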
3. Experimental Results
Language model perplexity. Perplexity is a measure used to evaluate the quality of different language models (another measure is word error rate; Mikolov's doctoral thesis, Statistical Language Models Based on Neural Networks, introduces and compares the two). Given a test set, the lower a model's perplexity on that test set, the better.
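A minimal sketch of how perplexity is computed from the per-word probabilities a model assigns on a test set (the values below are toy numbers):

```python
import math

def perplexity(word_probs):
    """word_probs: the model's probability for every word position in the test set."""
    avg_nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_nll)   # lower is better

print(perplexity([0.1, 0.05, 0.2]))   # toy values; a real test set has one prob per token
```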
Test set 1:
Brown Corpus, roughly 1.18 million words in total: 800,000 for training, 200,000 for validation, and the remaining roughly 180,000 as the benchmark test set.
With n = 5, m = 30, h = 100, the NNLM reaches a perplexity of 270. The best n-gram model (n = 3) on this test set has a perplexity of 312. After fusing the two models with an appropriate weight, the perplexity drops to 252.
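The fusion mentioned here is, as far as I understand Bengio's setup, a weighted mixture of the two models' predictions; a simple form, with the mixing weight λ chosen on the validation set, is:

```latex
P_{\mathrm{mix}}(w_i \mid w_{i-n+1},\dots,w_{i-1})
  = \lambda\, P_{\mathrm{NNLM}}(w_i \mid w_{i-n+1},\dots,w_{i-1})
  + (1-\lambda)\, P_{\text{n-gram}}(w_i \mid w_{i-n+1},\dots,w_{i-1})
```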
Test Set 2:
AP News corpus, split into a training portion, a validation portion, and a benchmark test set.
With n = 6, m = 100, h = 109, the NNLM reaches a perplexity of 109. The best n-gram model on this test set has a perplexity of 117.
Online learning materials:
A blog post about the Neural Network Language Model.
Mikolov's doctoral thesis, Statistical Language Models Based on Neural Networks, gives a concise assessment of n-gram language models: they are simple and useful ("N-gram models are today still considered as state of the art not because there are no better techniques, but because those better techniques are computationally much more complex, and provide just marginal improvements, not critical for success of given application."). The key choices in an n-gram model are the value of n and the smoothing technique. Its inherent disadvantages include:
First, the value of n in an n-gram model cannot be made large, because the number of possible n-grams grows exponentially with n (with a 100,000-word vocabulary there are already 10^15 possible trigrams). This means the model cannot effectively exploit longer contexts; even with a large amount of training corpus, the n-gram model cannot effectively capture long-distance language phenomena.
Second, even if the restriction on n were lifted, the n-gram model still could not use long-distance context information. For example, in the sentence "The sky above our heads is blue", the word "blue" has a very strong dependency on the word "sky", and this dependency is not broken no matter how many words are inserted between them (e.g., "The sky this morning was blue"). For an n-gram model, however, even with a very large n, such long-distance language phenomena cannot be captured effectively.
Third, the n-gram model cannot recognize similar words. For example, even if the training corpus contains "Party will be on Monday" and "Party will be on Tuesday", the model still cannot assign a higher probability to the sentence "Party will be on Friday". Although we clearly understand that "Monday", "Tuesday", and "Friday" are similar concepts, an n-gram model built purely on surface word forms cannot identify this similarity.
Deep Learning Study Notes (4)