A Neural Probabilistic Language Model. This paper was published by Bengio and colleagues in 2003 and can be regarded as the origin of distributed word representations (word embeddings). A brief translation is provided here.
A Neural Probabilistic Language Model
Abstract
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model is tested is likely to be different from all the word sequences seen during training. Traditional but very successful n-gram-based approaches obtain generalization by concatenating very short overlapping sequences seen in the training set. To fight the curse of dimensionality, we propose to learn a distributed representation for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model thus learns simultaneously (1) a distributed representation for each word and (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before receives high probability if it is made of words similar (in the sense of having a nearby representation) to words forming a sentence that has already been seen. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report experiments using neural networks for the probability function on two text corpora, showing that the proposed approach significantly improves on a state-of-the-art n-gram model and allows taking advantage of longer contexts.
Keywords: statistical language model, artificial neural network, distributed representation, curse of dimensionality
1. Introduction
A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious when one wants to model the joint distribution of many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task). For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary of size 100000, there are potentially 100000^10 − 1 = 10^50 − 1 free parameters. When modeling continuous variables, generalization is obtained more easily (e.g. with smooth classes of functions such as multi-layer neural networks or Gaussian mixture models), because the function to be learned can be expected to have some local smoothness. For discrete spaces, the generalization structure is not as obvious: any change of one of the discrete variables may have a drastic impact on the value of the function to be estimated, and when the number of values each discrete variable can take is large, most observed objects are almost maximally far from each other in Hamming distance.
A statistical language model can be represented by the conditional probability of the next word given all the previous ones:
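The original formula is not reproduced in this translation; reconstructed from the surrounding description, it would read:

$$\hat{P}(w_1^T) = \prod_{t=1}^{T} \hat{P}(w_t \mid w_1^{t-1})$$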
Here w_t is the t-th word, and a subsequence is written w_i^j = (w_i, w_{i+1}, ..., w_j). Such statistical language models have been found useful in many natural language processing applications, such as speech recognition, language translation, and information retrieval. Improvements in statistical language models can therefore have a significant impact on these applications.
When building a statistical language model, one way to reduce the difficulty of the problem is to exploit the fact that words closer together in a sequence are statistically more dependent. Thus, the n-gram model builds the conditional probability of the n-th word given the previous n−1 words:
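Reconstructing the missing formula from the description, the n-gram approximation is:

$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$$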
Only combinations of consecutive words that actually appear in the training set, or that occur frequently enough, are considered. What happens when a new combination of n words that was not seen in the corpus appears? We do not want to assign it zero probability, because such a combination is certainly possible. A simple answer is to look at the probability predicted using a smaller context, i.e. to back off to a trigram or to smooth the trigram. Essentially, a new word sequence is generated by "gluing" together very short, overlapping, frequent pieces of the training corpus; the rules for obtaining the probability of the next piece are implicit in back-off or discounted n-gram algorithms. Researchers typically use n = 3, i.e. trigrams, and obtain state-of-the-art results with them. Obviously, there is much more information in the sequence that immediately precedes the word to be predicted than just the identity of the previous couple of words. The approach proposed in this paper improves on the above in at least two respects. First, the above approach does not take into account contexts farther than one or two words; second, it does not take into account the similarity between words. For example, having seen the sequence "The cat is walking in the bedroom" in the training corpus should help us generate the sentence "A dog was running in a room", because "dog" and "cat" have similar semantic and syntactic roles.
Many methods have been proposed to address these two problems; we give a brief overview in Section 1.2. We first discuss the basic idea of the proposed approach; a more formal presentation follows in Section 2. These ideas are implemented with a multi-layer neural network that shares parameters. Another contribution of this paper is to show how such a large neural network can be trained efficiently on a large amount of data. A final, important contribution is to show that training such a large-scale model is expensive but worthwhile.
Many operations in this paper use matrix notation: a lowercase letter v denotes a column vector, v' denotes its transpose, A_j denotes the j-th row of matrix A, and x·y denotes x'y.
1.1 Fighting the Curse of Dimensionality with Distributed Representations
In short, the idea of the proposed approach can be summarized in the following three steps:
1. Associate a distributed word feature vector with each word in the vocabulary.
2. Express the joint probability function of word sequences in terms of the feature vectors of the words appearing in the sequence.
3. Learn simultaneously the word feature vectors and the parameters of that probability function.
A word feature vector represents different aspects of a word: each word is associated with a point in a vector space. The number of features is much smaller than the size of the vocabulary. The probability function is expressed as a product of conditional probabilities of the next word given the previous ones (for example, in the experiments, a multi-layer neural network is used to predict the next word given the previous ones). This function has parameters that are tuned iteratively to maximize the log-likelihood of the training data. The feature vectors associated with words are learned, but they could be initialized using prior knowledge of semantic features.
Why does this work? In the previous example, if we know that "dog" and "cat" play similar roles (semantically and syntactically), and likewise for (the, a), (bedroom, room), (is, was), (running, walking), we can naturally generalize from
The cat is walking in the bedroom
to
A dog was running in a room
or
The cat is running in a room
A dog is walking in a bedroom
The dog was walking in the room
...
and many more combinations. In the proposed model, such generalizations arise because similar words are expected to have similar feature vectors, and because the probability function is a smooth function of these feature values, so a small change in the features induces only a small change in the probability. Therefore, the presence of just one of these sentences in the corpus increases the probability of all of them.
1.2 Relation to Previous Work
Neural networks have previously been used to model high-dimensional discrete distributions and have been found to learn their joint probability effectively. In that model, the joint probability is decomposed as a product of conditional probabilities.
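The decomposition itself is not shown in this translation; reconstructed from the description that follows, it would read roughly:

$$P(Z_1 = z_1, \ldots, Z_n = z_n) = \prod_{i=1}^{n} P(Z_i = z_i \mid g_i(z_{i-1}, z_{i-2}, \ldots, z_1))$$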
Here g(x) is a function represented by a neural network with a left-to-right architecture, and the i-th output block g_i computes the parameters expressing the conditional distribution of Z_i given the preceding Z's. Experiments on four UCI data sets showed that this approach works comparatively well. Here we must handle variable-length data such as sentences, so the above approach must be adapted.
2. A Neural Model
The training set is a sequence of words w_1, ..., w_T, where each w_t belongs to V, a large but finite vocabulary. The objective is to learn a good model to estimate the conditional probability:
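The estimated function, reconstructed from the surrounding text, is:

$$f(w_t, w_{t-1}, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$$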
The following constraints must be met:
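Reconstructed from the description, these constraints would be:

$$f(i, w_{t-1}, \ldots, w_{t-n+1}) > 0, \qquad \sum_{i=1}^{|V|} f(i, w_{t-1}, \ldots, w_{t-n+1}) = 1$$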
Here w_t denotes the t-th word of the sequence, V the vocabulary, and |V| the vocabulary size. The joint probability of a word sequence is obtained as the product of these conditional probabilities.
We decompose the function f into two parts:
1. A mapping C from any element i of the vocabulary V to a real vector C(i) ∈ R^m. It represents the distributed feature vector associated with each word in the vocabulary. In practice, C is represented by a |V| × m matrix of free parameters.
2. A probability function over words, expressed with C: a function g that maps the sequence of feature vectors of the context words, (C(w_{t−n+1}), ..., C(w_{t−1})), to a conditional probability distribution over the vocabulary V for the next word w_t. The output of g is a vector whose i-th element estimates the probability that the next word is i, as shown in
Figure 1: Neural network language model structure
The function f is the composition of the two mappings C and g, with each mapping associated with some parameters. The parameters of the mapping C are simply the feature vectors themselves, represented by a |V| × m matrix C whose row i is the feature vector C(i) of word i. The function g may be implemented by a feed-forward or recurrent neural network, or by another parametrized function.
Training is achieved by searching for θ that maximizes the penalized log-likelihood of the training data.
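A typical form of this objective, reconstructed from the description, is:

$$L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta)$$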
Here θ denotes all the parameters and R(θ) is a regularization term. For example, in our experiments R is a weight-decay penalty applied only to the weights of the neural network and to the matrix C, not to the biases.
In the above model, the number of free parameters scales only linearly with the vocabulary size |V|. It also scales only linearly with the order n of the model.
In most of the experiments below, the neural network has one hidden layer beyond the word feature mapping, and optional direct connections from the word features to the output layer. Therefore, there are really two hidden layers: the shared word feature layer C and the ordinary hyperbolic tangent hidden layer.
The output layer uses the softmax function:
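The softmax itself, reconstructed from the surrounding text, is:

$$\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}}$$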
Here y_i is the unnormalized log-probability for each output word i, computed as follows:
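Reconstructed from the parameter description in the next paragraph, the vector y of unnormalized log-probabilities is computed as:

$$y = b + Wx + U\tanh(d + Hx)$$

where x is the concatenation (C(w_{t−1}), C(w_{t−2}), ..., C(w_{t−n+1})) of the context word feature vectors.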
Here b, W, U, d, and H are all parameters; with input x, θ = (b, W, U, d, H). The hyperbolic tangent tanh is applied element by element to its argument vector. When the number of hidden units is h and the vocabulary size is |V|, b is a column vector of dimension |V|, W is a |V| × (n−1)m matrix, U is a |V| × h matrix, d is a column vector of dimension h, and H is an h × (n−1)m matrix. Note that in an ordinary neural network the inputs are not optimized, whereas here x = (C(w_{t−1}), C(w_{t−2}), ..., C(w_{t−n+1})) is itself made of parameters to be optimized. In Figure 1, if no direct connections from the word features to the output are desired, W is set to 0.
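As an illustration only, here is a minimal NumPy sketch of this forward pass, assuming small made-up dimensions (not the ones used in the paper) and random initialization; the names C, H, d, U, W, b mirror the notation above.

```python
import numpy as np

# Minimal sketch of y = b + W x + U tanh(d + H x) followed by a softmax.
V, m, h, n = 1000, 30, 50, 4          # vocabulary size, feature dim, hidden units, order (example values)

rng = np.random.default_rng(0)
C = rng.normal(scale=0.01, size=(V, m))            # word feature matrix, |V| x m
H = rng.normal(scale=0.01, size=(h, (n - 1) * m))  # hidden-layer weights
d = np.zeros(h)                                    # hidden-layer biases
U = rng.normal(scale=0.01, size=(V, h))            # hidden-to-output weights
W = np.zeros((V, (n - 1) * m))                     # optional direct connections (0 = disabled)
b = np.zeros(V)                                    # output biases

def forward(context_word_ids):
    """Return P(w_t = i | context) for every word i in the vocabulary."""
    x = C[context_word_ids].reshape(-1)            # concatenate the n-1 context feature vectors
    y = b + W @ x + U @ np.tanh(d + H @ x)         # unnormalized log-probabilities
    y -= y.max()                                   # numerical stability before exponentiation
    p = np.exp(y)
    return p / p.sum()

probs = forward([3, 17, 42])                       # indices of w_{t-1}, ..., w_{t-n+1}
print(probs.shape, probs.sum())                    # (1000,) ~1.0
```

Training would then adjust C, H, d, U, W, and b (and hence x) by the stochastic gradient update described below.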
The number of free parameters is |V|(1 + nm + h) + h(1 + (n−1)m). The dominating factor is |V|(nm + h).
If stochastic gradient ascent is used, the gradient update rule is as follows:
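The update, reconstructed from the description, is:

$$\theta \leftarrow \theta + \varepsilon \, \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1})}{\partial \theta}$$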
Here ε is the learning rate. Note that in an ordinary neural network the input layer is just given values; here the input layer x is also built from parameters (those in C) and must be optimized. Once the optimization is done, training of the language model is complete.
3. Parallel Implementation
Even though the number of parameters scales nicely, i.e. linearly with the input window size n and with the vocabulary size |V|, the total amount of computation is much greater than for n-grams. The main reason is that with n-gram models, obtaining a particular P(w_t | w_{t−1}, ..., w_{t−n+1}) does not require computing probabilities over all the words in the vocabulary, because of the simple normalization used. The main computational bottleneck of the neural network is at the output layer. Running the model (both during training and testing) on a parallel computer reduces computation time. We have explored parallelization on two platforms: shared-memory multiprocessors and Linux clusters.
3.1 Data-Parallel Processing
Parallel processing is easy to implement on a shared-memory multiprocessor, thanks to the low communication overhead. In that case we chose a data-parallel implementation, in which each processor works on a different subset of the data. Each processor computes the gradient for its training examples and performs stochastic gradient descent updates on the parameters shared in memory. Our first implementation was very slow because it used synchronization to avoid write conflicts: most of the processors' time was wasted waiting for the others.
Instead, we chose an asynchronous implementation in which each processor may write to the shared memory at any time. Occasionally some updates are lost because of write-write conflicts, which introduces a little noise into the parameter updates. However, this noise turns out to be insignificant.
Unfortunately, large shared-memory computers are expensive, and their processors tend to lag behind those of CPU clusters. We were therefore able to obtain faster training on fast-network clusters.
3.2 Parameter-Parallel Processing
If the parallel computer is a network of CPUs, we generally cannot afford to exchange the parameters too frequently, since they represent megabytes of data and exchanging them would consume too much time. Instead, we chose to parallelize across the parameters, in particular those of the output units, because that is where the vast majority of the computation takes place in our architecture. Each CPU is responsible for computing the unnormalized probabilities for a subset of the outputs. This strategy allows us to implement a parallelized stochastic gradient descent with negligible communication overhead. Essentially, the CPUs need to exchange two kinds of data: (1) the normalization factor of the output layer, and (2) the gradients at the hidden layer and word feature layer. All CPUs duplicate the computations that precede the output layer, but these are negligible compared with the total amount of computation.
For example, consider the experiment on AP News: vocabulary size |V| = 17964, number of hidden units h = 60, order n = 6, word feature dimension m = 100. With these values, the computational cost of a single training example is |V|(1 + nm + h) + h(1 + nm) + nm, and the share required by the output layer dominates.
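As a rough, back-of-the-envelope count (assuming one operation per term in the cost formula above):

$$|V|(1 + nm + h) = 17964 \times 661 \approx 1.19 \times 10^{7}, \qquad h(1 + nm) + nm = 36\,660,$$

so the output layer accounts for roughly 99.7% of the operations per example.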
This calculation is only approximate, because actual CPU time varies with the type of operation, but it shows that parallelizing the output-layer computation has a very positive impact. The fact that all CPUs replicate a small part of the computation has little effect on the total computation time. If the number of hidden units were huge, parallelizing their computation would also be helpful, but we did not need to do so in our experiments.
In the notation used below, "·" denotes the Cartesian product, "'" denotes matrix transposition, and CPU i (with i ranging from 0 to M−1) is responsible for a block of output units starting at number start_i = i × ⌈|V|/M⌉ and of length min(⌈|V|/M⌉, |V| − start_i).
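As an illustration only, the following short Python sketch computes this block partition for hypothetical values of |V| and of the number of CPUs M:

```python
import math

# Hypothetical example values; |V| = 17964 matches the AP News experiment, M is the number of CPUs.
V_size, M = 17964, 8
block = math.ceil(V_size / M)                      # ceil(|V| / M) outputs per CPU
for i in range(M):
    start_i = i * block                            # first output unit handled by CPU i
    length_i = min(block, V_size - start_i)        # the last block may be shorter
    print(f"CPU {i}: outputs {start_i} .. {start_i + length_i - 1}")
```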
The weight-decay regularization is not shown above, but it can be implemented simply. Note that parameter updates are applied immediately rather than accumulated into a parameter gradient vector, which improves speed.
Some problems can arise in the forward computation phase: the p_j can all be numerically 0, or one of the y_j can be too large to exponentiate. To avoid these problems, the usual solution is to subtract the largest of the y_j before taking exponentials. We can therefore add an all-reduce operation before computing the p_j, to share the maximum of the y_j among the M processors.
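As an illustration only (a sketch, not the paper's code), the following NumPy snippet simulates this stabilization trick, treating slices of one array as the per-CPU blocks; in a real cluster the shared maximum and the shared normalizer would each be obtained with an all-reduce operation:

```python
import numpy as np

# Simulate 4 "CPUs", each holding a block of unnormalized log-probabilities y_j.
rng = np.random.default_rng(1)
y_blocks = [rng.normal(scale=200.0, size=100) for _ in range(4)]

global_max = max(block.max() for block in y_blocks)              # shared via an all-reduce (MAX)
exp_blocks = [np.exp(block - global_max) for block in y_blocks]  # now safe to exponentiate
Z = sum(block.sum() for block in exp_blocks)                     # shared normalization factor
p_blocks = [block / Z for block in exp_blocks]                   # normalized probabilities
print(sum(block.sum() for block in p_blocks))                    # ~1.0
```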
Efficient parallelization can still be achieved on a low-speed cluster: rather than communicating after each training example, it is better to communicate only every K training examples. This requires storing the activations and gradients of the neural network for K examples. After the forward phase on the K examples, the sums of probabilities must be shared among the processors. Then the K backward phases are initiated. After exchanging these gradient vectors, each processor can complete its backward phase and update the parameters. If K is too large, it can lead to convergence problems.
4. Experimental Results
Comparative experiments were carried out on the Brown corpus, which contains 1181041 words. The first 800000 words were used for training, the following 200000 for tuning the model parameters, and the remaining 181041 for testing. The number of distinct words is 47587. Words with frequency at most 3 were merged into a single token, reducing the vocabulary size to |V| = 16383.
Experiments were also run on the 1995 and 1996 AP News text data. The training set is a sequence of about 14 million words, the development set contains about 1 million words, and the test set also about 1 million words. The data contain 148721 distinct words. The vocabulary was reduced to |V| = 17964 by keeping only the most frequent words, mapping uppercase letters to lowercase, and merging numbers and special characters.
For the neural networks, the initial learning rate was set to ε_0 = 0.001 and gradually decreased according to ε_t = ε_0 / (1 + r t), where t is the number of parameter updates performed and r is a decay factor set to 10^{-8}.
4.1 N-gram Models
The first model to be compared against is an interpolated (smoothed) trigram model, whose conditional probability is expressed as:
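The interpolation formula, reconstructed from the notation explained below, would read:

$$\hat{P}(w_t \mid w_{t-1}, w_{t-2}) = \alpha_0(q_t)\,p_0 + \alpha_1(q_t)\,p_1(w_t) + \alpha_2(q_t)\,p_2(w_t \mid w_{t-1}) + \alpha_3(q_t)\,p_3(w_t \mid w_{t-1}, w_{t-2})$$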
Here the conditional mixture weights satisfy α_i(q_t) ≥ 0 and Σ_i α_i(q_t) = 1; p_0 = 1/|V| is the uniform distribution, p_1(w_t) is the unigram, p_2(w_t | w_{t−1}) the bigram, and p_3(w_t | w_{t−1}, w_{t−2}) the trigram. The weights α_i can be estimated with the EM algorithm, which takes about five iterations.
4.2 Results
Test results for the different models are compared in terms of perplexity.
We can see that the neural network language model performs significantly better than the best n-gram model.
5. Conclusion
Experiments were conducted on two corpora, one with more than one million training words and the other with about 15 million words. They show that the approach proposed in this paper yields much better perplexity than a state-of-the-art trigram model.
We believe the main reason is that the approach allows learning a distributed representation that fights the curse of dimensionality. The model could probably be improved further, in terms of architecture, computational efficiency, and use of prior knowledge. An important priority for future research is to improve training speed. A simple way to take advantage of temporal structure and extend the size of the input window is to use a convolutional (time-delay) neural network. More generally, the work introduced here opens the door to improvements of statistical language models obtained by replacing conditional probability tables with smoother functions based on distributed representations. Whereas much effort in statistical language model research has been spent on restricting or summarizing the conditioning variables to avoid overfitting, the approach introduced here shifts the difficulty elsewhere: much more computation is required, but computation and memory requirements scale linearly, rather than exponentially, with the number of conditioning variables.