Paper: The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models

Introduction
This paper presents a method for learning fixed-size representations of variable-length sequences, applies it to feedforward neural network language models (FNN-LMs), and obtains good experimental results. The authors improve the FNN language model by replacing the one-hot vectors in the original input layer with FOFE-encoded sequences.

Fixed-size Ordinally-Forgetting Encoding (FOFE)
Let K denote the given vocabulary size. FOFE builds on one-hot encoding, which represents each word as a K-dimensional vector. FOFE encodes a variable-length sequence using the following formula:
z_t = α * z_{t-1} + e_t  (1 ≤ t ≤ T)
Here z_t denotes the FOFE encoding of the subsequence from the first word w_1 of the input sequence up to the t-th word w_t (with z_0 = 0), α is the forgetting factor (a constant), and e_t is the one-hot vector of the word w_t.
Then z_t can be regarded as a fixed-size vector representation of the sequence {w_1, w_2, ..., w_t}.
For example, if the vocabulary is
a=[1,0,0]
b=[0,1,0]
c=[0,0,1]
then, by direct computation, one obtains
abc = [α^2, α, 1]
abcbc = [α^4, α + α^3, 1 + α^2]
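As a sanity check on the example above, the FOFE recursion can be implemented in a few lines. This is a minimal sketch; the function name `fofe` and the toy three-word vocabulary are illustrative, not from the paper:

```python
def fofe(sequence, vocab, alpha):
    """Encode a word sequence as z_t = alpha * z_{t-1} + e_t, with z_0 = 0."""
    index = {word: i for i, word in enumerate(vocab)}
    z = [0.0] * len(vocab)
    for word in sequence:
        z = [alpha * zi for zi in z]  # decay every earlier word's weight
        z[index[word]] += 1.0         # add the one-hot vector e_t
    return z

# With alpha = 0.5, the codes above become concrete numbers:
print(fofe("abc", "abc", 0.5))    # [a^2, a, 1]           -> [0.25, 0.5, 1.0]
print(fofe("abcbc", "abc", 0.5))  # [a^4, a+a^3, 1+a^2]   -> [0.0625, 0.625, 1.25]
```

Because α < 1, each earlier word's contribution shrinks geometrically with its distance from the end of the sequence, which is exactly the "forgetting" in the method's name.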
The FOFE code has two useful uniqueness properties:
1. If 0 < α ≤ 0.5, FOFE is unique for any K and T.
2. If 0.5 < α < 1, FOFE is unique for almost all K and T, with only a finite set of α values as exceptions.

Model
The traditional neural probabilistic language model (Bengio et al.) feeds one-hot vectors into the input layer; a word-embedding matrix then maps each of them to a low-dimensional real-valued vector (say of dimension m, for an n-gram model). The embeddings of the preceding n-1 words are concatenated into an m(n-1)-dimensional projection layer, which is then passed through the hidden layers to form the output layer.
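The dimensional contrast between this concatenated projection layer and one fed by a FOFE code of the same history (the replacement the paper makes at the input layer) can be sketched concretely. The vocabulary size K = 4, embedding dimension m = 2, embedding matrix W, and forgetting factor α = 0.7 below are all illustrative:

```python
# Toy contrast between the two input layers: K = 4 words, m = 2 dims,
# n = 4 (so the history is the preceding n-1 = 3 words). All values made up.
K, m, alpha = 4, 2, 0.7
W = [[0.1, 0.2],   # embedding of word 0
     [0.3, 0.4],   # embedding of word 1
     [0.5, 0.6],   # embedding of word 2
     [0.7, 0.8]]   # embedding of word 3

history = [2, 0, 3]  # the preceding n-1 words, most recent last

# Traditional FNN-LM: concatenate the n-1 embeddings -> m*(n-1) = 6 dims.
concat = [x for idx in history for x in W[idx]]

# FOFE FNN-LM: encode the history as z_t = alpha*z_{t-1} + e_t, then
# project: p = z W -> m = 2 dims, independent of the history length.
z = [0.0] * K
for idx in history:
    z = [alpha * zi for zi in z]
    z[idx] += 1.0
fofe_proj = [sum(z[i] * W[i][j] for i in range(K)) for j in range(m)]

print(len(concat), len(fofe_proj))  # 6 2
```

The FOFE projection is just a decayed weighted sum of the history's embeddings, with the most recent word weighted most heavily.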
In this paper, the authors' change is made at the input layer: they replace the one-hot vectors of the original input layer with FOFE encodings. When the preceding n-1 words are FOFE-encoded, the influence of earlier words on the final encoding is gradually attenuated, which means that words closer to the target word have a larger effect on its prediction. Moreover, FOFE encoding can reduce the dimensionality of the projection layer: for a 1st-order FOFE FNN-LM, the projection layer has dimension m (the word-embedding dimension), although this does not reduce the model's complexity.

Experiment
The authors carried out comparative experiments on two datasets:
1. The Penn Treebank (PTB) corpus (about 1,000,000 words, with a vocabulary size of 10,000).
2. The Large Text Compression Benchmark (LTCB), from which the authors use the enwik9 dataset: the first 10^9 bytes of enwiki-20060303-pages-articles.xml. The training set contains 153M words, the validation set 8.9M, and the test set 8.9M; the vocabulary size is 80,000, and words outside the vocabulary are replaced with an <UNK> tag.
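The out-of-vocabulary handling described above can be sketched in a line or two; the toy vocabulary and the function name `apply_unk` here are illustrative, not the authors' preprocessing code:

```python
# Words outside the (here, toy) vocabulary are mapped to an <UNK> tag.
vocab = {"the", "cat", "sat"}

def apply_unk(tokens):
    return [w if w in vocab else "<UNK>" for w in tokens]

print(apply_unk(["the", "dog", "sat"]))  # ['the', '<UNK>', 'sat']
```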
The metric generally used to evaluate a language model is perplexity. The basic idea is that a language model that assigns higher probability to the test-set sentences is better: the test-set sentences are ordinary, well-formed sentences, so the higher the probability a trained model assigns to the test set, the better the model. The specific formula is as follows:
PP = P(w_1, w_2, ..., w_N)^(-1/N)
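Under the usual definition PP = P(w_1 ... w_N)^(-1/N), perplexity is computed in log space for numerical stability. A minimal sketch, where the per-word probabilities are made-up numbers rather than real model outputs:

```python
import math

def perplexity(word_probs):
    """PP = P(w_1..w_N)^(-1/N), via the average negative log-probability."""
    n = len(word_probs)
    avg_neg_log = -sum(math.log(p) for p in word_probs) / n
    return math.exp(avg_neg_log)

# Illustrative per-word probabilities a model assigns on a test sentence:
probs = [0.25, 0.5, 0.125, 0.25]  # joint probability 2^-8 over N = 4 words
print(perplexity(probs))          # 2^(8/4) = 4, so approximately 4.0
```

A lower perplexity means the model assigns higher probability to the test sentences; for reference, a uniform model over the 10,000-word PTB vocabulary would have perplexity 10,000.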