[NLP Paper Reading] The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models


Original paper: The Fixed-Size Ordinally-Forgetting Encoding Method for Neural Network Language Models

Introduction

This paper presents a method for learning fixed-size representations of variable-length sequences, applies it to feedforward neural network language models (FNN-LMs), and obtains good experimental results. The authors improve the FNN language model by replacing the one-hot vectors at the original input layer with the FOFE code of the word sequence.

Fixed-size Ordinally-Forgetting Encoding (FOFE)

Let K denote the given vocabulary size. FOFE starts from one-hot encoding, representing each word as a K-dimensional vector, and then encodes a variable-length sequence with the following recursion:
z_t = \alpha \cdot z_{t-1} + e_t \quad (1 \le t \le T)
where z_t denotes the FOFE code of the subsequence from the first word w_1 of the input sequence up to the t-th word w_t (with z_0 = 0), \alpha is the forgetting factor (a constant), and e_t is the one-hot vector corresponding to the word w_t.
z_t can then be regarded as a fixed-size vector representation of the sequence {w_1, w_2, ..., w_t}.
For example, if the vocabulary is
a=[1,0,0]
b=[0,1,0]
c=[0,0,1]
then, by applying the recursion, one gets
{abc} = [\alpha^2, \alpha, 1]
{abcbc} = [\alpha^4, \alpha + \alpha^3, 1 + \alpha^2]
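As a sanity check, here is a minimal NumPy sketch of the recursion (my own illustration, not the authors' code; the function name fofe_encode and the choice \alpha = 0.5 are assumptions for the demo) that reproduces the two codes above:

```python
import numpy as np

def fofe_encode(word_ids, vocab_size, alpha):
    """FOFE code of a word-id sequence: z_t = alpha * z_{t-1} + e_t, z_0 = 0."""
    z = np.zeros(vocab_size)
    for w in word_ids:
        e = np.zeros(vocab_size)
        e[w] = 1.0                      # one-hot vector of the current word
        z = alpha * z + e               # the recursion from the paper
    return z

# Vocabulary {a: 0, b: 1, c: 2}, alpha = 0.5 (an arbitrary demo value):
print(fofe_encode([0, 1, 2], 3, 0.5))        # abc   -> [0.25, 0.5, 1.0]   = [a^2, a, 1]
print(fofe_encode([0, 1, 2, 1, 2], 3, 0.5))  # abcbc -> [0.0625, 0.625, 1.25]
```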

The FOFE code has two useful theoretical properties (property 1 is illustrated by the decoding sketch below):
1. If 0 < \alpha \le 0.5, FOFE is unique for any K and T.
2. If 0.5 < \alpha < 1, FOFE is unique for almost all K and T, with only a finite set of \alpha values as exceptions.
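Property 1 can be made tangible with a greedy decoder: for \alpha \le 0.5, any finite sum of the strictly smaller powers \alpha^1 + \alpha^2 + ... stays below 1, so the component containing the \alpha^0 = 1 term of the most recent word is the only one that reaches 1, and the whole sequence can be peeled off from the code. This is my own illustrative sketch reusing the hypothetical fofe_encode above, not code from the paper:

```python
def fofe_decode(z, alpha, eps=1e-9):
    """Recover the word sequence from a FOFE code; valid for 0 < alpha <= 0.5."""
    z = z.copy()
    words = []
    while z.max() >= 1.0 - eps:
        w = int(np.argmax(z))   # only the most recent word's entry can reach 1
        words.append(w)
        z[w] -= 1.0             # remove its alpha^0 = 1 contribution
        z /= alpha              # shift all remaining powers of alpha down by one
    return words[::-1]          # words were recovered last-to-first

code = fofe_encode([0, 1, 2, 1, 2], 3, 0.5)
print(fofe_decode(code, 0.5))   # -> [0, 1, 2, 1, 2], i.e. "abcbc"
```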


Model

The traditional neural probabilistic language model (Bengio et al.) takes one-hot vectors at the input layer; a word-embedding matrix maps each of them to a low-dimensional real-valued vector (of dimension m, say, for an n-gram model); the embeddings of the preceding n-1 words are concatenated into an m(n-1)-dimensional vector, which is then fed through the hidden layer to produce the output layer.
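For concreteness, a shape-level sketch of this input pipeline (all sizes and names here are assumptions for the demo, not values from the paper):

```python
import numpy as np

K, m, n = 10000, 200, 4            # vocab size, embedding dim, n-gram order (assumed)
C = 0.01 * np.random.randn(K, m)   # word-embedding / projection matrix

def bengio_input(context_ids):
    """Concatenate the embeddings of the n-1 context words: an m*(n-1) vector."""
    return np.concatenate([C[w] for w in context_ids])

x = bengio_input([5, 17, 42])      # n-1 = 3 preceding words
print(x.shape)                     # (600,) = m * (n-1)
```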

In this paper, the authors' change is made at the input layer: they replace the one-hot vectors of the original input layer with FOFE codes. When the preceding n-1 words are encoded with FOFE, the influence of earlier words on the final code decays gradually, so words closer to the target word affect its prediction more strongly. Using FOFE also reduces the dimensionality of the vector produced at the projection layer: for a 1st-order FOFE FNN-LM, the projection layer has dimension m (the word-embedding dimension). This does not, however, reduce the overall complexity.
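Contrast this with the FOFE input: projecting the FOFE code through the same embedding matrix amounts to an \alpha-weighted sum of the context embeddings, so the projection output is m-dimensional rather than m(n-1)-dimensional. A sketch reusing the assumed fofe_encode, C, and K from above (\alpha = 0.7 is an arbitrary demo value):

```python
def fofe_input(context_ids, alpha):
    """1st-order FOFE input: project the FOFE code of the n-1 context words."""
    z = fofe_encode(context_ids, K, alpha)   # K-dimensional FOFE code
    return z @ C                             # alpha-weighted sum of embeddings

x = fofe_input([5, 17, 42], 0.7)
print(x.shape)                               # (200,) = m, regardless of n
```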

Experiment

The authors carried out comparative experiments on two datasets:
1. The Penn Treebank (PTB) corpus: about 1,000,000 words, with a vocabulary size of 10,000.
2. The Large Text Compression Benchmark (LTCB), from which the authors use the enwik9 dataset: the first 10^9 bytes of enwiki-20060303-pages-articles.xml, split into a 153M training set, an 8.9M validation set, and an 8.9M test set, with a vocabulary size of 80,000; words outside the vocabulary are replaced by an <UNK> tag.

The metric generally used to evaluate a language model is perplexity. The basic idea is that a language model that assigns higher probability to the sentences of the test set is better: the test-set sentences are real, well-formed sentences, so the higher the probability the trained model assigns to them, the better the model. The specific formula is as follows:
PP(W) = P(w_1 w_2 \cdots w_T)^{-1/T}
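A minimal sketch of the computation (the per-word probabilities are made-up numbers; this is the standard definition, not code from the paper):

```python
import numpy as np

def perplexity(word_probs):
    """PP = exp(-(1/T) * sum(log p_t)) = (prod p_t)^(-1/T)."""
    logs = np.log(np.asarray(word_probs))
    return float(np.exp(-logs.mean()))

print(perplexity([0.1, 0.2, 0.05, 0.1]))   # lower perplexity = better model
```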
