CSC321: Neural Network Language Model, RNN and LSTM


Two main areas

    1. Probabilistic modeling

      A neural network language model tries to predict a probability distribution over the next word (see the softmax sketch after this list).

    2. Cross-entropy as the error function

      Using cross-entropy, we can assign a higher probability to the observed data,

      and at the same time it avoids the saturation problem.

    3. The dimensionality-reduction effect of the linear hidden layer mentioned earlier (it reduces the number of training parameters).
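As a quick sketch of what "predicting a distribution" means (standard notation, not taken verbatim from the course slides): given scores z for every word in the vocabulary V, the model outputs a softmax distribution, and training minimizes the cross-entropy of the observed next word:

p(w_t = i \mid \text{context}) = \frac{\exp(z_i)}{\sum_{j \in V} \exp(z_j)}

C = -\log p(w_t = \text{observed word} \mid \text{context})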


This is an initial version of the neural network language model.


    1. Which loss function to choose: why use cross-entropy rather than squared loss?

First, you can see that cross-entropy better reflects the true difference between two predictions such as 0.01 and 0.0001.

The other point is saturation.

Consider a sigmoid activation unit with output y,

and consider how the cost depends on the pre-activation z.


Because the cross-entropy loss (with a sigmoid or softmax output layer) is convex in the output-layer weights, there is no local optimum, only the global optimum, so it is easier to optimize.
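To make the saturation point concrete, here is the standard comparison for a single sigmoid unit (notation chosen here: y = sigmoid(z), target t):

C_{sq} = \tfrac{1}{2}(y - t)^2, \qquad \frac{\partial C_{sq}}{\partial z} = (y - t)\,y(1 - y)

C_{ce} = -t\log y - (1 - t)\log(1 - y), \qquad \frac{\partial C_{ce}}{\partial z} = y - t

With squared error the gradient carries the factor y(1 - y), which goes to zero when the unit saturates near 0 or 1; with cross-entropy that factor cancels, so learning does not stall.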


  1. Linear hidden layer unit

    The lookup is equivalent to an embedding matrix: R is the embedding matrix, or lookup table.

    So how does this achieve dimensionality reduction, i.e. how does it reduce the number of training parameters?

    The embedding reduces the dimensionality (a rough parameter count follows).
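    As a rough parameter count (the symbols here are chosen for illustration: vocabulary size |V|, embedding dimension d, context length n, hidden size H):

    \text{params}_{direct} = n \cdot |V| \cdot H \quad \text{(n one-hot context words connected straight to the hidden layer)}

    \text{params}_{embed} = |V| \cdot d + n \cdot d \cdot H, \qquad d \ll |V| \quad \text{(a shared lookup table R, then the hidden layer)}

    Because the lookup table is shared across the n context positions and d is much smaller than |V|, the second count is far smaller.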

  2. Limitations of the current neural network language model

    This language model is essentially the continuous bag-of-words (CBOW) side of word2vec, the counterpart of the skip-gram model.

    Skip-gram predicts the surrounding words from a single word, while CBOW predicts the center word from the surrounding words; a language model specifically predicts the current word from the previous few words.

    Unfortunately, like an n-gram model it follows the Markov assumption and can use only the previous few words, not any longer history.


    But sometimes the long-distance context is meaningful.


  3. RNN Model

    The RNN model can address the problem above: it can learn long-distance dependencies.


    This is a simple RNN example in which the inputs are summed; a rough sketch of a single RNN step follows.
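    The following is only an illustrative sketch of the recurrence h_t = tanh(W_hh * h_{t-1} + W_xh * x_t); the function name, sizes and weights are made up for the example, not taken from the course:

    #include <cmath>
    #include <vector>

    // One step of a plain (Elman) RNN. The same W_hh and W_xh are reused at every time step.
    std::vector<double> RnnStep(const std::vector<std::vector<double>>& W_hh,
                                const std::vector<std::vector<double>>& W_xh,
                                const std::vector<double>& h_prev,
                                const std::vector<double>& x_t) {
      std::vector<double> h_t(h_prev.size(), 0.0);
      for (size_t i = 0; i < h_t.size(); ++i) {
        double z = 0.0;
        for (size_t j = 0; j < h_prev.size(); ++j) z += W_hh[i][j] * h_prev[j];  // recurrent part
        for (size_t j = 0; j < x_t.size(); ++j)    z += W_xh[i][j] * x_t[j];     // input part
        h_t[i] = std::tanh(z);                                                   // nonlinearity
      }
      return h_t;
    }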

  4. Training RNNs

    Training uses the same backprop algorithm as the ordinary neural networks seen before, but there are two new issues:

    • Weight constraints (the weights are tied across time steps)
    • Exploding and vanishing gradients


5.1 About the weight constraints

That is, the weights of a unit are constrained to be the same at every time step.

Take a hidden-to-hidden weight as an example (a sketch follows).
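Written out (a standard statement of backprop with tied weights; the notation is chosen here), the unrolled copies of the weight are kept equal by summing (or averaging) their gradients:

w^{(1)} = w^{(2)} = \dots = w^{(T)} = w \quad\Longrightarrow\quad \frac{\partial E}{\partial w} = \sum_{t=1}^{T} \frac{\partial E}{\partial w^{(t)}}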


For a concrete example of an RNNLM implementation, see http://www.cnblogs.com/rocketfan/p/4953290.html for a diagram and introduction to rnnlm.


5.2 About exploding and vanishing gradients


The real problem is not backprop itself; the difficulty is that long-distance dependencies are complex, and gradient explosion and vanishing easily build up during backprop as factors are multiplied together across time steps.

Factors greater than 1 that are multiplied in over and over lead to exploding gradients, and factors smaller than 1 lead to vanishing gradients.
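In the simplest scalar picture (ignoring the derivative of the nonlinearity), backpropagating through T time steps multiplies in the recurrent weight w a total of T times:

\frac{\partial h_T}{\partial h_0} \approx w^{T}, \qquad |w| > 1 \Rightarrow \text{explodes as } T \text{ grows}, \quad |w| < 1 \Rightarrow \text{vanishes as } T \text{ grows}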


Ways to deal with exploding and vanishing gradients in RNNs:

    1. LSTM
    2. Reverse the input or output sequence, so the network sees a short-range dependency before it has to learn the difficult long-range ones
    3. Gradient clipping (truncation)

The third method is the one adopted by RNNLM and FASTER-RNNLM: gradients are forcibly truncated to avoid gradient explosions (a sketch follows).
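A minimal sketch of such clipping, assuming an element-wise cap at a threshold (the function name and threshold are made up for illustration; real implementations may instead rescale by the gradient norm):

#include <algorithm>
#include <vector>

// Clip each gradient component into [-max_grad, max_grad] before the weight update.
void ClipGradients(std::vector<double>* grad, double max_grad = 15.0) {
  for (double& g : *grad) {
    g = std::max(-max_grad, std::min(max_grad, g));
  }
}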


    1. LSTM

The LSTM solves this problem by replacing the single unit with a more complex memory cell.


TensorFlow examples of LSTM

https://github.com/jikexueyuanwiki/tensorflow-zh/blob/master/SOURCE/tutorials/recurrent/index.md

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

It is mentioned there that an RNN can use historical information when the distance is short, but it is powerless when the distance becomes longer.

A short-distance example: predicting "sky".

A long-distance example: predicting "French".


The diagram in that post is very clear: in a common RNN, the repeating module is just a simple nonlinear element such as a sigmoid or tanh, for example:

struct SigmoidActivation : public IActivation {
  void Forward(Real* hidden, int size) {
    for (int i = 0; i < size; i++) {
      hidden[i] = exp(hidden[i]) / (1 + exp(hidden[i]));  // sigmoid(x) = e^x / (1 + e^x)
    }
  }
};


The LSTM replaces the single neural network layer with four layers that interact in a special way.


The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.


That is, the LSTM can remove or add information to the cell state, controlled by its gates.


The first step is to decide what information to discard: the "forget gate" layer.

It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents "completely keep this" while a 0 represents "completely get rid of this."


By combining the previous output and the current input, it produces a value between 0 and 1 (a sigmoid): 1 means keep everything, 0 means discard everything.
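In the standard notation of the post linked above, the forget gate is:

f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)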

For example, in a language model the cell state may carry the gender of the current subject, so that the right pronoun ("she" or "he") can be chosen.

When we encounter a new subject, we want to forget the gender of the previous subject.

The second step is to decide what new information to store.

In the language-model example, this is where we add the gender of the new subject.

Two pieces are involved: a sigmoid "input gate" layer plus a tanh layer that proposes candidate values.
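In the same notation, the input gate and the candidate values are:

i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)

\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)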


Step three: combine the first two steps to update the cell state, discarding the old gender information and adding the current gender information.
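In the same notation, the new cell state is:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t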


Step four: the final output.

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.
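In the same notation, the output gate and the hidden output are:

o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)

h_t = o_t * \tanh(C_t)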
