CSC321: Neural Network Language Model, RNN and LSTM


Two main areas

    1. Probabilistic modeling

      A neural network language model tries to predict a probability distribution over the next word (see the softmax sketch after this list).

    2. Cross-entropy as the error function

      Using cross-entropy, we can assign a higher probability to the observed data,

      and at the same time it avoids the saturation problem.

    3. The dimensionality-reduction effect of the linear hidden layer mentioned earlier (it reduces the number of training parameters).
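As a quick sketch of what "predicting a distribution" means (standard notation, not taken verbatim from the course slides): given scores z for every word in the vocabulary V, the model outputs a softmax distribution, and training minimizes the cross-entropy of the observed next word:

p(w_t = i \mid \text{context}) = \frac{\exp(z_i)}{\sum_{j \in V} \exp(z_j)}

C = -\log p(w_t = \text{observed word} \mid \text{context})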


This is an initial version of the neural network language model.


    1. Which loss function to choose: why use cross-entropy rather than squared loss?

First, you can see that cross-entropy better reflects the true difference between two predictions such as 0.01 and 0.0001.

The other point is saturation.

Consider a sigmoid activation unit with output y,

and consider how the cost depends on the pre-activation z.


Because the cross-entropy loss (with a sigmoid or softmax output layer) is convex in the output-layer weights, there is no local optimum, only the global optimum, so it is easier to optimize.
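To make the saturation point concrete, here is the standard comparison for a single sigmoid unit (notation chosen here: y = sigmoid(z), target t):

C_{sq} = \tfrac{1}{2}(y - t)^2, \qquad \frac{\partial C_{sq}}{\partial z} = (y - t)\,y(1 - y)

C_{ce} = -t\log y - (1 - t)\log(1 - y), \qquad \frac{\partial C_{ce}}{\partial z} = y - t

With squared error the gradient carries the factor y(1 - y), which goes to zero when the unit saturates near 0 or 1; with cross-entropy that factor cancels, so learning does not stall.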


  1. Linear hidden layer unit

    The lookup is equivalent to an embedding matrix: R is the embedding matrix, or lookup table.

    So how does this achieve dimensionality reduction, i.e. how does it reduce the number of training parameters?

    The embedding reduces the dimensionality (a rough parameter count follows).
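    As a rough parameter count (the symbols here are chosen for illustration: vocabulary size |V|, embedding dimension d, context length n, hidden size H):

    \text{params}_{direct} = n \cdot |V| \cdot H \quad \text{(n one-hot context words connected straight to the hidden layer)}

    \text{params}_{embed} = |V| \cdot d + n \cdot d \cdot H, \qquad d \ll |V| \quad \text{(a shared lookup table R, then the hidden layer)}

    Because the lookup table is shared across the n context positions and d is much smaller than |V|, the second count is far smaller.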

  2. Limitations of the current neural network language model

    This language model is essentially the continuous bag-of-words (CBOW) side of word2vec, the counterpart of the skip-gram model.

    Skip-gram predicts the surrounding words from a single word, while CBOW predicts the center word from the surrounding words; a language model specifically predicts the current word from the previous few words.

    Unfortunately, like an n-gram model it follows the Markov assumption and can use only the previous few words, not any longer history.


    But sometimes the long-distance context is meaningful.


  3. RNN Model

    The RNN model can address the problem above: it can learn long-distance dependencies.


    This is a simple RNN example in which the inputs are summed; a rough sketch of a single RNN step follows.
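    The following is only an illustrative sketch of the recurrence h_t = tanh(W_hh * h_{t-1} + W_xh * x_t); the function name, sizes and weights are made up for the example, not taken from the course:

    #include <cmath>
    #include <vector>

    // One step of a plain (Elman) RNN. The same W_hh and W_xh are reused at every time step.
    std::vector<double> RnnStep(const std::vector<std::vector<double>>& W_hh,
                                const std::vector<std::vector<double>>& W_xh,
                                const std::vector<double>& h_prev,
                                const std::vector<double>& x_t) {
      std::vector<double> h_t(h_prev.size(), 0.0);
      for (size_t i = 0; i < h_t.size(); ++i) {
        double z = 0.0;
        for (size_t j = 0; j < h_prev.size(); ++j) z += W_hh[i][j] * h_prev[j];  // recurrent part
        for (size_t j = 0; j < x_t.size(); ++j)    z += W_xh[i][j] * x_t[j];     // input part
        h_t[i] = std::tanh(z);                                                   // nonlinearity
      }
      return h_t;
    }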

  4. Training RNNs

    Training uses the same backprop algorithm as the ordinary neural networks seen before, but there are two new issues:

    • Weight constraints (the weights are tied across time steps)
    • Exploding and vanishing gradients


5.1 About the weight constraints

That is, the weights of a unit are constrained to be the same at every time step.

Take a hidden-to-hidden weight as an example (a sketch follows).
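Written out (a standard statement of backprop with tied weights; the notation is chosen here), the unrolled copies of the weight are kept equal by summing (or averaging) their gradients:

w^{(1)} = w^{(2)} = \dots = w^{(T)} = w \quad\Longrightarrow\quad \frac{\partial E}{\partial w} = \sum_{t=1}^{T} \frac{\partial E}{\partial w^{(t)}}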


For a concrete example of an RNNLM implementation, see http://www.cnblogs.com/rocketfan/p/4953290.html for a diagram and introduction to rnnlm.


5.2 About exploding and vanishing gradients


The real problem is not backprop itself; the difficulty is that long-distance dependencies are complex, and gradient explosion and vanishing easily build up during backprop as factors are multiplied together across time steps.

Factors greater than 1 that are multiplied in over and over lead to exploding gradients, and factors smaller than 1 lead to vanishing gradients.
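In the simplest scalar picture (ignoring the derivative of the nonlinearity), backpropagating through T time steps multiplies in the recurrent weight w a total of T times:

\frac{\partial h_T}{\partial h_0} \approx w^{T}, \qquad |w| > 1 \Rightarrow \text{explodes as } T \text{ grows}, \quad |w| < 1 \Rightarrow \text{vanishes as } T \text{ grows}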


Ways to deal with exploding and vanishing gradients in RNNs:

    1. LSTM
    2. Reverse the input or output sequence, so the network sees a short-range dependency before it has to learn the difficult long-range ones
    3. Gradient clipping (truncation)

The third method is the one adopted by RNNLM and FASTER-RNNLM: gradients are forcibly truncated to avoid gradient explosions (a sketch follows).
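A minimal sketch of such clipping, assuming an element-wise cap at a threshold (the function name and threshold are made up for illustration; real implementations may instead rescale by the gradient norm):

#include <algorithm>
#include <vector>

// Clip each gradient component into [-max_grad, max_grad] before the weight update.
void ClipGradients(std::vector<double>* grad, double max_grad = 15.0) {
  for (double& g : *grad) {
    g = std::max(-max_grad, std::min(max_grad, g));
  }
}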


    1. LSTM

The LSTM solves this problem by replacing the single unit with a more complex memory cell.


TensorFlow examples of LSTM

https://github.com/jikexueyuanwiki/tensorflow-zh/blob/master/SOURCE/tutorials/recurrent/index.md

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

It is mentioned there that an RNN can use historical information when the distance is short, but it is powerless when the distance becomes longer.

A short-distance example: predicting "sky".

A long-distance example: predicting "French".


The diagram in that post is very clear: in a common RNN, the repeating module is just a simple nonlinear element such as a sigmoid or tanh, for example:

struct SigmoidActivation : public IActivation {
  void Forward(Real* hidden, int size) {
    for (int i = 0; i < size; i++) {
      hidden[i] = exp(hidden[i]) / (1 + exp(hidden[i]));  // sigmoid(x) = e^x / (1 + e^x)
    }
  }
};


The LSTM replaces the single neural network layer with four layers that interact in a special way.


The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.


That is, the LSTM can remove or add information to the cell state, controlled by its gates.


The first step is to decide what information to discard: the "forget gate" layer.

It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents "completely keep this" while a 0 represents "completely get rid of this."


By combining the previous output and the current input, it produces a value between 0 and 1 (a sigmoid): 1 means keep everything, 0 means discard everything.
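In the standard notation of the post linked above, the forget gate is:

f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)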

For example, in a language model the cell state may carry the gender of the current subject, so that the right pronoun ("she" or "he") can be chosen.

When we encounter a new subject, we want to forget the gender of the previous subject.

The second step is to decide what new information to store.

In the language-model example, this is where we add the gender of the new subject.

Two pieces are involved: a sigmoid "input gate" layer plus a tanh layer that proposes candidate values.
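In the same notation, the input gate and the candidate values are:

i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)

\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)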


Step three: combine the first two steps to update the cell state, discarding the old gender information and adding the current gender information.
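In the same notation, the new cell state is:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t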


Step four: the final output.

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.
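In the same notation, the output gate and the hidden output are:

o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)

h_t = o_t * \tanh(C_t)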
