Two main areas
- Probabilistic modeling
In probabilistic modeling, the neural network language model tries to predict a probability distribution over the next word.
- With cross-entropy as the error function, we can give the observed data a higher probability and at the same time avoid the saturation problem.
- The dimensionality-reduction effect of the linear hidden layer mentioned earlier (it reduces the number of training parameters).
This is an early version of the neural network language model.
- Which loss function to choose: why use cross-entropy rather than squared loss?
First, cross-entropy better reflects the real difference between two predictions such as 0.01 and 0.0001.
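A quick numeric check (my own arithmetic, assuming the word being scored is the correct one): squared error barely separates the two predictions, while cross-entropy separates them clearly:
squared error: (1 − 0.01)² ≈ 0.98 vs. (1 − 0.0001)² ≈ 1.00
cross-entropy: −ln 0.01 ≈ 4.6 vs. −ln 0.0001 ≈ 9.2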
The other point is saturation. Consider a sigmoid activation unit with output y, and look at how the cost depends on the unit's input z.
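A sketch of the standard argument (my addition, in generic notation): with y = σ(z) and target t,
squared error: ∂C/∂z = (y − t) · σ′(z), and σ′(z) ≈ 0 once the sigmoid saturates, so learning stalls
cross-entropy: ∂C/∂z = y − t, the σ′(z) factor cancels, so the gradient stays large as long as the prediction is wrong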
In addition, cross-entropy is convex, so there is no local optimum, only the global optimum, which makes it easier to optimize.
- Linear hidden layer unit
The lookup is equivalent to multiplying by the embedding matrix; R is the embedding matrix, i.e. the lookup table.
So how does this give a dimensionality-reduction effect and reduce the number of training parameters?
Dimensionality reduction via the embedding.
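A rough parameter count to illustrate (my own example numbers, not from the lecture): with a vocabulary of V = 100,000 words, a context of n = 3 previous words, and H = 500 hidden units,
without the linear layer: connecting the n one-hot inputs directly to the hidden layer costs n · V · H = 1.5 × 10^8 weights
with a shared embedding of dimension D = 100: V · D + n · D · H = 10^7 + 1.5 × 10^5 ≈ 10^7 weights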
- Limitations of this neural network language model
This language model is essentially the continuous bag-of-words (CBOW) model, the counterpart of word2vec's skip-gram model.
Skip-gram in word2vec predicts the surrounding words from one word, while CBOW predicts the center word from the surrounding words; concretely, the language model predicts the current word from the preceding several words.
The limitation is that, like an n-gram model, it follows the Markov assumption and cannot use more information than the previous few words.
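In formula form, the Markov/n-gram assumption is (standard notation, my addition):
P(w_t | w_1, …, w_(t−1)) ≈ P(w_t | w_(t−n+1), …, w_(t−1))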
But sometimes the long-distance context does carry useful information.
- RNN Model
The RNN model can address this problem: it can learn long-distance dependencies.
This is a simple example of an RNN.
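For reference, the standard update of a simple RNN, written in generic notation (my addition):
h_t = tanh(W_xh · x_t + W_hh · h_(t−1) + b_h)
y_t = softmax(W_hy · h_t + b_y)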
- RNN training
Training uses the same backprop algorithm as an ordinary neural network, but two new issues arise:
- Weight constraints
- Exploding and vanishing gradients
5.1 About the weight constraints
That is, the weights of the units are constrained (tied) to be the same at every time step.
For example, the hidden-to-hidden weights are shared across all time steps.
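The way tied weights are handled in training (standard argument, my wording): compute the gradient for each time-step copy of the shared weight as usual, then update every copy with their sum, so the copies stay equal:
∂E/∂w = Σ_t ∂E/∂w^(t)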
For a concrete example of an RNNLM implementation, see http://www.cnblogs.com/rocketfan/p/4953290.html for a diagram and introduction to rnnlm.
5.2 About exploding and vanishing gradients
The real problem is not backprop itself; rather, long-distance dependencies are complex, and during backprop the per-step effects accumulate multiplicatively, so gradient explosion and vanishing occur easily.
Factors repeatedly greater than 1 lead to exploding gradients, while factors repeatedly less than 1 lead to vanishing gradients.
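In rough terms (my own sketch of the standard argument): backpropagating through T time steps multiplies T factors together, so if each step scales the gradient by roughly λ, the overall gradient behaves like λ^T:
λ > 1 → λ^T grows exponentially (explosion)
λ < 1 → λ^T shrinks exponentially (vanishing)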
Solutions for RNN gradient explosion and vanishing:
- LSTM
- Reverse the input or output sequence so that the network sees short-range dependencies before it has to learn the difficult long-range ones
- Gradient truncation
The third method is the one adopted by RNNLM and Faster-RNNLM: gradients are forcibly truncated to avoid gradient explosion.
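As an illustration, a minimal sketch of element-wise gradient truncation in C++, in the spirit of rnnlm/Faster-RNNLM but not copied from either codebase (the function name and the threshold are my own):

// Clamp every component of the gradient into [-threshold, threshold]
// before applying the weight update, so that a single huge gradient
// cannot blow up the weights.
void ClipGradient(float* grad, int size, float threshold) {
  for (int i = 0; i < size; i++) {
    if (grad[i] > threshold) grad[i] = threshold;
    else if (grad[i] < -threshold) grad[i] = -threshold;
  }
}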
- LSTM
The LSTM solves this problem by replacing the single unit with a more complex memory cell.
TensorFlow example of an LSTM:
https://github.com/jikexueyuanwiki/tensorflow-zh/blob/master/SOURCE/tutorials/recurrent/index.md
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
It is mentioned there that an RNN can make use of recent history, but the RNN is powerless when the relevant context is further away.
A short-distance example: predicting the word "sky" (the clue, "clouds", is nearby).
A long-distance example: predicting the word "French" (the clue, "I grew up in France", appears much earlier).
The diagrams in that post make this very clear. In an ordinary RNN, the repeating module contains just a single simple nonlinear unit such as a sigmoid or tanh; the corresponding implementation looks like this:
// Sigmoid activation applied in place to the hidden layer
// (excerpt in the style of Faster-RNNLM's IActivation interface).
struct SigmoidActivation : public IActivation {
  void Forward(Real* hidden, int size) {
    for (int i = 0; i < size; i++) {
      hidden[i] = exp(hidden[i]) / (1 + exp(hidden[i]));  // sigmoid(x) = e^x / (1 + e^x)
    }
  }
};
In an LSTM, the single neural network layer of the repeating module is replaced by four layers that interact in a special way.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
That is, the LSTM can remove or add information to the cell state under the control of the gates.
The first step is to decide what information to discard from the cell state.
This is done by the "forget gate layer":
It looks at h_(t−1) and x_t, and outputs a number between 0 and 1 for each number in the cell state C_(t−1). A 1 represents "completely keep this" while a 0 represents "completely get rid of this."
By combining the previous output h_(t−1) and the current input x_t, the sigmoid outputs a value between 0 and 1: 1 means keep everything, 0 means discard everything.
For example, in a language model, the cell state may carry the gender of the current subject so that the correct pronoun (she or he) can be chosen.
When a new subject appears, we want to forget the gender of the previous subject.
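In equation form (the standard LSTM forget gate, following the notation of the colah post linked above):
f_t = σ(W_f · [h_(t−1), x_t] + b_f)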
The second step is to decide what new information to store in the cell state.
In the language model example, this is where we add the gender of the new subject.
This has two parts: a sigmoid "input gate layer" plus a tanh layer that creates a vector of candidate values.
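In equation form (same notation):
i_t = σ(W_i · [h_(t−1), x_t] + b_i)
C̃_t = tanh(W_C · [h_(t−1), x_t] + b_C)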
Step three: combine the first two steps to update the cell state, dropping the old gender information and adding the new gender information.
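In equation form (same notation):
C_t = f_t ∗ C_(t−1) + i_t ∗ C̃_t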
Step four: the final output.
Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what's coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.
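In equation form (same notation):
o_t = σ(W_o · [h_(t−1), x_t] + b_o)
h_t = o_t ∗ tanh(C_t)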
CSC321 Neural Network language model RNN-LSTM