July Algorithm December Machine Learning Online Class --- 20th Lesson Notes --- Deep Learning: RNN
July Algorithm (julyedu.com), December Machine Learning Online Class study notes: http://www.julyedu.com
- Recurrent neural networks (RNN)
Review of prerequisite knowledge points:
Fully connected feedforward network: what it learns is a function
Convolutional network: convolution operations, local connections, shared weights, layer-by-layer extraction of features from the raw input (images, speech, NLP)
Characteristics it exploits when learning:
Local correlation
A shallow but wide network is hard to make work well as a (deep) neural network
1.1 States and Models
1. IID data
• Classification problems
• Regression problems
• Feature representation
2. Most data do not satisfy the IID assumption
• Sequence analysis (Tagging, Annotation)
• Sequence generation, such as language translation, automatic text generation
• Content extraction, e.g., image description (captioning)
This requires feeding the previous state into the current layer.
1.2 Sequence Samples
1. Input-output mapping relationships (applications of sequences)
A. One-to-one: ordinary neural networks, no recurrence
B. One-to-many: image captioning (describing a picture)
C. Many-to-one: sentiment classification
D. Many-to-many: language translation
E. Synchronized sequence-to-sequence, with an output at every input step (l/r/u/d)
• An RNN can not only take a sequence as input but also produce a sequence as output; here a sequence means a sequence of vectors.
• What an RNN learns is a program (a computation with state), not just a function.
1.3 Sequence Prediction
• The input is a sequence of time-varying vectors: ..., x_{t-2}, x_{t-1}, x_t
• The model's estimate at time t is made from the history, e.g. p(x_t | x_{t-1}, ..., x_1)
Problems
• The internal state is difficult to model and to observe
• Long time-range dependencies (context) are difficult to model and to observe
• Solution: introduce an internal hidden state variable
h_t: the internal state corresponding to position (time step) t
1.4 Sequence Prediction Model
• Input: a discrete sequence x_1, x_2, ..., x_T
• Update computation at time t: h_t = f_W(h_{t-1}, x_t)
The two diagrams (rolled-up and unrolled views) are equivalent: the hidden state h_{t-1} from the previous step is combined with the current input x_t to produce h_t and the output.
• Prediction computation: y_t = g_W(h_t) (see the sketch after this list)
- W remains the same (is shared) throughout the computation
- h is initialized to zeros at time 0
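To make the update and prediction computations concrete, here is a minimal NumPy sketch of the forward pass. The names W_xh, W_hh, W_hy, the tanh activation, and the sizes are assumptions for illustration, not the course's code.

```python
import numpy as np

# Minimal forward-pass sketch (assumed names W_xh, W_hh, W_hy and tanh activation):
# update:     h_t = tanh(W_xh x_t + W_hh h_{t-1})
# prediction: y_t = W_hy h_t
D, H, O = 8, 16, 4                               # input, hidden, output sizes (arbitrary)
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, D))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(O, H))

def rnn_forward(xs):
    h = np.zeros(H)                              # h initialized to zeros at time 0
    ys = []
    for x in xs:                                 # the same W matrices are reused at every step
        h = np.tanh(W_xh @ x + W_hh @ h)
        ys.append(W_hy @ h)
    return ys, h

ys, h_T = rnn_forward([rng.normal(size=D) for _ in range(5)])
```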
1.4 RNN Training (1)
1. In the forward computation, the same W matrix is multiplied in repeatedly (once per time step)
2. Inputs x from many steps earlier affect the current output
3. In the backward computation, the same matrix is again multiplied repeatedly
1.4.1 BPTT algorithm: Backpropagation Through Time
1. RNN forward computation
2. To compute the partial derivative with respect to W, the contributions from all time steps must be summed; the loss function has the same form at every step
3. Apply the chain rule (see the sketch after this list)
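A minimal BPTT sketch for a vanilla RNN, using the same assumed names as the forward sketch above and a simple per-step squared-error loss (an assumption for illustration; the lecture does not specify the loss here). Note how gradients from every time step are summed into the same W matrices, and how W_hh^T is applied once per step going backward.

```python
import numpy as np

# BPTT sketch. Forward: h_t = tanh(W_xh x_t + W_hh h_{t-1}),  y_t = W_hy h_t
# Assumed loss: L = sum_t 0.5 * ||y_t - target_t||^2
def bptt(xs, targets, W_xh, W_hh, W_hy):
    T = len(xs)
    hs = {-1: np.zeros(W_hh.shape[0])}
    ys = {}
    for t in range(T):                            # forward pass, same W reused each step
        hs[t] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t - 1])
        ys[t] = W_hy @ hs[t]

    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(W_hh.shape[0])
    for t in reversed(range(T)):                  # backward pass through time
        dy = ys[t] - targets[t]                   # dL_t / dy_t for squared error
        dW_hy += np.outer(dy, hs[t])
        dh = W_hy.T @ dy + dh_next                # gradient arriving at h_t
        dh_raw = (1.0 - hs[t] ** 2) * dh          # backprop through tanh
        dW_xh += np.outer(dh_raw, xs[t])          # gradients from every step are summed
        dW_hh += np.outer(dh_raw, hs[t - 1])
        dh_next = W_hh.T @ dh_raw                 # the same W_hh^T applied at each step
    return dW_xh, dW_hh, dW_hy
```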
1.4.2 BPTT algorithm: Computational implementation
Apply the chain rule to the objective, using vector/matrix differentiation.
The computation target is a sum over time steps: L = Σ_t L_t.
If the sequence length is 16, W^T gets multiplied 16 times along the time direction, so gradient explosion happens easily, because every step is connected through the same W. In an ordinary feedforward network the weight matrices differ from layer to layer and vary in magnitude, so gradient vanishing occurs but is not as severe. (See the sketch below.)
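A small numerical illustration (made-up numbers) of why repeatedly multiplying by W^T over 16 steps explodes or vanishes, depending on whether W's dominant singular value is above or below 1:

```python
import numpy as np

# Multiply a vector by W^T 16 times, as BPTT does for a length-16 sequence.
rng = np.random.default_rng(0)
H = 16
g = np.ones(H)                                   # stand-in for an upstream gradient
for scale, label in [(1.2, "explodes"), (0.8, "vanishes")]:
    W = scale * np.eye(H) + 0.01 * rng.normal(size=(H, H))
    v = g.copy()
    for _ in range(16):                          # 16 multiplications by W^T
        v = W.T @ v
    print(label, np.linalg.norm(v))              # grows roughly like 1.2**16, shrinks like 0.8**16
```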
Analysis of the gradient vanishing/exploding phenomenon in the BPTT algorithm: the backward pass multiplies by W^T (and the activation's Jacobian) once per time step, so the gradient scales roughly like the dominant singular value of W raised to the number of steps; above 1 it explodes, below 1 it vanishes.
1.4.3 Solutions for the BPTT algorithm
1. Gradient clipping (a clipping sketch follows this list)
2. Initialize W to the identity matrix and replace the tanh activation with ReLU
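A sketch of gradient clipping by global norm, one common form of the clipping mentioned above; the threshold value is an arbitrary choice for illustration.

```python
import numpy as np

# Rescale all gradients when their global norm exceeds a threshold.
def clip_gradients(grads, max_norm=5.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```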
2 LSTM (Long Short-Term Memory) cell: long-term memory ability
It addresses the gradient vanishing and gradient explosion phenomena through its structure, avoiding a single W applied from start to finish, and it has a certain long-term memory ability.
It is the most widely used and most successful RNN variant.
2.1 Cell State
1. It can preserve a state for a long time; the forget gate (the multiplication in the diagram) controls how much of the "old" cell state is kept (see the small illustration after this list)
2. The layer transforms the input dimension x into the output dimension h
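A tiny numerical illustration (assumed values) of how the forget gate's multiplication controls how much "old" cell state survives over many steps:

```python
# The forget gate multiplies the old cell state at every step: f close to 1 preserves the
# old state for a long time, while a small f erases it quickly.
for f in (0.99, 0.5):
    state = 1.0
    for _ in range(100):
        state = f * state                        # repeated forget-gate multiplication
    print(f, state)                              # 0.99 -> about 0.37, 0.5 -> about 8e-31
```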
2.2 Forget/Input Gate
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) and i_t = σ(W_i · [h_{t-1}, x_t] + b_i); the sigmoid σ outputs values in [0, 1], and b is the bias (offset) term.
2.3 Update Cell
C_t = f_t * C_{t-1} + i_t * C̃_t, where the candidate C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
2.4 Output
o_t = σ(W_o · [h_{t-1}, x_t] + b_o), h_t = o_t * tanh(C_t)
In summary, there are four weight matrices: W_f, W_i, W_c, W_o (see the sketch below).
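Putting the four gates together, here is a minimal NumPy sketch of one LSTM step; the variable names, shapes, and the concatenation [h_{t-1}, x_t] follow the standard formulation and are assumptions rather than the course's exact notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step using the four matrices W_f, W_i, W_c, W_o on the concatenation [h_{t-1}, x_t].
def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # forget gate: how much old cell state to keep
    i_t = sigmoid(W_i @ z + b_i)                 # input gate: how much new candidate to write
    c_hat = np.tanh(W_c @ z + b_c)               # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat             # cell state update
    o_t = sigmoid(W_o @ z + b_o)                 # output gate
    h_t = o_t * np.tanh(c_t)                     # new hidden state / output
    return h_t, c_t
```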