These notes were taken while following the July Algorithm May deep learning course. They mainly record the concepts that were fuzzy to me while learning machine learning. For the course itself, see the July Education site:
http://www.julyedu.com/
RNN: using neural networks to model the state of sequence problems
Previously, the data our models dealt with was assumed to be IID: the network does a forward pass with sample A, whether for classification or regression, then a second forward pass with sample B; A and B have nothing to do with each other.
Such a network learns a function: input x, get y.
IID: independent and identically distributed. Each sample is independent of the others.
Much data does not satisfy the IID assumption, for example sequence data: speech, video, images, text, and so on.
Sequence data comes in two kinds: temporal sequences (e.g., speech) and spatial sequences (e.g., images).
Sequence generation, such as language translation and automatic text generation
Content extraction, such as image description
Sequence Samples
Sequence problems can be roughly divided into five types:
The first: one to one (a plain function problem, not a sequence)
The second: one to many
The third: many to one
The fourth: many to many, with an interval (the output sequence begins after the input sequence)
The fifth: many to many (input and output aligned step by step)
The RNN can take not only a sequence as input but also produce a sequence as output; "sequence" here means a sequence of vectors.
What an RNN learns is a program (a state machine), not a function. Typical applications:
https://github.com/karpathy/neuraltalk2
One to many: the input is an image and the output is a sequence of text (at least one of the input and the output is a sequence).
http://vlg.cs.dartmouth.edu/c3d/
Many to one: input a piece of text and classify it (the text can be long);
event detection in video frames (finding a particular shot within a set of video frames).
Many to many, with an interval: language translation.
http://research.microsoft.com/apps/pubs/default.aspx?id=264836
Many to many: describing video frames, automatically generating commentary text for each frame.
Sequence Prediction
The input is a sequence and the output is also a sequence: the same sequence shifted one step ahead. Used to build generative models (e.g., a music generator).
The true mapping F, which needs the whole input history, is often difficult to model. To approximate F, the model g is made to depend on the current input and the previous state (introducing a state variable).
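Written out (this notation is mine, not the course's):

    y_t = F(x_1, x_2, \dots, x_t)                 % needs the whole history, hard to model
    h_t = g(h_{t-1}, x_t), \quad y_t = f(h_t)     % the state h_t summarizes the history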
This is how the problem is explained.
Sequence Prediction Model
The RNN does not build a structured description of the sample; it fits the problem with the number of neurons and the connection weights. Benefit: the problem is handled end to end.
Left figure: the forward pass: input x, apply an operation (sigmoid) to get h(t), then another operation to get the output; the previous state h(t-1) also participates in the operation.
The final output applies a full connection to the newly combined state to get y.
It can also be represented as in the figure on the right:
h0 and x0 can be defined arbitrarily, and the predicted value can be fed back as the next input.
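A minimal NumPy sketch of this forward pass, assuming tanh for the state operation and a plain full connection for the output (the names rnn_forward, W_xh, W_hh, W_hy are mine, not the course's):

    import numpy as np

    def rnn_forward(xs, h0, W_xh, W_hh, W_hy, b_h, b_y):
        """Vanilla RNN: each step combines the current input with the previous state."""
        h, ys = h0, []
        for x in xs:
            h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # h(t-1) participates in the operation
            ys.append(W_hy @ h + b_y)               # full connection on the new state -> y
        return ys, h

    # usage: 4 time steps, 3-dim inputs, 5-dim state, 2-dim outputs
    rng = np.random.default_rng(0)
    xs = [rng.normal(size=3) for _ in range(4)]
    ys, h = rnn_forward(xs, np.zeros(5),
                        rng.normal(size=(5, 3)), rng.normal(size=(5, 5)),
                        rng.normal(size=(2, 5)), np.zeros(5), np.zeros(2))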
RNN Training
Sum the loss values defined at each step (unweighted).
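In symbols (my notation), the total loss is the plain sum of the per-step losses:

    L = \sum_{t=1}^{T} \ell(\hat{y}_t, y_t)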
As the derivation of the formula in the red box of the figure below shows, the chain rule expands the gradient into a long product, and this product can cause problems: if W is less than 1, the product approaches 0; if W is very large, the product grows larger and larger.
This is the vanishing and exploding gradient problem.
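A toy illustration of the repeated multiplication (plain arithmetic, not the actual RNN Jacobian):

    # 100 time steps of repeated multiplication by the same factor
    print(0.9 ** 100)  # ~2.7e-05 -> the gradient vanishes
    print(1.1 ** 100)  # ~1.4e+04 -> the gradient explodes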
The cause of vanishing and exploding gradients here is different from before: here they arise from unrolling along the sequence rather than from the product across the network's depth. If the network itself consists of multiple hidden layers, the problem is aggravated.
BPTT algorithm: solutions
These are rarely used in practice, however, because the chained multiplication by the connection weight W is unavoidable.
Commonly used improved forms of the RNN:
Some gates (thresholds) are set from the input:
LSTM: the most widely used and most successful RNN
A block has two outputs: a new variable, the cell state C, is added. Among the cell-state values, some neurons hold long-term information and some short-term. h is used immediately for the output, while C is passed on step by step; a value in C may be from two steps back or from 100 steps back. That is, some dimensions of C hold a state from long ago and others a very recent state. Each block is called a layer.
LSTM: forget/input unit
h(t-1) and x(t) are the two inputs; the operation on them is multiply-and-add (a weighted sum). f controls how much is forgotten.
LSTM: update cell
f_t: controls the proportion of the previous state that is retained, i.e., whether long-term or short-term memory is used.
i_t: controls by how much c_t is updated, so that the cell state can be adjusted.
LSTM: output
h_t is updated based on o_t.
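Putting the three parts together, the standard textbook LSTM equations (sigma is the sigmoid, * is element-wise multiplication, [h_{t-1}, x_t] is concatenation):

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)           % forget gate
    i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)           % input gate
    \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)    % candidate cell values
    c_t = f_t * c_{t-1} + i_t * \tilde{c}_t          % cell-state update
    o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)           % output gate
    h_t = o_t * \tanh(c_t)                           % h is used immediately for the output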
Overall process
The added summation node interrupts the path of differentiation, so some derivative paths are cut off (the gradient along the cell state escapes the long chain of W multiplications).
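A one-step NumPy sketch of the whole block, following the equations above (the stacked-weight layout and names are my own choice):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, b):
        """One LSTM step; W maps [h_prev, x] to the four stacked gate pre-activations."""
        z = W @ np.concatenate([h_prev, x]) + b
        n = h_prev.size
        f = sigmoid(z[0*n:1*n])        # forget gate
        i = sigmoid(z[1*n:2*n])        # input gate
        c_tilde = np.tanh(z[2*n:3*n])  # candidate cell values
        o = sigmoid(z[3*n:4*n])        # output gate
        c = f * c_prev + i * c_tilde   # the additive path carrying long-term state
        h = o * np.tanh(c)             # used immediately for the output
        return h, c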
LSTM: other variants
Using LSTM
High complexity, difficult to train
Resources:
July algorithm: http://www.julyedu.com/
Images are from the course PPT.