Author: YJango
Link: https://zhuanlan.zhihu.com/p/24720659
Source: Zhihu
Copyright belongs to the author. For commercial reprint, please contact the author for authorization; for non-commercial reprint, please indicate the source.
Everyone seems to call recurrent neural networks "cyclic neural networks" (the usual Chinese translation).
But I learned the term from Wikipedia, so I have been calling them recursive networks. Recurrent Neural Network (RNN) is a general term for two kinds of artificial neural networks: one is the recurrent neural network, which is recursive over time, and the other is the recursive neural network, which is recursive over structure.
The recursive networks mentioned below all refer to recurrent neural networks.
The discussion of recurrent neural networks is divided into three parts:
Description: the differences between recursive networks and feedforward networks, and their pros and cons
Implementation: the vanishing gradient and exploding gradient problems, and the LSTM and GRU that solve them
Code: using TensorFlow to demonstrate the training and use of the network on an actual task
This is the first part:
GitBook original address: Recurrent Neural Networks -- Introduction
There are many animated diagrams; please click to view them. If they do not display, it is recommended to read this at the GitBook address above.
Time-series prediction
Code illustration 3 already showed how to use a feedforward neural network to predict a time-series signal.
First, what is the problem with using a feedforward neural network for time-series signal prediction?
Dependency restriction: the feedforward network uses a window to take the vectors at several different times and concatenate them into one larger vector, which it then uses to predict the output at some moment, as shown in the illustration below:
But the dependencies it can take into account are limited by how many vectors are concatenated together (the window size); the dependencies that can be considered are always of fixed length.
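To make the window treatment concrete, here is a minimal NumPy sketch (the signal, the sizes, and the choice of predicting the frame right after the window are all made up for illustration):

```python
import numpy as np

# Hypothetical time series: 100 frames, each a 39-dimensional vector.
signal = np.random.randn(100, 39)

window = 3  # window size: how many frames are concatenated

# Each training input is the concatenation of `window` consecutive frames;
# the target here is simply the frame that follows the window.
inputs = np.stack([signal[t:t + window].reshape(-1)   # shape (window * 39,)
                   for t in range(len(signal) - window)])
targets = signal[window:]

print(inputs.shape)   # (97, 117)
print(targets.shape)  # (97, 39)
```

However long the true dependency is, the network only ever sees a fixed 117-dimensional vector; anything outside the window is invisible to it.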
Network size: to make better predictions, the network needs to take more dependencies into account.
For example, if you are only given "country ( )" and asked to guess the word in the parentheses, there are many possibilities to consider. But if you are given "China ( )", the range of possibilities shrinks, and when it becomes "I am from China, ( )", the chance of guessing correctly increases further.
So the natural approach is to enlarge the number of concatenated vectors, but this also makes the dimensionality of the input vector, and with it the weight matrix of the first layer of the network, grow rapidly. If each input vector in code illustration 3 has 39 dimensions and 41 frames are processed together, the input dimension becomes 1599, and the weight matrix of the first layer becomes 1599 by n (n being the number of nodes in the next layer). Many of these weights are redundant, yet they have to be kept around for every single prediction.
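A quick sanity check of that arithmetic in plain Python (the hidden size of 512 is made up for illustration):

```python
dim_per_frame = 39
n_hidden = 512  # hypothetical number of nodes in the next layer

for frames in (3, 11, 41):
    dim_input = dim_per_frame * frames          # size of the concatenated window
    first_layer_weights = dim_input * n_hidden  # entries in the first weight matrix
    print(frames, dim_input, first_layer_weights)

# frames=41 gives dim_input=1599, so the first weight matrix is 1599 x 512;
# it keeps growing linearly with every extra frame added to the window.
```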
Number of samples required for training: the first two points may seem harmless, but the key to the recursive (recurrent) network beating the feedforward network at time-series prediction is the amount of data needed to train the network.
Second, the differences between the networks
Almost all neural networks can be seen as specially designed feedforward neural networks, where the role of the "special design" is to shrink the search space of candidate mapping functions; and it is precisely because the search space is smaller that the network can learn a good rule from a relatively small amount of data.
Example: solving the linear equation y = ax + b. We need two (x, y) pairs to solve for a and b. But if we already know, say, that b = 0 (so y = ax), then a single pair is enough to determine the relationship. Variants of neural networks such as recurrent networks and convolutional networks have a similar effect.
So what is the difference between a recurrent neural network and a feedforward neural network?
It should be noted that recursive networks do not have only one structure; described here is one of the most common and effective recursive network structures.
Mathematical perspective
First, let's review the spatial transformation from the input layer to the hidden layer; the difference now is that the time dimension is considered as well.
Note: the circles here no longer represent nodes but states, i.e. the input vector and the vector obtained after passing through the hidden layer.
Feedforward network: a feedforward network with a window of 3 frames after window processing
Animated diagram: the left side shows the network before being unfolded along the time dimension, the right side after unfolding (at any single moment only the gray part actually does work). The nature of the feedforward network makes the predictions at different times completely independent; temporal correlation can only be accounted for through the window treatment.
Mathematical equation: h_t = φ(concat(x_{t-1}, x_t, x_{t+1}) · W + b), where concat means connecting the vectors into one vector of larger dimension.
Learning parameters: W and b need to be learned from a large amount of data.
What requires a lot of data is learning the relations among all the dimensions (39) at all the time steps in the window (3), i.e. 39*3 input dimensions.
Recursive network: there is no longer a window size; instead there are time steps
Animated diagram: the left side is the looped representation before unfolding along the time dimension, where the black box indicates a delay of one time step. After unfolding on the right, you can see that the hidden state at the current moment is determined not only by the input at the current moment but also by the hidden state of the previous moment.
Mathematical equation: h_t = φ(x_t · W + h_{t-1} · W_h + b). The hidden state is again determined by the transformed input information x_t · W,
but here there is one extra piece of information, h_{t-1} · W_h, which is derived from the hidden state of the previous moment through a different transformation W_h.
Note: W has dim_input rows and dim_hidden_state columns, while W_h is a square matrix of size dim_hidden_state by dim_hidden_state.
Learning parameters: the feedforward network needs the 3 time steps to help learn W once, whereas the recursive network can use the same 3 time steps to help learn W and W_h three times. In other words, the weight matrices at all time steps are shared. This is the most prominent advantage of the recursive network over the feedforward network.
A recurrent neural network is the variant of neural network whose time structure has this sharing property.
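As a minimal sketch of this sharing (NumPy, with made-up dimensions; phi stands for the activation φ in the equation above), the same W, W_h and b are reused at every time step:

```python
import numpy as np

dim_input, dim_hidden, time_steps = 39, 64, 5

# One set of parameters, shared by every time step.
W   = np.random.randn(dim_input, dim_hidden) * 0.01    # input -> hidden
W_h = np.random.randn(dim_hidden, dim_hidden) * 0.01   # hidden -> hidden (square matrix)
b   = np.zeros(dim_hidden)

phi = np.tanh  # the nonlinearity

x = np.random.randn(time_steps, dim_input)  # a sequence of input vectors
h = np.zeros(dim_hidden)                    # initial hidden state: all zeros

hidden_states = []
for t in range(time_steps):
    # h_t = phi(x_t · W + h_{t-1} · W_h + b) -- the same weights at every t
    h = phi(x[t] @ W + h @ W_h + b)
    hidden_states.append(h)
```

Because the loop reuses the same parameters, the same code handles a sequence of any length, and every time step contributes evidence for learning the same W and W_h.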
Sharing across the time structure is the core of the recursive network.
Physical perspective
The sharing property brings many benefits to the network, but it also raises another question:
Third, why can the weights be shared?
From the physical perspective, the first point YJango wants to show is why a network that shares the same weights across different times can be used to predict time series.
The images below should give you an intuitive feel for how many of the time-series signals in daily life are generated.
Example 1: trajectory generation. The Earth's trajectory, for instance, is determined by two factors: one is the Earth's rotation, the other is the Earth's revolution around the Sun. The figure below shows the Sun and the other planets. Rotation contributes one part of the motion and revolution the other; together the two determine the actual trajectory.
Example 2: likewise, a spirograph.
Example 3: when music is played, the instrument turns the applied force into the corresponding vibration to produce sound, and the whole performance has a melody that runs through the piece. The physical characteristics of the instrument are the same at every moment, which corresponds to the weights being shared at all times. On top of that there is a hidden melody line; the melody information and the playing information together determine the actual sound at the next moment.
The trajectories and music produced in the above examples are observations we can actually observe, and we often use these observations as a basis for making decisions.
The following example may make it easier to understand how the sharing property affects the amount of data needed.
Example: shaping pottery on a wheel, where different angles correspond to different moments
If a feedforward network is used: training the network is equivalent to not using the turntable at all, but shaping every angle into the desired form by hand. Not only is the workload heavy, the result is also hard to guarantee.
If a recursive network is used: training the network is equivalent to spinning the turntable and shaping all angles with a single gesture. The workload is reduced and the result can be guaranteed.
Characteristics of recursive networks
Variable sequence length: as long as you know the hidden state of the previous moment h_{t-1} and the input at the current moment x_t, you can compute the hidden state of the current moment h_t. And because the computation is shared across all time steps, a recursive network can handle time series of any length.
Consideration of temporal dependence: if the current moment is the 5th frame of the time series, then computing the current hidden state h_5 requires the current input x_5 and the hidden state of the 4th frame h_4; computing h_4 in turn requires h_3, and so on back to the initial moment. This means that a conventional recursive network depends on all past states.
Note: when computing h_1, if h_0 is not specifically specified, the initial hidden state is taken to be all zeros,
and the expression reduces to that of a feedforward neural network: h_1 = φ(x_1 · W + b).
Dependence on future information: the windowed feedforward network also brings in vectors from future moments to constrain the current prediction, whereas a conventional recursive network depends only on past states. One extension of the recursive network is therefore the bidirectional recursive network: two recursive layers running in opposite directions.
Diagram: the forward recursive layer starts computing from the first moment, and the backward recursive layer starts computing from the last moment.
Mathematical equations: forward recursive layer: hf_t = φ(x_t · W_f + hf_{t-1} · W_hf + b_f); backward recursive layer: hb_t = φ(x_t · W_b + hb_{t+1} · W_hb + b_b).
Bidirectional recursive layer: h_t = hf_t + hb_t
Note: there is also a concatenation treatment, i.e. h_t = concat(hf_t, hb_t), but the role of the backward recursive layer is to introduce future information as an additional constraint on the current prediction, not to add more information dimensions. At least in all of my experiments, the addition (sum) approach is usually better than concatenation.
Note: some people also share the weights between the forward and backward recursive layers, and likewise the biases. I have not run experiments to compare. Intuitively, sharing the biases might bring a slight improvement on some tasks, while sharing the weights is probably not workable (it would hurt fitting the task).
Note: the hidden state is usually not the final output of the network; it is typically followed by another transformation that projects it to the output state. Even the most basic recursive network does not go directly from the input layer to the output layer the way a feedforward network can; it has at least one hidden layer.
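A minimal NumPy sketch of a bidirectional recursive layer under these conventions (the dimensions and initialization are made up; the final projection is the extra transformation to the output state mentioned in the note above):

```python
import numpy as np

dim_input, dim_hidden, dim_output, time_steps = 39, 64, 10, 5
phi = np.tanh

def rnn_layer(x, W, W_h, b):
    """Run a plain recursive layer over x with shape (time_steps, dim_input)."""
    h, out = np.zeros(W.shape[1]), []
    for x_t in x:
        h = phi(x_t @ W + h @ W_h + b)
        out.append(h)
    return np.stack(out)                              # (time_steps, dim_hidden)

x = np.random.randn(time_steps, dim_input)

# Separate parameters for the forward and the backward direction.
params_f = [np.random.randn(dim_input, dim_hidden) * 0.01,
            np.random.randn(dim_hidden, dim_hidden) * 0.01,
            np.zeros(dim_hidden)]
params_b = [np.random.randn(dim_input, dim_hidden) * 0.01,
            np.random.randn(dim_hidden, dim_hidden) * 0.01,
            np.zeros(dim_hidden)]

h_forward  = rnn_layer(x, *params_f)                  # starts at the first moment
h_backward = rnn_layer(x[::-1], *params_b)[::-1]      # starts at the last moment

h = h_forward + h_backward                            # sum rather than concat

# Project the hidden states to the output states.
W_o = np.random.randn(dim_hidden, dim_output) * 0.01
b_o = np.zeros(dim_output)
y = h @ W_o + b_o                                     # (time_steps, dim_output)
```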
Note: a bidirectional recursive layer can give better predictions, but it cannot predict in real time, because the backward recursion has to start from the last moment: the network must wait for the whole sequence before it can begin predicting. Its use is therefore limited in online speech recognition, where real-time recognition is required.
Output of recursive networks: the recursive network is in essence an extension of the feedforward network along the time dimension.
Diagram: a conventional network associates input and output vector-to-vector (with no time dimension). With a recursive layer introduced, this extends to sequence-to-sequence matching, which gives rise to the series of association patterns to the right of "one to one". The most special case is the last "many to many", which occurs when the lengths of the input and output sequences are not fixed; it is in fact two recursive networks spliced together, their common point being the purple hidden state.
many to one: commonly used in sentiment analysis, associating a sentence with an emotion vector.
many to many: the first "many to many" is frequently used in the DNN-HMM speech recognition framework.
many to many (variable length): the second "many to many" is commonly used in machine translation between two different languages.
Data for recursive networks: because of the introduction of time steps, the training data for a recursive network is organized differently from that for a feedforward network.
Feedforward network: inputs and outputs are matrices
Input matrix shape: (n_samples, dim_input)
Output matrix shape: (n_samples, dim_output)
Note: when actually training or testing, the input and output of the network for a single example is just a vector. The n_samples dimension is added so that several samples can be processed at once and the average gradient used to update the weights; this is called mini-batch gradient descent.
If n_samples equals 1, the update is called stochastic gradient descent (SGD).
Recursive network: inputs and outputs are tensors of at least 3 dimensions; with images or other richer data, the number of tensor dimensions increases further.
Input tensor shape: (time_steps, n_samples, dim_input)
Output tensor shape: (time_steps, n_samples, dim_output)
Note: mini-batch gradient descent is still used for training; the difference is the extra time_steps dimension.
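For illustration, a small sketch of the difference in data layout (the sizes are made up; n_samples is the mini-batch size):

```python
import numpy as np

n_samples, dim_input, dim_output, time_steps = 32, 39, 10, 5

# Feedforward network: one vector in, one vector out, batched over samples.
ff_inputs  = np.zeros((n_samples, dim_input))                # (n_samples, dim_input)
ff_outputs = np.zeros((n_samples, dim_output))               # (n_samples, dim_output)

# Recursive network: the extra leading time_steps dimension.
rnn_inputs  = np.zeros((time_steps, n_samples, dim_input))   # (time_steps, n_samples, dim_input)
rnn_outputs = np.zeros((time_steps, n_samples, dim_output))  # (time_steps, n_samples, dim_output)

# At any single time step the network still just consumes a batch of vectors:
x_t = rnn_inputs[0]   # shape (n_samples, dim_input)
```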
The input to a recursive network at any single moment is essentially still one vector; the vectors simply enter the network at different times. You may prefer to think of the input as a sequence of vectors, or as a matrix.
Network treatment
Treat all networks in terms of layers: a recurrent neural network is simply a neural network that contains a recursive layer; the key is the presence of the recursive layer in the network.
The role of each layer is to transform data from one space to another. A layer can be viewed as feature extraction or as a classifier; there is no clear boundary between the two roles, and they contain each other. The key is how you choose to understand it.
The advantage of understanding networks in terms of layers is that future neural networks will rarely use only one kind of layer; feedforward, recursive, and convolutional layers are often mixed, and it no longer makes sense to name the whole structure a "recurrent neural network".
Example: in the diagram below, a feedforward hidden layer is added before and after the bidirectional recursive layer. You can also stack more bidirectional recursive layers, in which case people put the word "deep" in front of the name to make it sound more impressive. A structural sketch follows below.
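Purely as a structural sketch (NumPy, randomly initialized, nothing here is trained; the helper names dense and birnn are made up for illustration), stacking layers in the way the diagram describes could look like this:

```python
import numpy as np

phi = np.tanh

def dense(x, dim_out):
    """Hypothetical feedforward layer applied to every time step."""
    W = np.random.randn(x.shape[-1], dim_out) * 0.01
    return phi(x @ W)

def birnn(x, dim_hidden):
    """Hypothetical bidirectional recursive layer (sum of both directions)."""
    def run(seq):
        W  = np.random.randn(seq.shape[-1], dim_hidden) * 0.01
        Wh = np.random.randn(dim_hidden, dim_hidden) * 0.01
        h, out = np.zeros(dim_hidden), []
        for x_t in seq:
            h = phi(x_t @ W + h @ Wh)
            out.append(h)
        return np.stack(out)
    return run(x) + run(x[::-1])[::-1]   # forward + backward, time-aligned

x = np.random.randn(20, 39)   # (time_steps, dim_input)
h = dense(x, 128)             # feedforward hidden layer before
h = birnn(h, 64)              # bidirectional recursive layer
h = birnn(h, 64)              # stack another one -> a "deep" bidirectional network
y = dense(h, 10)              # feedforward layer projecting to the output
```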
Note: a layer in the figure is not a circle but a set of connections; the circles represent the states before and after each layer.
Problems with recursive networks
In theory, a conventional recursive network should be able to take dependencies on all past moments into account, but in practice it cannot. The reason is the vanishing gradient and exploding gradient problems. The next section describes Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU): special implementations of the recursive network designed to address these problems.