Deep Learning Paper Notes (IV): The Derivation and Implementation of CNNs (Convolutional Neural Networks)
http://blog.csdn.net/zouxy09
I often read papers, but the impression fades quickly; when I pick one up again later, it feels as if I had never read it. So I want to get into the habit of summarizing the useful knowledge points from the papers I read. On the one hand, my own understanding deepens while organizing them; on the other hand, it makes them easier to look up later. Even better, I can post them on the blog and discuss them with everyone. Because my background is limited, some of my understanding of the papers may be incorrect, so please do not hesitate to point out mistakes and discuss them with me. Thank you.
These notes are based on the paper:
Notes on Convolutional Neural Networks, Jake Bouvrie.
They mainly cover the derivation and implementation of CNNs, so it is best to have some basic knowledge of CNNs before reading. Here is a list of reference material:
[1] Deep Learning study notes series (VII)
[2] LeNet-5, convolutional neural networks
[3] Convolutional neural networks
[4] Neural Network for Recognition of Handwritten Digits
[5] Deep learning: 38 (Stacked CNN brief introduction)
[6] Gradient-based learning applied to document recognition
[7] ImageNet classification with deep convolutional neural networks
[8] "Convolution feature extraction" and "Pooling" in UFLDL
In addition, there is a MATLAB Deep Learning Toolbox that contains CNN code; in the next blog post I will annotate that code in detail. These notes are very important for understanding that code.
Below is my understanding of some of the knowledge points in the paper:
Notes on Convolutional Neural Networks
1. Introduction
This document discusses the derivation and implementation of CNNs. A CNN architecture has many more connections than weights, which in effect implies a form of regularization. This particular kind of network assumes that we want to learn a set of filters in a data-driven way as a means of extracting features from the input.
In this paper, we first describe the classical BP algorithm for training fully-connected networks, and then derive the BP weight-update rules for the convolutional layers and sub-sampling layers of a 2D CNN. Throughout the derivation we emphasize efficiency of implementation, so some MATLAB code is given. Finally, we turn to discussing how to automatically learn combinations of feature maps from the previous layer, and in particular how to learn sparse combinations of feature maps.
2. Backpropagation in Fully-Connected Networks
In a typical CNN, the early layers alternate between convolution and down-sampling, and the last layers (near the output) are fully-connected one-dimensional networks. At that point we have converted all the 2D feature maps into the input of a fully-connected 1D network. So, when we are ready to feed the final set of 2D feature maps into the 1D network, a convenient approach is to concatenate all the output feature maps into one long input vector. With that done, we return to the BP algorithm. (For a more detailed derivation of the basics, see the "Backpropagation Algorithm" section in UFLDL.)
2.1 Feedforward Pass
In the following derivation we use the squared-error cost function, for a multi-class problem with c classes and N training samples:

E^N = (1/2) Σ_{n=1..N} Σ_{k=1..c} (t_k^n − y_k^n)^2

Here t_k^n denotes the k-th dimension of the label of the n-th sample, and y_k^n denotes the k-th output of the network for the n-th sample. For multi-class problems the targets are usually organized in "one-of-c" form: only the output node corresponding to the input's class is positive, and the other nodes are 0 or negative, depending on the activation function of the output layer (0 for the sigmoid, −1 for tanh).
Because the error over the whole training set is just the sum of the errors of the individual training samples, we first consider BP for a single sample. The error of the n-th sample is:

E^n = (1/2) Σ_{k=1..c} (t_k^n − y_k^n)^2 = (1/2) ||t^n − y^n||^2
In a traditional fully-connected neural network, we need to compute, according to the BP rules, the partial derivative of the cost function E with respect to every weight of the network. We use l to denote the current layer (the output layer is layer L, the input layer is layer 1), so the output of the current layer can be written as:

x^l = f(u^l),  with  u^l = W^l x^(l−1) + b^l
The output activation function f(·) can take many forms; it is usually the sigmoid function or the hyperbolic tangent function. The sigmoid compresses its output into [0, 1], so the final outputs tend, on average, toward 0. If we normalize our training data to zero mean and unit variance, we can improve convergence during gradient descent. The hyperbolic tangent is also a good choice for normalized datasets.
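To make the forward-pass formula above concrete, here is a minimal MATLAB sketch for a single fully-connected layer with a sigmoid activation; the sizes and variable names (x_prev, W, b) are made-up examples, not taken from the toolbox.

```matlab
% Minimal forward pass for one fully-connected layer (illustrative sizes and values).
x_prev = rand(4, 1);            % output of the previous layer (4 units)
W      = randn(3, 4);           % weights of the current layer (3 units)
b      = zeros(3, 1);           % additive biases
u = W * x_prev + b;             % u^l = W^l x^(l-1) + b^l
x = 1 ./ (1 + exp(-u));         % x^l = f(u^l), with f the sigmoid
```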
2.2 Backpropagation Pass
The error propagated backwards can be viewed as the sensitivity of each neuron's bias (sensitivity means: how much the error changes when the bias b changes, i.e., the rate of change, the derivative, of the error with respect to the bias). It is defined as follows (the second equality follows from the chain rule):

δ = ∂E/∂b = (∂E/∂u)(∂u/∂b)
Because ∂u/∂b = 1, we have ∂E/∂b = ∂E/∂u = δ; that is, the bias sensitivity ∂E/∂b = δ equals the derivative ∂E/∂u of the error E with respect to the node's total input u. It is this derivative that propagates the error from higher layers back down to lower layers, through the following recurrence (the expression below gives the sensitivity of layer l):
δ^l = (W^(l+1))^T δ^(l+1) ∘ f′(u^l)        (1)
Here "∘" denotes element-wise multiplication. The sensitivities of the output-layer neurons take a different form:

δ^L = f′(u^L) ∘ (y^n − t^n)
Finally, each neuron's weights are updated with the delta (δ) rule. Concretely, for a given neuron we take its input and scale it by that neuron's δ. In vector form: for layer l, the derivative of the error with respect to that layer's weight matrix is the outer product of the layer's input (i.e., the previous layer's output) and the layer's sensitivities (the δ's of all neurons in the layer collected into a vector). Multiplying this partial derivative by a negative learning rate then gives the update for that layer's weights:
∂E/∂W^l = δ^l (x^(l−1))^T,    ΔW^l = −η ∂E/∂W^l        (2)
The update expression for the bias is analogous. In practice, each weight W_ij may be given its own learning rate η_ij.
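As an illustration of formulas (1) and (2), here is a hedged MATLAB sketch of one BP step for a tiny two-layer fully-connected network with sigmoid units and squared error; all variable names and sizes are illustrative, not the toolbox's.

```matlab
% One backprop step for a tiny 2-layer fully-connected net (sigmoid + squared error).
eta = 0.1;                               % learning rate
x0 = rand(4,1);  t = [1; 0];             % input and one-of-c target
W1 = randn(3,4); b1 = zeros(3,1);        % hidden-layer parameters
W2 = randn(2,3); b2 = zeros(2,1);        % output-layer parameters

u1 = W1*x0 + b1;  x1 = 1./(1+exp(-u1));  % forward pass
u2 = W2*x1 + b2;  y  = 1./(1+exp(-u2));

d2 = (y - t) .* y .* (1 - y);            % output sensitivities: f'(u^L) o (y - t)
d1 = (W2' * d2) .* x1 .* (1 - x1);       % formula (1): (W^(l+1))' delta^(l+1) o f'(u^l)

W2 = W2 - eta * d2 * x1';  b2 = b2 - eta * d2;   % formula (2) and the bias update
W1 = W1 - eta * d1 * x0';  b1 = b1 - eta * d1;
```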
3. Convolutional Neural Networks
3.1 Convolution Layers
We now focus on the BP updates of a convolutional layer. In a convolutional layer, the feature maps of the previous layer are convolved with learnable convolution kernels and passed through the activation function to form the output feature maps. Each output map may combine convolutions over several input maps:

x_j^l = f( Σ_{i∈M_j} x_i^(l−1) * k_ij^l + b_j^l )
Here M_j denotes the set of selected input maps. So which input maps do we choose? Common choices are pairs or triples of maps, but below we will discuss how to automatically learn which feature maps to combine. Each output map has an additive bias b; however, for a particular output map, the kernels convolved with the different input maps are distinct. That is, if output map j and output map k both sum over convolutions of input map i, the corresponding kernels are not the same.
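A minimal MATLAB sketch of this convolutional forward pass, assuming 'valid' convolutions and a sigmoid activation; the map sizes, kernels, and the index set Mj are made-up examples.

```matlab
% Forward pass of one convolutional output map j (illustrative sizes).
X  = {rand(8,8), rand(8,8), rand(8,8)};        % input feature maps x_i^(l-1)
Mj = [1 3];                                    % indices of the input maps feeding map j
K  = {randn(3,3), randn(3,3), randn(3,3)};     % one distinct kernel k_ij per input map
bj = 0.1;                                      % additive bias of output map j

uj = zeros(6,6);                               % 8-3+1 = 6 for a 'valid' convolution
for i = Mj
    uj = uj + conv2(X{i}, K{i}, 'valid');      % sum of convolutions over i in M_j
end
xj = 1 ./ (1 + exp(-(uj + bj)));               % x_j^l = f(sum + b_j)
```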
3.1.1 Computing the Gradients
We assume that each convolutional layer l is followed by a down-sampling layer l+1. For BP, we know from the above that in order to update the weights of each neuron of layer l we first need the sensitivity δ of each node of layer l (see the weight-update formula (2)). To obtain this sensitivity, we sum the sensitivities of the layer-(l+1) nodes connected to the node of interest in layer l, multiplied by the corresponding connection weights W, and then multiply by the derivative of the activation function f evaluated at that node's input u in layer l (this is exactly the sensitivity back-propagation formula (1)). This gives the sensitivity δ^l of every node in layer l.
However, because of the down-sampling, one pixel (neuron) of the sub-sampling layer corresponds to a whole block of pixels (the sampling window) of the convolutional layer's output map. Hence each node of a map in layer l connects to only one node of the corresponding map in layer l+1.
To compute the layer-l sensitivities efficiently, we therefore upsample the sensitivity map of the down-sampling layer (each pixel of a feature map has its own sensitivity, so the sensitivities form a map as well), so that the upsampled sensitivity map has the same size as the convolutional layer's map. We then multiply, element by element, the derivative of layer l's activation map with the upsampled sensitivity map from layer l+1 (as in formula (1)).
The weight of a map in the down-sampling layer is a constant β. So we only need to multiply the result of the previous step by β to finish computing the layer-l sensitivity δ.
We can repeat the same computation for each feature map j in the convolutional layer, each time pairing it with the corresponding map of the sub-sampling layer (compare formula (1)):

δ_j^l = β_j^(l+1) ( f′(u_j^l) ∘ up(δ_j^(l+1)) )
up(·) denotes the upsampling operation. If the down-sampling factor is n, it simply tiles each pixel n times horizontally and n times vertically, restoring the original size. In fact, this function can be implemented with the Kronecker product:

up(x) ≡ x ⊗ 1_{n×n}
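As the note suggests, up(·) is a one-liner with MATLAB's kron; a small sketch with example values:

```matlab
% Upsample a sensitivity map by tiling each entry n times in each direction.
n     = 2;                      % down-sampling factor used by the pooling layer
delta = [1 2; 3 4];             % example 2x2 sensitivity map from layer l+1
up    = kron(delta, ones(n));   % up(delta): each pixel becomes an n x n block
% up = [1 1 2 2; 1 1 2 2; 3 3 4 4; 3 3 4 4]
```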
Now, for a given map, we can compute its sensitivity map. The gradient of the bias is then obtained quickly by simply summing over all entries of the layer-l sensitivity map:
∂E/∂b_j = Σ_{u,v} (δ_j^l)_{uv}        (3)
Finally, the gradients of the kernel weights are computed with the BP rule (equation (2)). In addition, many connections share the same weight, so for a given weight we must sum the gradients over all connections that use that weight, just as we did for the bias gradient above:

∂E/∂k_ij^l = Σ_{u,v} (δ_j^l)_{uv} (p_i^(l−1))_{uv}
Here (p_i^(l−1))_{uv} is the patch of the input map x_i^(l−1) that was multiplied element-wise by k_ij during the convolution to produce the element at position (u, v) of the output map.
At first glance it seems we would have to painstakingly keep track of which patch of the input map corresponds to each pixel of the output map (and of the corresponding sensitivity map). In fact, in MATLAB this can be done with a single line of code, using the convolution function:
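Based on the description here and in the next paragraph, one plausible one-line MATLAB version is the following sketch (example sizes, not the toolbox's exact code):

```matlab
% Gradient of kernel k_ij from the input map and the sensitivity map of output map j.
x_i     = rand(8,8);        % input map x_i^(l-1)
delta_j = rand(6,6);        % sensitivity map of output map j (layer l)
dK_ij   = rot90(conv2(x_i, rot90(delta_j, 2), 'valid'), 2);   % 3x3 kernel gradient
```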
We first rotate the delta sensitivity map so that the convolution performs a cross-correlation rather than a convolution (in the mathematical definition of convolution, the kernel passed to conv2 is flipped, i.e., its rows and columns are reversed), and then rotate the output back, so that when we perform the convolution in the forward pass the kernel has the orientation we want.
3.2 Sub-sampling Layers
A sub-sampling layer has N input maps and N output maps, but each output map is smaller:

x_j^l = f( β_j^l down(x_j^(l−1)) + b_j^l )
down(·) denotes a down-sampling function. A typical operation sums all pixels in each distinct n×n block of the input image, so the output image is n times smaller in both dimensions. Each output map has its own multiplicative bias β and additive bias b.
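A small, deliberately literal sketch of this forward pass with a 2×2 summing window (a faster conv2-based variant appears in Section 3.4); all sizes and values are examples.

```matlab
% Forward pass of a sub-sampling layer: sum each disjoint n x n block, then scale and bias.
x    = rand(6, 6);                   % input map coming from the convolutional layer
n    = 2;                            % pooling window size
beta = 0.5;  b = 0;                  % multiplicative and additive biases
d    = zeros(size(x) / n);           % down(x) is n times smaller per dimension
for r = 1:size(d, 1)
    for c = 1:size(d, 2)
        blk     = x((r-1)*n+1 : r*n, (c-1)*n+1 : c*n);
        d(r, c) = sum(blk(:));       % sum over one n x n block
    end
end
y = 1 ./ (1 + exp(-(beta * d + b))); % x_j^l = f(beta * down(x) + b)
```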
3.2.1 Computing the Gradients
The hardest part here is computing the sensitivity maps. Once we have them, the only parameters to update are the biases β and b, which is easy (formula (3)). If the next convolutional layer is fully connected to this sub-sampling layer, the sensitivity maps of the sub-sampling layer can be computed with BP.
We need to compute the gradient of the convolution kernel, so we must figure out which patch of the input map corresponds to a given pixel of the output map. Here this means figuring out which patch of the current layer's sensitivity map corresponds to a given pixel of the next layer's sensitivity map, so that the δ recursion of formula (1) can be applied, i.e., so that the sensitivities propagate backwards. We also multiply by the weight connecting the input patch to the output pixel, which is just the (rotated) convolution kernel:

δ_j^l = f′(u_j^l) ∘ conv2(δ_j^(l+1), rot180(k_j^(l+1)), 'full')
Before that, we rotate the kernel so that the convolution function performs a cross-correlation. We also need to handle the convolution boundaries, which is easy in MATLAB: the 'full' convolution automatically pads the missing input pixels with zeros.
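A hedged sketch of this computation for a sub-sampling map that feeds a single convolutional map in layer l+1 (if it feeds several maps, the contributions are summed); the sizes and names are examples only.

```matlab
% Sensitivities of a sub-sampling map, back-propagated from the next convolutional layer.
delta_next = rand(3,3);      % sensitivity map of the conv layer l+1
k_next     = randn(3,3);     % kernel connecting this map to that conv map
u          = rand(5,5);      % inputs of this sub-sampling map, saved in the forward pass
sig        = 1 ./ (1 + exp(-u));
fprime     = sig .* (1 - sig);                                   % f'(u) for the sigmoid
delta = fprime .* conv2(delta_next, rot90(k_next, 2), 'full');   % rotated kernel, 'full' pads with zeros
```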
Now we can compute the gradients for b and β. The additive bias b is handled exactly as in the convolutional layer above: sum all the elements of the sensitivity map:

∂E/∂b_j = Σ_{u,v} (δ_j^l)_{uv}
For the multiplicative bias β, since it multiplies the down-sampled map computed during forward propagation, it is best to save those maps in the forward pass so they do not have to be recomputed during the backward pass. We define:

d_j^l = down(x_j^(l−1))
The gradient of β can then be computed as:

∂E/∂β_j = Σ_{u,v} (δ_j^l ∘ d_j^l)_{uv}
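Both bias gradients reduce to simple sums over the sensitivity map, as in this small sketch (assuming the down-sampled map d was saved during the forward pass; values are examples):

```matlab
% Gradients of the additive bias b and multiplicative bias beta of a sub-sampling map.
delta = rand(3,3);             % sensitivity map of this map (layer l)
d     = rand(3,3);             % d_j^l = down(x_j^(l-1)), saved during the forward pass
db    = sum(delta(:));         % formula (3): sum over all entries of the sensitivity map
dbeta = sum(sum(delta .* d));  % gradient of beta: sum of delta o d
```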
3.3 Learning Combinations of Feature Maps
Often, convolving several input maps and then summing the results gives a better output map than using a single input map. In the literature, which input maps to combine for each output map is usually chosen by hand. Here, however, we try to let the CNN learn these combinations during training, i.e., let the network itself learn which input maps are best for computing each output map. We use α_ij to denote the weight, or contribution, of input map i in forming output map j. The j-th output map can then be written as:

x_j^l = f( Σ_{i=1..N_in} α_ij (x_i^(l−1) * k_i^l) + b_j^l )
subject to the constraints:

Σ_i α_ij = 1,   0 ≤ α_ij ≤ 1
These constraints on the variables α_ij can be enforced by expressing each α_ij as the softmax of a set of unconstrained, hidden weights c_ij (since the softmax outputs are exponential functions of the underlying weights, their rates of change differ):

α_ij = exp(c_ij) / Σ_k exp(c_kj)
Because, for a fixed j, each set of weights c_ij is independent of the weights of the other sets, we drop the subscript j for ease of exposition and consider the update of a single output map; the updates of the other maps follow the same process, only the map index j differs.
The derivative of the softmax function is:

∂α_k/∂c_i = α_k (δ_ki − α_i)
Here δ_ki is the Kronecker delta. The derivative of the error with respect to α_i at layer l is:

∂E/∂α_i = (∂E/∂u^l)(∂u^l/∂α_i) = Σ_{u,v} ( δ^l ∘ (x_i^(l−1) * k_i^l) )_{uv}
Finally, the derivative with respect to the weights c_i follows from the chain rule:

∂E/∂c_i = Σ_k (∂E/∂α_k)(∂α_k/∂c_i) = α_i ( ∂E/∂α_i − Σ_k α_k ∂E/∂α_k )
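A minimal sketch of the softmax parameterization and of the chain-rule gradient ∂E/∂c_i derived above, for one output map; here dE_dalpha is a placeholder standing in for Σ_{u,v}(δ^l ∘ (x_i^(l−1) * k_i^l))_{uv}.

```matlab
% Softmax combination weights alpha from unconstrained weights c, and dE/dc.
c     = randn(3,1);                      % underlying weights c_i for one output map
alpha = exp(c) ./ sum(exp(c));           % alpha_i = exp(c_i) / sum_k exp(c_k)
dE_dalpha = randn(3,1);                  % placeholder for the per-map error derivatives
dE_dc = alpha .* (dE_dalpha - sum(alpha .* dE_dalpha));   % alpha_i (dE/dalpha_i - sum_k alpha_k dE/dalpha_k)
```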
3.3.1 Enforcing Sparse Combinations
To force the α's to be sparse, that is, to restrict an output map to connect to only some, not all, of the input maps, we add a sparsity penalty Ω(α) to the overall cost function. For a single sample, the regularized cost function becomes:

Ẽ^n = E^n + λ Σ_{i,j} |α_ij|
We then find the contribution of this regularization term to the gradient of the weights c_i. The derivative of the regularization term Ω(α) with respect to α_i is:

∂Ω/∂α_i = λ sign(α_i)
Then, by the chain rule, the derivative with respect to c_i is:

∂Ω/∂c_i = Σ_k (∂Ω/∂α_k)(∂α_k/∂c_i) = λ ( |α_i| − α_i Σ_k |α_k| )
So the final gradient of the weights c_i combines both terms:

∂Ẽ^n/∂c_i = ∂E^n/∂c_i + ∂Ω/∂c_i
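And the contribution of the sparsity penalty to the gradient of c, following the closed form above (lambda and c are example values):

```matlab
% Gradient of the L1 sparsity penalty Omega(alpha) with respect to the weights c.
lambda = 0.01;
c      = randn(3,1);
alpha  = exp(c) ./ sum(exp(c));
dOmega_dc = lambda * (abs(alpha) - alpha * sum(abs(alpha)));   % lambda (|alpha_i| - alpha_i sum_k |alpha_k|)
```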
3.4 Making It Fast with MATLAB
CNN training mainly alternates between convolutional layers and sub-sampling layers, and the main computational bottlenecks are:
1) the forward pass: down-sampling the output maps of each convolutional layer;
2) the backward pass: upsampling the sensitivity maps of a higher sub-sampling layer to match the size of the lower convolutional layer's output maps;
3) applying the sigmoid and computing its derivative.
For the first and second problems, we consider how MATLAB's built-in image-processing functions can implement the up- and down-sampling operations. For upsampling, imresize can do the job, but it carries a lot of overhead. A faster alternative is the Kronecker product function kron: taking the Kronecker product of an all-ones matrix with the matrix we want to upsample achieves the upsampling effect. For the down-sampling in the forward pass, imresize does not offer an option to sum the pixels in each n×n block while shrinking the image, so it cannot be used. A good and fast method is to convolve the image with an all-ones kernel and then subsample the convolution result with standard indexing. For example, if the down-sampling window is 2×2, we convolve the image with a 2×2 kernel of all ones, and then take every second point of the result, y = x(1:2:end, 1:2:end); this yields a 2× down-sampling while also performing the summation.
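A small sketch of the fast summed down-sampling just described, using one conv2 call plus indexing (the input is an example map):

```matlab
% Fast summed 2x2 down-sampling via one convolution plus indexing.
x = rand(8, 8);                     % map to be down-sampled in the forward pass
n = 2;                              % pooling window size
s = conv2(x, ones(n), 'valid');     % every n x n neighbourhood summed
y = s(1:n:end, 1:n:end);            % keep one value per disjoint block: a 4x4 result
% Upsampling a map delta by the same factor is simply kron(delta, ones(n)).
```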
For the third problem, some people think that defining the sigmoid as a MATLAB inline function is faster. In fact, MATLAB differs from C++ and other languages here: MATLAB's inline functions take more time than ordinary function definitions. So we can simply write the code that computes the sigmoid function and its derivative directly where it is needed.
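For example, rather than building an inline object, one can simply write the sigmoid and its derivative out in place (a stylistic sketch, not a benchmark):

```matlab
% Sigmoid and its derivative written out directly (rather than via inline()).
u  = randn(100, 100);
% g = inline('1./(1 + exp(-x))'); y = g(u);   % the slower inline-function style
y  = 1 ./ (1 + exp(-u));            % sigmoid evaluated directly
dy = y .* (1 - y);                  % its derivative, reused during backpropagation
```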