derivation of the fully connected layer
Each node of a fully connected layer is connected to every node in the previous layer, and its role is to combine the features extracted by the earlier layers. Because of this full connectivity, the fully connected layers usually account for the most parameters in the network.

forward calculation of the fully connected layer
The two most densely connected areas in the diagram below are the fully connected layers, which makes it clear that fully connected layers do indeed have many parameters. The forward computation is a linear weighted summation: each output of the fully connected layer is obtained by multiplying every node of the previous layer by a weight coefficient w, summing, and adding a bias value b. For the first fully connected layer in the figure below, the input has 50*4*4 = 800 neuron nodes and the output has 500 nodes, so in total we need 50*4*4*500 = 400,000 weight parameters W and 500 bias parameters b.
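As a quick sanity check on the parameter count, the layer sizes above can be sketched in NumPy (the 800-to-500 shapes come from the figure; the random values are placeholders, not the network's actual weights):

```python
import numpy as np

# Layer sizes from the example: 50*4*4 = 800 inputs, 500 outputs.
n_in, n_out = 50 * 4 * 4, 500

rng = np.random.default_rng(0)
W = rng.standard_normal((n_out, n_in))  # weight matrix, one row per output node
b = rng.standard_normal(n_out)          # one bias per output node

x = rng.standard_normal(n_in)           # flattened input from the previous layer
a = W @ x + b                           # forward pass: linear weighted sum plus bias

print(W.size)   # 400000 weight parameters
print(b.size)   # 500 bias parameters
print(a.shape)  # (500,)
```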
The following simple network is used to walk through the derivation in detail.
Here x1, x2, x3 are the inputs of the fully connected layer and a1, a2, a3 are its outputs. Following the derivation in my earlier Note 1, we have:
This can be written in the following matrix form:
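In standard notation (a sketch of the usual fully connected forward pass; w_{ij} denoting the weight from input j to output i and b_i the bias are assumptions, since the original symbols are not shown), the element-wise and matrix forms are:

```latex
a_i = \sum_{j} w_{ij} x_j + b_i, \quad i = 1, 2, 3

\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}
=
\begin{pmatrix}
w_{11} & w_{12} & w_{13} \\
w_{21} & w_{22} & w_{23} \\
w_{31} & w_{32} & w_{33}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
+
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix},
\qquad a = W x + b
```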
backpropagation of the fully connected layer
Take our first fully connected layer as an example: the layer has 50*4*4 = 800 input nodes and 500 output nodes.
Because we need to update W and b and also pass the gradient backward, we need to compute the following three partial derivatives.
1. Derivative with respect to the previous layer's output (i.e. the current layer's input). If we know the gradient at this layer, we can obtain the partial derivative of the loss with respect to x by the chain rule.
First, we need the partial derivative of this layer's output with respect to its input x.
Then the partial derivative of the loss with respect to x is obtained by the chain rule:
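Writing δ_i = ∂L/∂a_i for the gradient arriving at this layer's output, the two steps combine as follows (the standard chain-rule form, assuming the a = Wx + b forward pass above):

```latex
\frac{\partial a_i}{\partial x_j} = w_{ij}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x_j}
= \sum_i \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial x_j}
= \sum_i \delta_i w_{ij},
\qquad
\frac{\partial L}{\partial x} = W^{\mathsf T} \frac{\partial L}{\partial a}
```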
The result of the derivative above also confirms what I said earlier: if, in the forward pass, node a in layer x contributes to node b in layer x+1 through weight w, then in backpropagation the gradient propagates from node b back to node a through the same weight w.
If we train on 16 images at a time, that is, batch_size = 16, then we can write the computation in the following matrix form.
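A small NumPy sketch of this batched form (16 rows of X standing in for the 16 images; the quadratic loss is an arbitrary choice just to make the check concrete, and the sizes are scaled down from 800 and 500):

```python
import numpy as np

rng = np.random.default_rng(1)
batch, n_in, n_out = 16, 8, 5           # small stand-ins for 16 x 800 -> 500

X = rng.standard_normal((batch, n_in))  # one row per image in the batch
W = rng.standard_normal((n_out, n_in))
b = rng.standard_normal(n_out)

def loss(X):
    A = X @ W.T + b                     # batched forward pass
    return 0.5 * np.sum(A ** 2)         # arbitrary scalar loss for the check

A = X @ W.T + b
dA = A                                  # dL/dA for this quadratic loss
dX = dA @ W                             # matrix form of the chain rule

# Numerically check one entry of dX with a finite difference.
eps = 1e-6
Xp = X.copy()
Xp[3, 2] += eps
num = (loss(Xp) - loss(X)) / eps
print(abs(num - dX[3, 2]) < 1e-3)       # prints True
```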
2. Derivative with respect to the weight coefficients W
The formula for our forward computation is shown below:
From the diagram, it follows that:
When batch_size = 16, this is written in matrix form:
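Concretely, with δ_i = ∂L/∂a_i as before, the per-sample and batched forms would read (a reconstruction of the standard result):

```latex
\frac{\partial L}{\partial w_{ij}}
= \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial w_{ij}}
= \delta_i x_j,
\qquad
\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial A}\right)^{\mathsf T} X
```

where the rows of X and ∂L/∂A index the 16 samples, so the matrix product sums the per-sample gradients.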
3. Derivative with respect to the bias coefficients b. From the derivation above,
That is, the partial derivative of the loss with respect to a bias coefficient equals the partial derivative of the loss with respect to the corresponding output of this layer.
When batch_size = 16, the derivatives from the different samples in the batch with respect to the same b are added together, which in matrix form is a multiplication by an all-ones vector:
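The same toy sketch extended to W and b (again a hedged example with scaled-down sizes, not the author's code): the bias gradient sums ∂L/∂A over the batch dimension, which is exactly a left-multiplication by an all-ones vector.

```python
import numpy as np

rng = np.random.default_rng(2)
batch, n_in, n_out = 16, 8, 5

X = rng.standard_normal((batch, n_in))
W = rng.standard_normal((n_out, n_in))
b = rng.standard_normal(n_out)

def loss(W, b):
    A = X @ W.T + b                     # batched forward pass
    return 0.5 * np.sum(A ** 2)         # arbitrary scalar loss for the check

A = X @ W.T + b
dA = A                                  # dL/dA for this quadratic loss
dW = dA.T @ X                           # sums the per-sample outer products
db = np.ones(batch) @ dA                # all-ones vector sums over the batch

# Spot-check both against finite differences.
eps = 1e-6
Wp = W.copy(); Wp[1, 4] += eps
bp = b.copy(); bp[2] += eps
print(abs((loss(Wp, b) - loss(W, b)) / eps - dW[1, 4]) < 1e-3)  # prints True
print(abs((loss(W, bp) - loss(W, b)) / eps - db[2]) < 1e-3)     # prints True
```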
------------------------------------------------------------
Next, let us discuss the significance of the fully connected layer.
A fully connected layer can be viewed as a convolution whose kernel size equals the size of the input feature map: the result of each such convolution is a single node, which corresponds to one point of the fully connected layer. Suppose the output of the last convolution layer is 7x7x512 and the fully connected layer attached to it is 1x1x4096. Converting this fully connected layer to a convolution layer gives:
1. 4096 groups of filters in total
2. Each group of filters contains 512 convolution kernels
3. Each convolution kernel is of size 7x7
4. The output is 1x1x4096
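This equivalence between the 7x7x512 → 1x1x4096 fully connected layer and its convolutional form can be checked numerically. A small sketch (with 4096 scaled down to 10 filters for speed): since each kernel covers the entire 7x7 input, the "convolution" of each filter reduces to a single dot product.

```python
import numpy as np

rng = np.random.default_rng(3)
c, h, w, n_out = 512, 7, 7, 10          # 10 filters stand in for 4096

feat = rng.standard_normal((c, h, w))            # last conv layer output, 7x7x512
kernels = rng.standard_normal((n_out, c, h, w))  # n_out groups of 512 kernels, each 7x7
bias = rng.standard_normal(n_out)

# Convolutional view: each filter covers the whole input, giving a 1x1 output.
conv_out = np.array([np.sum(k * feat) for k in kernels]) + bias

# Fully connected view: flatten the input; flattened kernels become weight rows.
fc_out = kernels.reshape(n_out, -1) @ feat.reshape(-1) + bias

print(np.allclose(conv_out, fc_out))    # prints True
```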
------------------------------------------
If another 1x1x4096 fully connected layer is attached after it, the parameters of the corresponding converted convolution layer are:
1. 4096 groups of filters in total
2. Each group of filters contains 4096 convolution kernels
3. Each convolution kernel is of size 1x1
4. The output is 1x1x4096
This is equivalent to computing classification scores from the combined 4096 features; the class with the highest score is taken as the predicted category.
The disadvantage of the fully connected layer is that it destroys the spatial structure of the image, so people began to use convolution layers to "replace" fully connected layers, usually with 1x1 convolution kernels. A CNN that contains no fully connected layers is called a fully convolutional network (FCN). FCN was originally used for image segmentation tasks and was then applied to various other problems in computer vision; in fact, the CNN used to generate candidate windows in Faster R-CNN is an FCN.
The characteristic of an FCN is that both its input and output are two-dimensional images with corresponding spatial structure. In this case, the output of an FCN can be viewed as a heat map, where heat indicates the location and extent of the target to be detected: high heat appears in the target region and lower heat in the background. This can also be seen as classifying every pixel of the image according to whether or not it lies on the target.