I skimmed this paper; it is very novel.
My take: traditional convolutional neural networks usually rely on learned feature extraction alone. This paper combines a hand-crafted transformation with feature extraction to handle invariance across varied inputs.
A spatial transformer can express translation, rotation, scaling, and so on. Mathematically these are affine transformations; the framework also supports projective transformations, piece-wise affine transformations, thin plate splines, and so on.
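As a concrete illustration, here is a minimal sketch of a 2D affine transformation built from rotation, scaling, and translation. The specific parameter values are my own example, not from the paper; in the paper this 2x3 matrix theta is what the localization network predicts.

```python
import numpy as np

angle = np.pi / 6                      # rotate by 30 degrees
s = 0.5                                # uniform scale
tx, ty = 0.2, -0.1                     # translation
theta = np.array([
    [s * np.cos(angle), -s * np.sin(angle), tx],
    [s * np.sin(angle),  s * np.cos(angle), ty],
])

def affine(theta, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    homog = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords (N, 3)
    return homog @ theta.T                                   # transformed points (N, 2)

pts = np.array([[0.0, 0.0], [1.0, 0.0]])
out = affine(theta, pts)  # origin maps to the translation (0.2, -0.1)
```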
Implementation method:
Taking the two-dimensional examples in the paper:
For two-dimensional image recognition, what if the picture is rotated? Or translated? The usual practice is to feed the input to the CNN regardless and let it learn invariant features. Instead, we could correct the input explicitly via an affine transformation, but the transformation needs parameters, and normally those could only be set by hand. This paper adds an auxiliary network that learns the parameters from the input itself, applies the (pre-specified) affine transformation, and then proceeds with the original pipeline. Also, this transformer can be applied to the raw input or to intermediate feature maps.
Specifically:
Step one: use a localization network to learn the parameters required by the transformer. This network can be fully connected or convolutional; either way it ends with a regression layer that outputs the parameters.
Note: at first I thought training these parameters would be supervised, i.e. that each training input came with ground-truth parameters, but I was wrong: the parameters are not known during training, and they are learned through back-propagation just like the other network weights. In that case, I think the initialization of this regression layer matters: it should be set to output the identity transform at the start.
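A sketch of that identity initialization, under my assumption of a typical setup (the paper does not prescribe this exact layout): the final regression layer has its weights zeroed and its bias set to the 6 identity-transform parameters, so the spatial transformer starts as a no-op and training can deform it gradually.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal(32)             # activations feeding the regression layer

W = np.zeros((6, 32))                          # weights start at zero
b = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])  # bias = identity affine transform

theta = (W @ features + b).reshape(2, 3)       # predicted 2x3 affine matrix
# At initialization theta == [[1, 0, 0], [0, 1, 0]], the identity mapping.
```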
Step two: with the parameters from step one, work out the mapping the transformer defines. The paper does this with two grids: the regular grid (the evenly spaced grid of pixels in the output) and the sampling grid (the corresponding transformed points in the input).
Common practice would be to compute the mapping from input positions to output positions. This paper instead uses the mapping from each output position back to an input position. (The two mapping matrices are inverses of each other.)
The benefit is that every output pixel is guaranteed to get its value from an input pixel or from interpolation between input pixels; the forward direction cannot guarantee this. So the mapping to find is the one from the regular grid to the sampling grid, which is important and easy to get backwards.
The parameters of this regular-grid-to-sampling-grid mapping are exactly the parameters from step one.
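The grid step above can be sketched as follows. This is a minimal numpy version of my own (the paper uses normalized coordinates in [-1, 1], which I follow here): for each output pixel on the regular grid, apply theta to get the input coordinates to sample from.

```python
import numpy as np

def sampling_grid(theta, H, W):
    """Map the regular output grid through a 2x3 affine matrix theta,
    giving for each output pixel the (x, y) input coordinates to sample.
    Coordinates are normalized to [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W) homogeneous
    src = theta @ grid                                          # (2, H*W) input coords
    return src.reshape(2, H, W)

identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
src = sampling_grid(identity, 4, 4)
# With the identity transform, each output pixel samples its own location.
```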
Step three: the sampled points will not always land exactly on input pixels, so interpolation is necessary. The paper's section on differentiable image sampling introduces a simple scheme (bilinear interpolation, computing each output value from neighbouring input pixels with the usual formula) and shows the process is differentiable, meaning the whole module can be trained with back-propagation.
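A minimal sketch of bilinear interpolation at a single continuous coordinate (pixel units, single-channel image; the helper name is mine, not from the paper):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly interpolate the (H, W) image at continuous pixel coords (x, y):
    weight the four surrounding pixels by their distance to the sample point."""
    H, W = img.shape
    x0 = int(np.clip(np.floor(x), 0, W - 2))
    y0 = int(np.clip(np.floor(y), 0, H - 2))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                 # fractional offsets in [0, 1]
    return (img[y0, x0] * (1 - wx) * (1 - wy)
            + img[y0, x1] * wx * (1 - wy)
            + img[y1, x0] * (1 - wx) * wy
            + img[y1, x1] * wx * wy)

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
val = bilinear_sample(img, 0.5, 0.5)   # midpoint of the 4 pixels → 1.5
```

The weights (1-wx), wx, (1-wy), wy are piecewise-linear in the sampling coordinates, which is why gradients can flow back to theta (with sub-gradients at the integer boundaries).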
The whole process: the paper summarizes it with a figure, and the picture is very clear;
What I think is the core:
The paper uses a hand-crafted class of transformations (e.g., affine), but the transformation parameters are obtained through network training;
(An aside): differentiating the discontinuous sampling function requires sub-gradients (sub-gradient: something I still need to understand).
Unclear points:
Is the transformation learned through network training really the spatial transformer we want? The experimental results in the paper suggest it is, as shown in the second figure.
That is, how are the parameters learned when there are no labels for them? Purely through back-propagation. This surprised me!
Reference: Jaderberg M, Simonyan K, Zisserman A. Spatial Transformer Networks[C]//Advances in Neural Information Processing Systems. 2015: 2017-2025.