Spatial Transformer Networks (Space Transformation Neural Network)


Reference: Spatial Transformer Networks [Google DeepMind]
Reference: [Theano source, based on Lasagne]

Digression: big data is inferior to small data

This is a very recent paper (June 2015) from researchers at DeepMind, Google's new AI company.

They built a new local network layer, the spatial transformer layer. As its name suggests, it can apply an arbitrary spatial transformation to the input image, which complements the properties of CNNs.

In my paper [Application and Improvement of Deep Neural Networks in a Facial Emotion Analysis System], I put forward an interesting point:

Big data is inferior to small data if models cannot exploit it effectively.

This phenomenon is quite common. A classic example in ML is class imbalance: the model overfits the large classes and ignores the small ones.

In addition, [Evolving Culture vs Local Minima] makes a related point from the curriculum learning perspective:

Dividing big data into portions graded by difficulty and learning them in stages is far more effective than feeding everything in at once.

At present, even our most popular models are still weak, while the data is remarkably intricate, beyond imagination.

Thus, it is unrealistic to expect a model to capture the data distribution by itself, and inventing and improving model structures remains the first priority,

rather than, like Professor Fei-Fei Li, taking the unorthodox route of advancing deep learning with big data such as ImageNet.

The significance of spatial transformations

In my paper [Application and Improvement of Deep Neural Networks in Facial Affective Analysis], I analyzed three reasons why CNNs are so powerful:

[locality], [translation invariance], and [scaling invariance]. I also ran corresponding experiments on the missing [rotation invariance].

In essence, these invariances correspond to classical image processing operations: [cropping], [translation], [scaling], and [rotation].

These operations belong to one family, spatial transformations, and they all follow the same method: an affine transformation of the coordinate matrix.

So, is there a way for a neural network to implement these transformations adaptively, with one unified structure? DeepMind achieved this in a simple way.

Image processing techniques: affine matrix, inverse coordinate mapping, bilinear interpolation

1.1 Affine transformation matrix

To implement [cropping], [translation], [scaling], and [rotation], only a single $[2,3]$ transformation matrix is required:

$\begin{bmatrix}\theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23}\end{bmatrix}$
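As a minimal sketch of how such a matrix acts on a single point, the Python helper below multiplies a $[2,3]$ matrix by the homogeneous coordinate $(x, y, 1)$. The name `apply_affine` and the nested-list representation are illustrative assumptions, not taken from the paper or from the Lasagne source.

```python
def apply_affine(theta, x, y):
    """Apply a 2x3 affine matrix (nested lists) to the point (x, y)."""
    (t11, t12, t13), (t21, t22, t23) = theta
    # Multiply theta by the homogeneous column vector (x, y, 1).
    return (t11 * x + t12 * y + t13,
            t21 * x + t22 * y + t23)


# The identity transform leaves every point unchanged.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
print(apply_affine(identity, 0.5, -0.25))  # (0.5, -0.25)
```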


For translation (panning), the coordinate affine matrix is:

$\begin{bmatrix}1 & 0 & \theta_{13} \\ 0 & 1 & \theta_{23}\end{bmatrix}\begin{bmatrix}x\\ y\\ 1\end{bmatrix}=\begin{bmatrix}x+\theta_{13}\\ y+\theta_{23}\end{bmatrix}$


For scaling, the coordinate affine matrix is:

$\begin{bmatrix}\theta_{11} & 0 & 0 \\ 0 & \theta_{22} & 0\end{bmatrix}\begin{bmatrix}x\\ y\\ 1\end{bmatrix}=\begin{bmatrix}\theta_{11}x\\ \theta_{22}y\end{bmatrix}$
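The translation and scaling matrices above can be checked numerically. This is only a sketch: the helper names `translation`, `scaling`, and `apply_affine` are hypothetical, chosen for illustration.

```python
def apply_affine(theta, x, y):
    """Apply a 2x3 affine matrix (nested lists) to the point (x, y)."""
    (t11, t12, t13), (t21, t22, t23) = theta
    return (t11 * x + t12 * y + t13, t21 * x + t22 * y + t23)


def translation(tx, ty):
    # theta_13 and theta_23 carry the shift along x and y.
    return [[1.0, 0.0, tx], [0.0, 1.0, ty]]


def scaling(sx, sy):
    # theta_11 and theta_22 carry the per-axis scale factors.
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0]]


print(apply_affine(translation(0.5, -0.25), 0.25, 0.25))  # (0.75, 0.0)
print(apply_affine(scaling(2.0, 0.5), 0.25, 0.5))         # (0.5, 0.25)
```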


For rotation, taking a clockwise rotation by $\alpha$ around the origin, the coordinate affine matrix is:

$\begin{bmatrix}\cos(\alpha) & \sin(\alpha) & 0 \\ -\sin(\alpha) & \cos(\alpha) & 0\end{bmatrix}\begin{bmatrix}x\\ y\\ 1\end{bmatrix}=\begin{bmatrix}\cos(\alpha)x+\sin(\alpha)y\\ -\sin(\alpha)x+\cos(\alpha)y\end{bmatrix}$
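A quick numeric check of the rotation matrix, as a sketch. It assumes the usual mathematical convention with the y axis pointing up (with image rows growing downward, the same matrix appears to rotate the other way); the function names are illustrative.

```python
import math


def rotation(alpha):
    """2x3 matrix for a clockwise rotation by alpha radians about the origin."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, s, 0.0], [-s, c, 0.0]]


def apply_affine(theta, x, y):
    """Apply a 2x3 affine matrix (nested lists) to the point (x, y)."""
    (t11, t12, t13), (t21, t22, t23) = theta
    return (t11 * x + t12 * y + t13, t21 * x + t22 * y + t23)


# Rotating the point (0, 1) clockwise by 90 degrees should land on (1, 0).
x, y = apply_affine(rotation(math.pi / 2), 0.0, 1.0)
print(round(x, 6), round(y, 6))  # 1.0 0.0
```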

There is a trick here: image coordinates are not centered on the image, so we first normalize them, rescaling the coordinates to $[-1,1]$.

The rotation is then performed around the center of the image. This trick is used again later.
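The normalization can be sketched as a pair of linear maps between pixel indices and $[-1,1]$. The exact convention below (dividing by `size - 1`, so that the first and last pixels land exactly on -1 and 1) is an assumption; implementations differ on this detail.

```python
def normalize(index, size):
    """Map a pixel index in [0, size - 1] linearly onto [-1, 1]."""
    return 2.0 * index / (size - 1) - 1.0


def denormalize(coord, size):
    """Map a normalized coordinate in [-1, 1] back to a pixel position."""
    return (coord + 1.0) * (size - 1) / 2.0


width = 5
print([normalize(i, width) for i in range(width)])  # [-1.0, -0.5, 0.0, 0.5, 1.0]
print(denormalize(0.0, width))                      # 2.0, the center column
```

With this convention the origin $(0,0)$ coincides with the image center, so a rotation about the origin is a rotation about the center.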


As for cropping, the paper's explanation in terms of the determinant of the left 2x2 sub-matrix is not covered here, but cropping can be understood from the coordinate range:

As long as the range of $x'$, $y'$ is smaller than the range of $x$, $y$, the target image can be regarded as locating a local region of the source image.

This affine transformation has no single specific mathematical form, but the neural network will certainly make use of it during its search.
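To make the coordinate-range view of cropping concrete, the sketch below maps the four corners of the normalized $[-1,1]$ grid through a hypothetical crop matrix (scale by 0.5, shift by 0.5). The matrix values are an arbitrary illustration, not taken from the paper.

```python
def apply_affine(theta, x, y):
    """Apply a 2x3 affine matrix (nested lists) to the point (x, y)."""
    (t11, t12, t13), (t21, t22, t23) = theta
    return (t11 * x + t12 * y + t13, t21 * x + t22 * y + t23)


# Hypothetical crop matrix: halve both axes, then shift by 0.5.
crop = [[0.5, 0.0, 0.5],
        [0.0, 0.5, 0.5]]

corners = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]
mapped = [apply_affine(crop, x, y) for x, y in corners]
print(mapped)  # [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
```

The full $[-1,1]$ range is mapped onto $[0,1]$ on both axes, which covers only one quadrant of the original range.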

1.2 Inverse coordinate mapping

In linear algebra terms, the classical approach is to map source coordinates forward to target coordinates:

$\begin{bmatrix}\theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23}\end{bmatrix}\begin{bmatrix}x^{source}\\ y^{source}\\ 1\end{bmatrix}=\begin{bmatrix}x^{target}\\ y^{target}\end{bmatrix}$

For image processing, this approach is awkward for parallel matrix programming, because it needs extra memory to store the source of each mapping:

Because $(x^{target},y^{target})$ must be discrete by the time we read the value of $Pixel(x^{target},y^{target})$,

unless $(x^{source},y^{source})$ has been saved beforehand, each pixel must be copied on the spot, one point at a time: $Pixel(x^{source},y^{source}) \rightarrow Pixel(x^{target},y^{target})$

Obviously, implementing this method relies on a $for$ loop:

$\begin{array}{l}\text{for } i = 0 \ldots Height \\ \quad \text{for } j = 0 \ldots Width \\ \quad\quad \text{calculate \& copy}\end{array}$
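The forward loop can be sketched in plain Python as follows. The function name `forward_map`, the use of raw pixel coordinates instead of normalized ones, and rounding to the nearest target pixel are all simplifying assumptions made for this illustration.

```python
def forward_map(source, theta, out_h, out_w):
    """Copy each source pixel to its transformed target position, one at a time."""
    target = [[None] * out_w for _ in range(out_h)]
    for i in range(len(source)):            # rows (y)
        for j in range(len(source[0])):     # columns (x)
            # Calculate the target position of source pixel (x=j, y=i) ...
            tx = theta[0][0] * j + theta[0][1] * i + theta[0][2]
            ty = theta[1][0] * j + theta[1][1] * i + theta[1][2]
            tx, ty = int(round(tx)), int(round(ty))
            # ... and copy it if it lands inside the target image.
            if 0 <= ty < out_h and 0 <= tx < out_w:
                target[ty][tx] = source[i][j]
    return target


source = [[1, 2],
          [3, 4]]
double = [[2, 0, 0], [0, 2, 0]]  # scale both axes by 2, in pixel coordinates
for row in forward_map(source, double, 4, 4):
    print(row)
# [1, None, 2, None]
# [None, None, None, None]
# [3, None, 4, None]
# [None, None, None, None]
```

Besides being inherently sequential, the forward direction leaves target pixels that no source pixel maps to (the `None` entries).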

To make parallel matrix computation possible, we need to reverse the idea:

$\begin{bmatrix}\theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23}\end{bmatrix}^{'}\begin{bmatrix}x^{target}\\ y^{target}\\ 1\end{bmatrix}=\begin{bmatrix}x^{source}\\ y^{source}\end{bmatrix}$

After that, producing the target image reduces to an array indexing problem:

$PixelMatrix^{target}=PixelMatrix^{source}[x^{source},y^{source}]$
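A minimal sketch of the inverse direction, under the same simplifying assumptions (raw pixel coordinates, hypothetical function name `inverse_map`). Source coordinates are simply floored here, and out-of-range lookups are filled with 0. Each target pixel is computed independently of the others, which is what allows a matrix library to do the lookup as a single indexing operation; plain loops are used only to keep the example dependency-free.

```python
import math


def inverse_map(source, theta_inv, out_h, out_w):
    """Fill every target pixel by looking up its position in the source."""
    src_h, src_w = len(source), len(source[0])
    target = []
    for ty in range(out_h):
        row = []
        for tx in range(out_w):
            # Map the target coordinate back into the source image.
            sx = theta_inv[0][0] * tx + theta_inv[0][1] * ty + theta_inv[0][2]
            sy = theta_inv[1][0] * tx + theta_inv[1][1] * ty + theta_inv[1][2]
            sx, sy = math.floor(sx), math.floor(sy)  # floor to an integer index
            inside = 0 <= sx < src_w and 0 <= sy < src_h
            row.append(source[sy][sx] if inside else 0)
        target.append(row)
    return target


source = [[1, 2],
          [3, 4]]
half = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # inverse of "scale by 2"
for row in inverse_map(source, half, 4, 4):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```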

This relies on a property of the affine matrix:

$\begin{bmatrix}\theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23}\end{bmatrix}^{'}=\begin{bmatrix}\theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23}\end{bmatrix}^{-1}$

That is, when mapping from target back to source, the new affine matrix is the inverse of the original source-to-target affine matrix. (Strictly, the $[2,3]$ matrix is first extended with the row $[0,0,1]$ so that it is square and can be inverted.)
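As a sketch of this property, the inverse of a $[2,3]$ affine matrix can be written in closed form by treating it as a 3x3 matrix with last row $[0,0,1]$: invert the left 2x2 block $A$, and use $-A^{-1}t$ as the new translation. The function names are illustrative assumptions.

```python
def invert_affine(theta):
    """Invert a 2x3 affine matrix, treating it as 3x3 with last row [0, 0, 1]."""
    (a, b, tx), (c, d, ty) = theta
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is not invertible")
    # Inverse of the left 2x2 block.
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    # The inverse translation is -A^{-1} t.
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)]]


def apply_affine(theta, x, y):
    """Apply a 2x3 affine matrix (nested lists) to the point (x, y)."""
    (t11, t12, t13), (t21, t22, t23) = theta
    return (t11 * x + t12 * y + t13, t21 * x + t22 * y + t23)


theta = [[2.0, 0.0, 1.0],
         [0.0, 4.0, -2.0]]
theta_inv = invert_affine(theta)

target = apply_affine(theta, 3.0, 5.0)   # source -> target
print(target)                            # (7.0, 18.0)
print(apply_affine(theta_inv, *target))  # (3.0, 5.0), back at the source point
```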

1.3 Bilinear interpolation

Consider enlarging a $[1,10]$ image by a factor of 10: the 10 pixels have to be stretched along an axis of length 100, so the whole image should end up with 100 pixels.

But for 90 of those pixels, the corresponding source coordinates are non-integer, and if we simply filled them with black ($RGB(0,0,0)$), the image would look terrible.

So the missing pixels must be interpolated: relying on the local similarity of image data, we generate them by averaging neighboring pixels.
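The $[1,10]$ example can be reproduced in a few lines. The sketch below counts the non-integer source coordinates and fills them by one-dimensional linear interpolation between the two nearest pixels; the pixel values, the name `lerp_sample`, and the clamping at the right border are assumptions made for illustration.

```python
import math


def lerp_sample(row, s):
    """Linearly interpolate a 1-D row of pixels at real-valued coordinate s."""
    left = int(math.floor(s))
    right = min(left + 1, len(row) - 1)  # clamp at the right border
    frac = s - left
    return (1 - frac) * row[left] + frac * row[right]


row = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]  # a [1,10] image
scale = 10

# Inverse mapping: target index t corresponds to source coordinate t / scale.
coords = [t / scale for t in range(len(row) * scale)]
print(len(coords), sum(1 for s in coords if s != int(s)))  # 100 90

enlarged = [lerp_sample(row, s) for s in coords]
print(round(enlarged[15], 3))  # 15.0, halfway between 10 and 20
```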

Bilinear interpolation is a method that balances quality and speed (it often appears in the graphics settings of video games: linear interpolation, bilinear interpolation, ...):

If $(x^{source},y^{source})$ is a real-valued coordinate, first truncate it to integers, then step $d$ coordinate units along each axis to get $P_{21}$, $P_{12}$, $P_{22}$.

In practice (in the source code), $d=1$ is used, so the denominator in the formula disappears. Applying the bilinear interpolation formula from the figure then gives an approximate value of $Pixel(x^{source},y^{source})$.
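A minimal sketch of bilinear interpolation with $d=1$, in plain Python. It is not the Theano/Lasagne implementation: the function name `bilinear_sample`, the list-of-rows image layout, and the clamping of neighbors at the border are assumptions, and the coordinates are expected to lie inside the image.

```python
import math


def bilinear_sample(image, x, y):
    """Bilinearly interpolate `image` (a list of rows) at real-valued (x, y)."""
    h, w = len(image), len(image[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))  # truncated corner
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # d = 1 neighbors, clamped
    fx, fy = x - x0, y - y0                          # fractional offsets
    # Interpolate along x on the two rows, then along y between the results.
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom


image = [[0.0, 10.0],
         [20.0, 30.0]]
print(bilinear_sample(image, 0.5, 0.5))  # 15.0, the average of all four pixels
print(bilinear_sample(image, 1.0, 0.0))  # 10.0, exactly on a pixel
```

Because the neighbors are one unit apart, the weights are just the fractional offsets and no division is needed.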
