STN_OCR: Spatial Transformer Networks


From the paper "Spatial Transformer Networks" (Jaderberg et al., NIPS 2015).

Insight:

The effect of an STN is similar to traditional rectification. In face recognition, for example, you normally first detect the face, then detect facial keypoints, and then use those keypoints to align the face. That pipeline requires extra processing steps. With an STN, once the face is detected you can perform the alignment directly, and the key point is that this rectification process is differentiable, so gradients can flow through it. Imagine detecting a face, cropping its feature map with ROI pooling, and feeding that into an STN for rectification; the rectified face comes out the other side. Convolution layers can be attached after it for classification, so face recognition can be trained directly. In theory the whole pipeline is differentiable, so detection + alignment + recognition could be implemented as a single network. In practice, of course, various tricks may be needed.

Basics of spatial transformation:

2D affine transformation (affine):

Translation:


Rotation:

Scaling:
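In homogeneous coordinates, these three 2D transforms take the standard matrix forms (translation by (t_x, t_y), rotation by angle θ, scaling by (s_x, s_y)):

```latex
T_{\text{translate}} =
\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix},
\quad
T_{\text{rotate}} =
\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix},
\quad
T_{\text{scale}} =
\begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}
```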

3D perspective transformation (projection):

Translation:

Rotation:

Scaling:
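The 3D counterparts are 4×4 homogeneous matrices; for rotation, the matrix about the z-axis is shown as one example (rotations about x and y are analogous):

```latex
T =
\begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix},
\quad
R_z =
\begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},
\quad
S =
\begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
```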

STN Network structure:

The STN consists of three parts: a localisation network, a grid generator, and a sampler.

Localisation Network:

This is a simple regression network. Several convolution operations are applied to the input image, and fully connected layers then regress the transformation parameters. Assuming an affine transformation, these are 6 values forming a 2×3 matrix.
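As a minimal NumPy sketch (not from the paper's code) of what the regressed parameters look like: the 6 values are reshaped into a 2×3 matrix θ, and a common trick is to initialize the final regression layer so that θ starts as the identity transform, meaning the STN initially passes its input through unchanged.

```python
import numpy as np

# The localisation network regresses 6 values, reshaped into a 2x3 affine matrix.
# Initialized as the identity transform, the STN leaves the input untouched at first.
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])  # 2x3 identity affine

# Applying theta to a target coordinate (x_t, y_t) in homogeneous form
# yields the corresponding source coordinate (x_s, y_s).
pt = np.array([0.5, -0.25, 1.0])     # (x_t, y_t, 1)
print(theta @ pt)                    # identity: the same point comes back
```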

Grid Generator:

The grid generator takes each coordinate position in the target map V and, through a matrix operation, computes the corresponding coordinate position in the source map U. In other words, it generates the sampling grid T(G).

For a 2D affine transformation (rotation, translation, scaling), this grid-generation step is a simple matrix multiplication.

In the formula, the superscript s denotes coordinates in the source map and t denotes coordinates in the target map; the 2×3 matrix holds the 6 values regressed by the localisation network.
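In the paper's notation, the pointwise transformation for an affine A_θ is:

```latex
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
  \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
```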

The whole grid-generation process can be pictured as follows. Imagine that the target feature map V is entirely blank (all white or all black): it carries no pixel information yet, only coordinate positions. Each target coordinate (0,0), (0,1), ... is multiplied by the 2×3 transformation matrix, producing the corresponding coordinate in the source map, e.g. (5,0), (5,1). Repeating this for all target coordinates gives each one a corresponding source coordinate; this set of source coordinates is T(G). The pixels of the source feature map U are then copied into V according to T(G), producing the pixels of the target map.
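A minimal NumPy sketch of this grid-generation step, assuming raw pixel coordinates rather than the normalized [-1, 1] coordinates used in the paper; with a pure translation it reproduces the (0,0) → (5,0) example above:

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """For every target coordinate (x_t, y_t) in the output grid, compute the
    source coordinate (x_s, y_s) = theta @ (x_t, y_t, 1)."""
    ys, xs = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    ones = np.ones_like(xs)
    # Stack all target coordinates in homogeneous form: shape (3, H*W)
    tgt = np.stack([xs.ravel(), ys.ravel(), ones.ravel()])
    src = theta @ tgt                      # shape (2, H*W)
    return src.reshape(2, out_h, out_w)    # source x and y for every output pixel

# Pure translation by +5 in x: output pixel (0,0) samples source pixel (5,0)
theta = np.array([[1.0, 0.0, 5.0],
                  [0.0, 1.0, 0.0]])
grid = affine_grid(theta, 4, 4)
print(grid[0, 0, 0], grid[1, 0, 0])   # x_s = 5.0, y_s = 0.0
```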

Sampler:

Based on the coordinate information in T(G), the sampler reads pixels from the source map U and copies them into the target map V, interpolating where the source coordinates are not integers.
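Because T(G) generally produces fractional source coordinates, the paper uses bilinear interpolation, which keeps the sampling step differentiable. A sketch for a single-channel image U with source coordinates in pixel units (names and boundary handling are my own, not the paper's):

```python
import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Sample U at real-valued source coords via bilinear interpolation."""
    H, W = U.shape
    # Integer corners of the 2x2 neighbourhood around (x_s, y_s)
    x0 = np.clip(np.floor(x_s), 0, W - 2).astype(int)
    y0 = np.clip(np.floor(y_s), 0, H - 2).astype(int)
    dx, dy = x_s - x0, y_s - y0
    # Weighted sum of the four neighbouring pixels
    return (U[y0,     x0    ] * (1 - dx) * (1 - dy) +
            U[y0,     x0 + 1] * dx       * (1 - dy) +
            U[y0 + 1, x0    ] * (1 - dx) * dy +
            U[y0 + 1, x0 + 1] * dx       * dy)

U = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(U, 1.5, 0.5))  # midpoint of pixels 1, 2, 5, 6 -> 3.5
```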

Experimental results:

The authors experiment on three datasets: MNIST, Street View House Numbers (SVHN), and the CUB-200-2011 Birds dataset.

MNIST experiment:

R: rotation

RTS: rotation, scale and translation

P: projective transformation

E: elastic warping

As the table shows, the FCN error rate is 13.2% and the CNN error rate is 3.5%, while the ST-FCN reaches 2.0% and the ST-CNN 1.7%. The benefit of the STN is clear.

Street View House Numbers experiment:

Whether the input is 64 or 128 pixels, the ST-CNN error rate is lower than that of the conventional CNN.

CUB-200-2011 Birds experiment:



In the figure on the right, the red boxes detect the head and the green boxes detect the body.

This is a fine-grained classification dataset, and many fine-grained recognition papers evaluate on it. The experiment shows that the STN can act as an attention mechanism, learning to focus on the region of interest (ROI).

The experiment shows an improvement of 0.8%.

References:

https://github.com/kevinzakka/spatial-transformer-network


