STN_OCR: Spatial Transformer Networks


From the paper "Spatial Transformer Networks" (Jaderberg et al., NIPS 2015).

Insight:

The effect of an STN is similar to traditional rectification. In face recognition, for example, you normally first detect the face, then detect facial keypoints, and then use those keypoints to align the face. That pipeline requires extra processing steps. With an STN, once the face is detected you can perform the alignment directly, and the key point is that this rectification process is differentiable, so gradients can flow through it. Imagine detecting a face, cropping its feature map with ROI pooling, and feeding that into an STN for rectification; the rectified face comes out the other side. Convolution layers can be attached after it for classification, so face recognition can be trained directly. In theory the whole pipeline is differentiable, so detection + alignment + recognition could be implemented as a single network. In practice, of course, various tricks may be needed.

Basics of spatial transformation:

2D affine transformation (affine):

Translation:


Rotation:

Scaling:
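In homogeneous coordinates, these three 2D transforms take the standard matrix forms (translation by (t_x, t_y), rotation by angle θ, scaling by (s_x, s_y)):

```latex
T_{\text{translate}} =
\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix},
\quad
T_{\text{rotate}} =
\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix},
\quad
T_{\text{scale}} =
\begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}
```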

3D perspective transformation (projection):

Translation:

Rotation:

Scaling:
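The 3D counterparts are 4×4 homogeneous matrices; for rotation, the matrix about the z-axis is shown as one example (rotations about x and y are analogous):

```latex
T =
\begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix},
\quad
R_z =
\begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},
\quad
S =
\begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
```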

STN Network structure:

The STN consists of three parts: a localisation network, a grid generator, and a sampler.

Localisation Network:

This is a simple regression network. Several convolution operations are applied to the input image, and fully connected layers then regress the transformation parameters. Assuming an affine transformation, these are 6 values forming a 2×3 matrix.
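As a minimal NumPy sketch (not from the paper's code) of what the regressed parameters look like: the 6 values are reshaped into a 2×3 matrix θ, and a common trick is to initialize the final regression layer so that θ starts as the identity transform, meaning the STN initially passes its input through unchanged.

```python
import numpy as np

# The localisation network regresses 6 values, reshaped into a 2x3 affine matrix.
# Initialized as the identity transform, the STN leaves the input untouched at first.
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])  # 2x3 identity affine

# Applying theta to a target coordinate (x_t, y_t) in homogeneous form
# yields the corresponding source coordinate (x_s, y_s).
pt = np.array([0.5, -0.25, 1.0])     # (x_t, y_t, 1)
print(theta @ pt)                    # identity: the same point comes back
```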

Grid Generator:

The grid generator takes each coordinate position in the target map V and, through a matrix operation, computes the corresponding coordinate position in the source map U. In other words, it generates the sampling grid T(G).

For a 2D affine transformation (rotation, translation, scaling), this grid-generation step is a simple matrix multiplication.

In the formula, the superscript s denotes coordinates in the source map and t denotes coordinates in the target map; the 2×3 matrix holds the 6 values regressed by the localisation network.
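In the paper's notation, the pointwise transformation for an affine A_θ is:

```latex
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
  \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
```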

The whole grid-generation process can be pictured as follows. Imagine that the target feature map V is entirely blank (all white or all black): it carries no pixel information yet, only coordinate positions. Each target coordinate (0,0), (0,1), ... is multiplied by the 2×3 transformation matrix, producing the corresponding coordinate in the source map, e.g. (5,0), (5,1). Repeating this for all target coordinates gives each one a corresponding source coordinate; this set of source coordinates is T(G). The pixels of the source feature map U are then copied into V according to T(G), producing the pixels of the target map.
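A minimal NumPy sketch of this grid-generation step, assuming raw pixel coordinates rather than the normalized [-1, 1] coordinates used in the paper; with a pure translation it reproduces the (0,0) → (5,0) example above:

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """For every target coordinate (x_t, y_t) in the output grid, compute the
    source coordinate (x_s, y_s) = theta @ (x_t, y_t, 1)."""
    ys, xs = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    ones = np.ones_like(xs)
    # Stack all target coordinates in homogeneous form: shape (3, H*W)
    tgt = np.stack([xs.ravel(), ys.ravel(), ones.ravel()])
    src = theta @ tgt                      # shape (2, H*W)
    return src.reshape(2, out_h, out_w)    # source x and y for every output pixel

# Pure translation by +5 in x: output pixel (0,0) samples source pixel (5,0)
theta = np.array([[1.0, 0.0, 5.0],
                  [0.0, 1.0, 0.0]])
grid = affine_grid(theta, 4, 4)
print(grid[0, 0, 0], grid[1, 0, 0])   # x_s = 5.0, y_s = 0.0
```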

Sampler:

Based on the coordinate information in T(G), the sampler reads pixels from the source map U and copies them into the target map V, interpolating where the source coordinates are not integers.
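Because T(G) generally produces fractional source coordinates, the paper uses bilinear interpolation, which keeps the sampling step differentiable. A sketch for a single-channel image U with source coordinates in pixel units (names and boundary handling are my own, not the paper's):

```python
import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Sample U at real-valued source coords via bilinear interpolation."""
    H, W = U.shape
    # Integer corners of the 2x2 neighbourhood around (x_s, y_s)
    x0 = np.clip(np.floor(x_s), 0, W - 2).astype(int)
    y0 = np.clip(np.floor(y_s), 0, H - 2).astype(int)
    dx, dy = x_s - x0, y_s - y0
    # Weighted sum of the four neighbouring pixels
    return (U[y0,     x0    ] * (1 - dx) * (1 - dy) +
            U[y0,     x0 + 1] * dx       * (1 - dy) +
            U[y0 + 1, x0    ] * (1 - dx) * dy +
            U[y0 + 1, x0 + 1] * dx       * dy)

U = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(U, 1.5, 0.5))  # midpoint of pixels 1, 2, 5, 6 -> 3.5
```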

Experimental results:

The authors experiment on three datasets: MNIST, Street View House Numbers (SVHN), and the CUB-200-2011 Birds dataset.

MNIST experiment:

R: rotation

RTS: rotation, scale and translation

P: projective transformation

E: elastic warping

As the table shows, the FCN error rate is 13.2% and the CNN error rate is 3.5%, while the ST-FCN reaches 2.0% and the ST-CNN 1.7%. The benefit of the STN is clear.

Street View House Numbers experiment:

Whether the input is 64 or 128 pixels, the ST-CNN error rate is lower than that of the conventional CNN.

CUB-200-2011 Birds experiment:



In the figure on the right, the red boxes detect the head and the green boxes detect the body.

This is a fine-grained classification dataset, and many fine-grained recognition papers evaluate on it. The experiment shows that the STN can act as an attention mechanism, learning to focus on the region of interest (ROI).

The experiment shows an improvement of 0.8%.

References:

https://github.com/kevinzakka/spatial-transformer-network


