R-CNN, Fast R-CNN, Faster R-CNN, past and present: (3) SPP-net

Source: Internet
Author: User

SPP-net was introduced in the paper "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition", published by the IEEE in 2015.

The core ideas of spatial pyramid pooling are:

1. A conventional CNN requires a fixed-size input image, so images have to be cropped or warped first. The constraint comes from the fully connected (FC) layers, whose trained weights are tied to a fixed input dimension.

With spatial pyramid pooling, the network can take an image of arbitrary size. Convolution and pooling proceed as usual, and just before the fully connected layers a spatial pyramid (max) pooling layer converts the feature map, whatever its size, into a fixed-size feature vector.

2. The coordinates of each proposal are expressed relative to the original image, even after the image has passed through several convolutional layers. The problem to solve is how to map a proposal from the original image onto the feature map obtained after the convolutions, because pyramid pooling is then applied to that region of the feature map.

Let (x', y') be a point on the feature map and (x, y) the corresponding point on the original input image. The two are related through the network structure: (x, y) = (S*x', S*y')

Conversely, to obtain (x', y') from (x, y), the paper projects the left (top) boundary with x' = floor(x/S) + 1 and the right (bottom) boundary with x' = ceil(x/S) - 1,

where S is the product of all strides in the CNN, including the strides of both the pooling and the convolutional layers.

------------------------------------------

Before SPP-net, CNNs required fixed-size input images, such as 224*224 (ImageNet), 32*32 (LeNet), 96*96, and so on. To handle images of other sizes, the image had to be cropped or warped, which discards part of the picture or distorts it and so limits recognition accuracy. There is also an intuitive argument: when a person looks at a picture, the brain takes it in as a whole rather than cropping or warping it first, so it seems more plausible that the brain gathers shallow information first and deals with arbitrary shapes at a deeper stage.

Why must the input image size be fixed?

The parameters of a convolutional layer are independent of the input size: a convolution kernel simply slides over the image, so inputs of different sizes just yield feature maps of different sizes. The parameters of a fully connected layer, by contrast, depend on the input size, because every input unit is connected to every output unit. The numbers of input and output neurons must be specified in advance, so the size of the input feature has to be fixed.
The fixed-length constraint therefore comes only from the fully connected layers.

In a fully connected layer, the shape of the weight matrix W is determined by the dimension of the input x: if the dimension of x changes, W must change too. A fully connected layer therefore has to fix its numbers of inputs and outputs.
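The point about weight shapes can be illustrated with a small sketch. The helper names, the 3x3 kernel, and the 4096 output units are assumptions chosen for illustration; the 13x13 and 10x10 feature map sizes are the conv5 sizes this article quotes for 224 and 180 inputs.

```python
def conv_weight_shape(kernel, in_channels, out_channels):
    """Convolution weights depend only on the kernel size and channel counts."""
    return (out_channels, in_channels, kernel, kernel)


def fc_weight_shape(feature_h, feature_w, channels, out_units):
    """Fully connected weights depend on the flattened input size."""
    in_units = feature_h * feature_w * channels
    return (out_units, in_units)


# The convolution weights are the same whatever the image size is.
print(conv_weight_shape(3, 256, 256))       # (256, 256, 3, 3)

# The FC weight matrix changes as soon as the feature map size changes.
print(fc_weight_shape(13, 13, 256, 4096))   # (4096, 43264)
print(fc_weight_shape(10, 10, 256, 4096))   # (4096, 25600)
```

A weight matrix trained for 43264 inputs cannot be applied to a 25600-dimensional input, which is why something has to fix the feature length before the first fully connected layer.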

How does SPP-net adjust the network structure?

SPP-net inserts a pyramid pooling layer after the last convolutional layer. This lets the network accept images of arbitrary size while still producing a fixed-size output.

What is pyramid pooling?

Consider an example.

Take the feature map produced by the convolutional layers and divide it with three grids of different coarseness: 4*4, 2*2, and 1*1. Laying these three grids over the feature map gives 16+4+1=21 blocks (spatial bins). Extracting one feature from each of the 21 blocks yields exactly the 21-dimensional feature vector we want.

Pooling over this combination of grids of different sizes is spatial pyramid pooling (SPP). With max pooling, for example, we compute the maximum value within each of the 21 blocks, obtain one output unit per block, and end up with a 21-dimensional feature output.
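The 21-bin example can be sketched in plain Python for a single feature map. The function name and the way bin boundaries are rounded (floor for the start, ceiling for the end) are assumptions made for this sketch, not the paper's exact implementation.

```python
import math


def spp_max_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool one 2-D feature map over an n x n grid for each pyramid level."""
    h = len(feature_map)
    w = len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin i covers rows floor(i*h/n) up to (not including) ceil((i+1)*h/n).
            r0 = (i * h) // n
            r1 = math.ceil((i + 1) * h / n)
            for j in range(n):
                c0 = (j * w) // n
                c1 = math.ceil((j + 1) * w / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def ramp(h, w):
    """Toy feature map whose values increase row by row."""
    return [[r * w + c for c in range(w)] for r in range(h)]


# Feature maps of different sizes give vectors of the same length.
print(len(spp_max_pool(ramp(13, 13))))  # 21
print(len(spp_max_pool(ramp(7, 10))))   # 21
```

The last element of the result comes from the 1*1 level, so it is the global maximum of the feature map.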

Putting the whole process together:

Here 256 is the number of feature maps, that is, the number of filters in the last convolutional layer.

The output of the SPP layer is a vector of size M*k, where M is the number of bins and k is the number of filters. This vector is the input to the fully connected layer.

So even though the feature map computed by conv5 has arbitrary size, after the SPP layer it becomes a fixed-size output, for example (16+4+1)*256 features.
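The arithmetic for the output length can be written as a tiny helper. The function name is hypothetical; the pyramid levels and the 256 filters are the values used in the text.

```python
def spp_output_dim(levels, num_filters):
    """Length of the SPP output: (number of bins M) * (number of filters k)."""
    bins = sum(n * n for n in levels)
    return bins * num_filters


# 4*4 + 2*2 + 1*1 = 21 bins, 256 filters.
print(spp_output_dim((4, 2, 1), 256))  # 5376
```

The size of the conv5 feature map does not appear anywhere in the computation, which is exactly why the output length is fixed.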

What is the significance of pyramid pooling?

In summary, when the network input is an image of arbitrary size, convolution and pooling proceed as usual until the last layers of the network. Just before the fully connected layers, pyramid pooling converts the feature map, whatever its size, into a fixed-size feature vector. This is the point of spatial pyramid pooling: multi-scale feature extraction that yields a fixed-size feature vector.

Network training phase

The paper describes two ways of training the network: single-size and multi-size.

Single-size training first:

In theory, SPP-net supports backpropagation directly on original images of varying scale. In practice, implementations such as Caffe, and GPU/CUDA computation in general, are better suited to fixed-size inputs, so the input size is fixed during training. Take a 224*224 input as an example:

After conv5, the feature map is 13x13 (a*a).
A pyramid level has n*n bins.
Each level is implemented as sliding-window pooling,
with window size = ceil(a/n) (rounded up) and stride = floor(a/n) (rounded down).

For example, with the parameters given in the paper:

For the 3*3 pyramid level, sizeX = ceil(13/3) = 5 and stride = floor(13/3) = 4.

If the input is changed to 180x180, the conv5 response map becomes 10x10, and the new pooling parameters are obtained in the same way.
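As a quick check of these formulas, the sketch below computes the window size and stride for both conv5 sizes and for the 3*3, 2*2, and 1*1 levels. The function name is hypothetical, and the third returned value (the number of window positions along one side) is added here only to confirm that each level really yields n bins per side.

```python
import math


def spp_window_params(a, n):
    """Sliding-window parameters for an n x n pyramid level on an a x a feature map."""
    window = math.ceil(a / n)   # rounded up
    stride = a // n             # rounded down
    positions = (a - window) // stride + 1
    return window, stride, positions


for a in (13, 10):
    for n in (3, 2, 1):
        print(a, n, spp_window_params(a, n))
# 13 3 (5, 4, 3)
# 13 2 (7, 6, 2)
# 13 1 (13, 13, 1)
# 10 3 (4, 3, 3)
# 10 2 (5, 5, 2)
# 10 1 (10, 10, 1)
```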

Multi-size training uses two scales: 224*224 and 180*180.

During training, the 224x224 images are obtained by cropping, and the 180x180 images are obtained by resizing the 224x224 images down. Training then alternates: one epoch on the 224 images, then one epoch on the 180 images, and so on.

At both scales, the output of the SPP layer has the same dimension, (9+4+1)x256, so the parameters are shared, and the fully connected layers follow.

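According to the paper, the purpose of this training scheme is to simulate varying input sizes while still using efficient fixed-size implementations, and its convergence rate is similar to that of single-size training.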
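The alternating schedule can be sketched as follows. The function names are hypothetical and no actual training is performed; the sketch only shows which input size each epoch would use and that the SPP output length is the same for both.

```python
def size_schedule(num_epochs, sizes=(224, 180)):
    """Input size used in each epoch, alternating between the given sizes."""
    return [sizes[epoch % len(sizes)] for epoch in range(num_epochs)]


def spp_dim(levels=(3, 2, 1), num_filters=256):
    """SPP output length; it does not depend on the input size."""
    return sum(n * n for n in levels) * num_filters


print(size_schedule(4))  # [224, 180, 224, 180]
print(spp_dim())         # 3584, i.e. (9+4+1)*256 for both sizes
```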

Network test phase

At test time, an image of any size can be fed to the network directly.

Comparison of SPP-net and R-CNN

For R-CNN, the whole process is:

    1. First, selective search finds about 2000 candidate windows in the image to be detected.
    2. The image region of each of these ~2k candidate windows is scaled to 227*227 and fed into the CNN separately, so that the CNN extracts one feature vector per proposal.
    3. The feature vector of each candidate window is then classified with an SVM.


R-CNN is therefore computationally very expensive, because each of the ~2k candidate windows must be passed through the CNN separately for feature extraction.

For SPP-net, the whole process is:

    1. First, selective search finds about 2000 candidate windows in the image to be detected. This step is the same as in R-CNN.
    2. Feature extraction. This step is the biggest difference from R-CNN. The entire image to be detected is fed into the CNN once to obtain the feature maps. The region corresponding to each candidate box is then located on the feature maps and pooled with spatial pyramid pooling to extract a fixed-length feature vector. R-CNN, by contrast, feeds each candidate box into the CNN separately. Because SPP-net runs the convolutional layers only once over the whole image, it is much faster.
    3. The last step is the same as in R-CNN: an SVM classifies the feature vectors.

Mapping a Window to Feature Maps

As noted earlier, the coordinates of each proposal are expressed relative to the original image, even after the image has passed through several convolutional layers. The problem to solve is how to map a proposal from the original image onto the feature map obtained after the convolutions, because pyramid pooling is then applied to that region of the feature map.

For this mapping, the paper gives a formula:

Let (x', y') be a point on the feature map and (x, y) the corresponding point on the original input image. The two are related through the network structure: (x, y) = (S*x', S*y')

Conversely, to obtain (x', y') from (x, y), the paper projects the left (top) boundary with x' = floor(x/S) + 1 and the right (bottom) boundary with x' = ceil(x/S) - 1,

where S is the product of all strides in the CNN, including the strides of both the pooling and the convolutional layers.

For example, for several network structures, S is calculated as follows:

ZF-5 (the network used in the paper): S = 2*2*2*2 = 16
Overfeat-5/7: S = 2*3*2 = 12
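A minimal sketch of the window mapping follows. It assumes the boundary rule from the paper's appendix (left/top boundary x' = floor(x/S) + 1, right/bottom boundary x' = ceil(x/S) - 1) and non-negative integer pixel coordinates. The function name and the example window are made up for illustration, and no clamping to the feature map size is done.

```python
import math


def map_window_to_feature_map(x1, y1, x2, y2, S=16):
    """Project an image-space window (x1, y1, x2, y2) onto the feature map."""
    left = x1 // S + 1               # floor(x/S) + 1
    top = y1 // S + 1
    right = math.ceil(x2 / S) - 1    # ceil(x/S) - 1
    bottom = math.ceil(y2 / S) - 1
    return left, top, right, bottom


# ZF-5 has S = 16.
print(map_window_to_feature_map(32, 48, 191, 207))  # (3, 4, 11, 12)
```

For very small windows the projected right (bottom) boundary can end up smaller than the left (top) one, so a real pipeline would need to handle that case.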

Detection Algorithm

For detection, the paper proceeds as follows. Selective search (SS) generates ~2k candidate boxes. The image is scaled so that min(w, h) = s and its features are extracted. Each candidate box is then pooled with a 4-level spatial pyramid, using the SPP-net form of ZF-5. The resulting 12800-d feature (256 filters * 50 bins, from the 1*1, 2*2, 3*3, and 6*6 levels) is fed to the fully connected layers, and the output of the fully connected layers is the input to the SVM.

The algorithm can also use multi-scale feature extraction. The image is first resized to five scales, 480, 576, 688, 864, and 1200, which together with the original size gives six scales. Then, in the window-to-feature-map step, each ROI box is assigned the one scale among these six at which the scaled ROI is closest to 224x224, and the ROI feature is extracted from the feature maps at that scale. Doing so can improve the accuracy of the system.
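The scale selection can be sketched as below. It assumes that "closest to 224x224" is measured by the pixel area of the scaled window and that each scale s rescales the image so that its shorter side equals s. The function name and the example numbers are hypothetical; the original image size could be added to `scales` as a sixth option.

```python
def choose_scale(win_w, win_h, img_w, img_h,
                 scales=(480, 576, 688, 864, 1200), target_area=224 * 224):
    """Pick the scale at which the resized window area is closest to target_area."""
    short_side = min(img_w, img_h)

    def scaled_area(s):
        factor = s / short_side          # resize factor so that min(w, h) = s
        return win_w * win_h * factor * factor

    return min(scales, key=lambda s: abs(scaled_area(s) - target_area))


# A small window prefers a large scale, a large window a small one.
print(choose_scale(100, 80, 500, 375))   # 864
print(choose_scale(300, 250, 500, 375))  # 480
```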

The comparison between SPP-net and other networks is not covered in detail here.

The Complete SPP-net

Finally, a single picture describes SPP-net as a whole.


Reference: http://blog.csdn.net/v1_vivian/article/details/73275259
