Deep Learning Object Detection Series (II): SPP-Net


Deep Learning Object Detection Series (I): R-CNN
Deep Learning Object Detection Series (II): SPP-Net
Deep Learning Object Detection Series (III): Fast R-CNN
Deep Learning Object Detection Series (IV): Faster R-CNN
Deep Learning Object Detection Series (V): R-FCN

Introduction to SPP-Net

The previous article in this series covered R-CNN in detail, including its fatal flaw: extremely long training time (84 h) and test time (47 s per image). The main cause is repeated convolution computation. In R-CNN, the inputs to the CNN are the 1000-2000 region proposals extracted by the Selective Search (SS) algorithm, which means the convolution is computed 1000-2000 times per image. Because the SS proposals overlap heavily, most of this repeated computation is unnecessary.
A natural fix is to run the convolution once over the whole image and extract the features for every region from that single pass. This is the main contribution of SPP-Net, and it became the shared goal of many network structures after R-CNN: how to share convolution computation.

SPP-Net makes two main improvements:
1. Shared convolution computation
2. Spatial pyramid pooling (SPP)

Like R-CNN, SPP-Net is composed of these parts:
SS algorithm (region proposals)
CNN network (feature extraction)
SVM classifier
Bounding-box regressor

The region proposals are still generated by the SS algorithm on the original image, but the corresponding features are extracted from the conv5 feature map. Because the spatial size changes through the network, the proposal coordinates must be rescaled before they can be applied to conv5. This is the biggest difference from R-CNN, and the reason SPP-Net is so much faster: it makes full use of the convolution computation, convolving each image only once. However, this improvement brings a new problem. The proposals produced by SS have inconsistent sizes, so the features cropped from conv5 also have inconsistent sizes, and a fixed-size fully connected layer (as in AlexNet) cannot accept them.
SPP-Net therefore needs an operation that produces a fixed-size output from variable-size input. That operation is SPP, spatial pyramid pooling, which replaces the last pooling layer used in R-CNN. Apart from this, the pipeline is the same as R-CNN.

How to Share Convolution Computation

The figure above (omitted here) contrasts R-CNN and SPP-Net: in R-CNN the CNN input is each SS-generated region proposal (after size normalization), whereas in SPP-Net the CNN input is the whole image, and the region proposals are cropped from the conv5 feature map after the convolution. One problem remains: the image shrinks as it passes through the convolution layers, so a proposal generated on the original image cannot be cropped from conv5 directly; its coordinates must first be transformed to match the width and height of the conv5 feature map.

Coordinate Transformation

The width and height of a feature map in a CNN shrink according to the strides: when the stride is 2, the map is half the size of its input. So for a point (x', y') on the conv5 layer, the corresponding point (x, y) in the original image satisfies:

(x, y) = (S·x', S·y')

where S is the product of the strides of all layers before conv5.
Because of padding in the convolutions, the features on conv5 are shifted slightly toward the center of the image, which I personally think is why the top-left corner point is shifted by +1 pixel and the bottom-right corner point by -1 pixel when mapping a proposal onto conv5:

Top-left: x' = ⌊x/S⌋ + 1
Bottom-right: x' = ⌈x/S⌉ − 1

Spatial Pyramid Pooling
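As a concrete illustration, the corner rule above can be written as a small helper. Here S = 16 is only an example value for the accumulated stride (an assumption for illustration, not a number taken from this article):

```python
import math

def img_to_conv5(x, S=16, corner="top_left"):
    """Map an image-plane coordinate x onto the conv5 grid.

    S is the product of all strides before conv5 (16 is an illustrative
    value). Top-left corners round down and shift right by 1; bottom-right
    corners round up and shift left by 1, per the rule described above.
    """
    if corner == "top_left":
        return math.floor(x / S) + 1
    return math.ceil(x / S) - 1
```

For example, a box spanning pixels 0 to 160 would map to conv5 cells 1 through 9 with S = 16.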

After the coordinate transformation, proposals generated on the original image can be mapped onto conv5. But this creates a new problem: the cropped features still have inconsistent sizes, so they cannot be fed into the fully connected layer. The solution is the one mentioned above, SPP:

The figure above (omitted here) explains the principle of SPP. For an input of any size, SPP divides the feature map into 16, 4, and 1 bins and applies max pooling inside each bin, leaving the channel depth unchanged; the pooled values are then concatenated into a single vector that is fed to the fully connected layer. Assuming the feature depth is 256, the length of the one-dimensional SPP feature is (16 + 4 + 1) × 256, so the dimensionality is fixed regardless of input size.

SPP-Net Training and Testing
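A minimal NumPy sketch of this pooling step, using the 4×4, 2×2, and 1×1 levels from the figure (bin boundaries use floor/ceil division so every bin is non-empty for any input size of at least 4×4):

```python
import numpy as np

def spp(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid pooling over a (C, H, W) feature map.

    Returns a 1-D vector of length C * sum(n*n for n in levels),
    independent of H and W: each pyramid level splits the map into
    n x n bins and max-pools within each bin.
    """
    c, h, w = feature_map.shape
    parts = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                y0, y1 = (i * h) // n, -(-((i + 1) * h) // n)  # floor, ceil
                x0, x1 = (j * w) // n, -(-((j + 1) * w) // n)
                parts.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(parts)
```

With a depth of 256 this yields the (16 + 4 + 1) × 256 = 5376-dimensional vector described above, whatever the spatial size of the input.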

The SPP-Net training process:
First, take an AlexNet model pre-trained on ImageNet, compute the conv5 features with it, and extract SPP features from conv5 for each SS-generated region proposal. These features are used to fine-tune the fully connected layers (with AlexNet acting as a classification model).
After fine-tuning, an SVM classifier is trained on the fc7 features, and the bounding-box regressor is trained on the SPP features (the same as in R-CNN).
The SPP-Net test process:
For a test image, first extract the conv5 and fc7 features of the whole image with the trained AlexNet, and generate 1000-2000 region proposals with the SS algorithm. After transforming the proposal coordinates, extract SPP features from conv5. The fc7 features are fed into the SVMs for category prediction, and the SPP features are fed into the bounding-box regressor for box correction.

Performance Evaluation of SPP-Net
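The key idea of this test-time flow, one shared convolution pass followed by per-proposal SPP, can be sketched as follows. Here `conv_net` stands in for the trained AlexNet convolution layers and `stride=16` for the accumulated stride; both names and values are illustrative assumptions:

```python
import numpy as np

def spp(fmap, levels=(4, 2, 1)):
    """Fixed-length spatial pyramid pooling over a (C, H, W) feature map."""
    c, h, w = fmap.shape
    parts = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                y0, y1 = (i * h) // n, -(-((i + 1) * h) // n)  # floor, ceil
                x0, x1 = (j * w) // n, -(-((j + 1) * w) // n)
                parts.append(fmap[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(parts)

def extract_region_features(image, conv_net, proposals, stride=16):
    """One conv pass for the whole image, then SPP for each mapped proposal.

    conv_net: callable image -> (C, H, W) conv5 feature map (assumed stub).
    proposals: (x0, y0, x1, y1) boxes in original-image coordinates.
    """
    fmap = conv_net(image)  # shared convolution: computed exactly once
    feats = []
    for (x0, y0, x1, y1) in proposals:
        # map image coordinates onto the conv5 grid (+1 / -1 corner rule)
        fx0, fy0 = x0 // stride + 1, y0 // stride + 1
        fx1, fy1 = -(-x1 // stride) - 1, -(-y1 // stride) - 1
        region = fmap[:, fy0:fy1 + 1, fx0:fx1 + 1]
        feats.append(spp(region))
    return np.stack(feats)  # every proposal yields a same-length vector
```

Each proposal's fixed-length vector would then go to the SVMs and the bounding-box regressor; the expensive convolution never repeats per region, which is exactly where the speedup over R-CNN comes from.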

The figure above (omitted here) compares the performance of SPP-Net and R-CNN. Training SPP-Net takes 25 hours versus 84 hours for R-CNN, and testing a single image takes only 2.3 s for SPP-Net versus 47 s for R-CNN. This speedup comes from the shared convolution computation and is SPP-Net's most important contribution. On the last metric, mAP, SPP-Net is lower than R-CNN, because the SPP-Net training procedure cannot fine-tune the convolution layers.

Problems with SPP-Net

Finally, the performance evaluation above shows that SPP-Net is dramatically faster, and its idea of shared convolution computation carries over into the later Fast R-CNN and Faster R-CNN. But the training process also reveals its weaknesses:
It cannot fine-tune the convolution layers; Fast R-CNN later solves this with a multi-task loss function and RoI pooling.
Training remains multi-stage, as in R-CNN, which leaves room for improvement.
Because training is multi-stage, a large number of features must be stored on disk between stages.
