Reading Paper Series: Object Detection spp-net

Last Update:2018-02-15 Source: Internet

Author: User

This article walks through the SPP-net paper:

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Motivation

Much of the success of neural networks in computer vision comes from convolutional neural networks. However, many successful network architectures require a fixed-size input (such as 224x224 or 299x299), so an incoming image has to be stretched or cropped before it is fed into the network.

Cropping loses information and stretching distorts the image, and both make the visual task harder. A model that can accept inputs of various sizes should therefore make the task easier to accomplish.

What limits the size of the input

A deep convolutional neural network has two core components: the convolutional layers and the fully connected layers. A convolution slides a filter across the image and multiplies it element-wise with each local patch. Multiple filters produce multiple feature maps, and a pooling operation can then downsample them into smaller feature maps. The convolutional layers themselves do not care how large a feature map is, and the feature maps of different images can have different sizes.

The constraint comes from the downstream task. In classification, for example, the softmax layer that corresponds to the one-hot labels must produce an output of fixed size, and for different inputs to share one set of weights, the input to the fully connected layer must always have the same size. Working backwards, this forces the feature maps to have the same size too. Because the same set of convolution kernels (filters) produces feature maps of different sizes for input images of different sizes, the input images have to be cropped or stretched to a common size.
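To make the constraint concrete, here is a minimal sketch that pushes two input sizes through the same layers and reports the length a fully connected layer would see after flattening. The layer stack `LAYERS`, the channel count, and the helper names are hypothetical and chosen only for illustration; the size formula is the usual floor convention for convolution and pooling layers.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after one convolution or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_shape(height, width, layers):
    """Push a height/width through a list of (kernel, stride, padding) layers."""
    for kernel, stride, padding in layers:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
    return height, width


# Hypothetical toy stack: 3x3 conv, 2x2 pool, 3x3 conv, 2x2 pool.
LAYERS = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0)]
CHANNELS = 256  # assumed number of filters in the last conv layer

for h, w in [(224, 224), (320, 640)]:
    fh, fw = feature_map_shape(h, w, LAYERS)
    flat = CHANNELS * fh * fw  # input length a fully connected layer would see
    print((h, w), "->", (fh, fw), "flattened length", flat)
```

The convolutional part runs on both inputs, but the flattened lengths differ, so one fixed weight matrix in the fully connected layer cannot serve both.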

Solution

So there are two possible points of attack:

Make the convolutional layers produce an output of the same size for inputs of different sizes (SPP)
Make the fully connected layers produce an output of the same size for inputs of different sizes (fully convolutional networks)

A fully convolutional network differs from an ordinary convolutional network in that the final classification is done with convolutional layers instead of fully connected layers. Suppose we want to turn a 16x16 feature map into a 10x1 one-hot classification: we can use 10 1x1 convolution kernels, one per class (each kernel yields a per-location score map, which can then be pooled into a single score for its class). The number of parameters is far smaller, but experimental results show that it is quite effective. Fully convolutional networks combined with deconvolution have also opened up a new line of work in image segmentation; interested readers can read this blog.
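A minimal sketch of such a classification head, in plain Python, may help. It applies one 1x1 kernel per class and then averages each score map into a single number. The global average pooling step is an assumption of this sketch (the article only mentions the 1x1 kernels), and the 4-channel toy input and all function names are hypothetical.

```python
import random


def conv1x1(feature_maps, weights):
    """1x1 convolution: feature_maps is C x H x W, weights is K x C."""
    channels = len(feature_maps)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            [sum(w[c] * feature_maps[c][i][j] for c in range(channels))
             for j in range(width)]
            for i in range(height)
        ]
        for w in weights
    ]


def global_average_pool(score_maps):
    """Collapse each H x W score map to one number (assumed pooling step)."""
    return [sum(sum(row) for row in m) / (len(m) * len(m[0]))
            for m in score_maps]


def random_maps(channels, height, width, rng):
    return [[[rng.random() for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


rng = random.Random(0)
NUM_CLASSES, CHANNELS = 10, 4  # 4 channels keeps the toy example small
weights = [[rng.uniform(-1, 1) for _ in range(CHANNELS)]
           for _ in range(NUM_CLASSES)]

for h, w in [(16, 16), (8, 12)]:
    maps = random_maps(CHANNELS, h, w, rng)
    scores = global_average_pool(conv1x1(maps, weights))
    print((h, w), "->", len(scores), "class scores")

# Parameter count for a 16x16 input, biases ignored.
conv_params = NUM_CLASSES * CHANNELS          # ten 1x1 kernels
fc_params = NUM_CLASSES * CHANNELS * 16 * 16  # dense layer on the flattened map
print(conv_params, "vs", fc_params)
```

Both input sizes yield 10 scores, and the 1x1 kernels need 40 weights where a dense layer on the flattened 16x16 map would need 10240.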

Here we take a detailed look at SPP.

The SP (Spatial Pyramid) idea in SPP is derived from SPM (Spatial Pyramid Matching), which can be referenced in this article. As the paper's conclusion states: "Our studies also show that many time-proven techniques/insights in computer vision can still play important roles in deep-networks-based recognition."

SPM partitions the image at several resolutions (scales), extracts features from each local region, and integrates them into one final feature. This feature covers both the macro and the micro view (the multiscale pyramid) and preserves regional characteristics (the features of different regions). The similarity between such features is then used to match images (matching).

As mentioned earlier, each filter produces one feature map. The input to SPP is this set of feature maps coming out of the convolutional layers. Each feature map is partitioned at several scales: scale l divides the map into 2^l cells (the number of cells can in fact be chosen freely and need not be 2^l), and l = 0 is the whole map. Each cell is then pooled. The paper uses max pooling, but other pooling operations can be used in practice. Unlike SPM, there is no need to extract SIFT or similar features here, because the feature maps have already been extracted by the convolutional layers. The pooling results are concatenated to obtain a feature of fixed size.

For example, a convolutional layer with 256 filters outputs 256 feature maps. For a 640x320 image the feature maps might be 32x16; for a 640x640 image they might be 32x32. We partition each of the 256 feature maps at 4 scales: 1 cell at the coarsest scale, then 2 cells, then 4, then 8. Max pooling over each cell keeps its largest value and puts it into the final feature, which gives a feature of length 1+2+4+8=15 per feature map. The 256 feature maps together give a final feature of length 256*15=3840. As you can see, the size of the final feature depends only on the structure of the convolutional layer and the SP scales l, not on the size of the input image, which ensures that images of different sizes produce outputs of the same size.
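The example can be sketched in plain Python as follows. The article does not say how the 2, 4, and 8 cells are laid out, so the grid shapes in `GRIDS` are an assumption of this sketch, as are the function names and the random toy feature maps.

```python
import math
import random

# Hypothetical grid shapes (rows, cols) giving 1, 2, 4 and 8 cells per scale.
GRIDS = [(1, 1), (1, 2), (2, 2), (2, 4)]


def spp_one_map(feature_map, grids=GRIDS):
    """Max-pool one H x W feature map over every cell of every grid."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for rows, cols in grids:
        for r in range(rows):
            # floor/ceil boundaries: cells are never empty and cover the map
            top = (r * height) // rows
            bottom = math.ceil((r + 1) * height / rows)
            for c in range(cols):
                left = (c * width) // cols
                right = math.ceil((c + 1) * width / cols)
                pooled.append(max(
                    feature_map[i][j]
                    for i in range(top, bottom)
                    for j in range(left, right)
                ))
    return pooled


def spp(feature_maps, grids=GRIDS):
    """Concatenate the pooled cells of all feature maps into one flat vector."""
    return [v for fm in feature_maps for v in spp_one_map(fm, grids)]


def random_maps(channels, height, width, rng):
    return [[[rng.random() for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


rng = random.Random(0)
for h, w in [(16, 32), (32, 32)]:
    vec = spp(random_maps(256, h, w, rng))
    print((h, w), "->", len(vec))  # 256 * 15 = 3840 in both cases
```

The cell boundaries scale with the feature map, so the number of pooled values depends only on the number of feature maps and the chosen grids.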

Having read this far, you may notice that what actually produces a same-size feature from inputs of different sizes is the pooling operation itself (max pooling, sum pooling, and so on), which aggregates multiple inputs into a single value. The spatial pyramid simply gives the features a better form of organization. Of course, finding such an effective way of organizing features is itself very important. One point remains debatable: max pooling does lose some information, although the multiple pyramid levels can compensate for this.

Experiment

The authors then apply this structure to various network architectures and tasks and achieve very good results (easy to say, but reproducing a pile of papers, modifying the source code, and running that many experiments must have been exhausting). The improvement over R-CNN on the detection task is especially interesting. In R-CNN, every region proposal has to be fed through the convolutional layers separately to be classified. The proposals are rectangular boxes cut from the same image, so many areas go through the same convolution operations repeatedly.

In SPP-net's experiment:

The entire image passes through the convolutional layers only once, and the feature map for the whole image is taken from conv5.
Then, for each region proposal, the corresponding part of the feature map is extracted. Computing this location is not free, but it is much faster than running the convolution itself. A region of the original image corresponds to exactly one region of the feature map, but the part of the original image that a feature-map region actually depends on (the so-called receptive field) is larger than the region proposal. In this sense more irrelevant information is still taken in, but fortunately there is no cropping or warping.
Because region proposals differ in shape, the corresponding feature-map regions differ in size. This is where SPP shows its strength: it converts feature-map regions of different sizes into a feature of uniform size, which is passed to the fully connected layers for classification.
The original image can also be rescaled to several scales while keeping its aspect ratio (in the paper, the width or height is scaled to each of the five sizes {480, 576, 688, 864, 1200}). A feature is computed at each scale, and the features from the different scales are concatenated for classification. This combination improves accuracy to some extent.
There is also a small trick: scale the original image so that the region is close to a fixed size (224x224 in the paper) before feeding it to the network, which further improves accuracy. The paper does not give the reason; a hand-wavy explanation is that when the input scales are closer to each other, the model is easier to train.
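The region lookup and the scale trick in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the total stride of 16 up to conv5, the simple floor/ceil rounding, and applying the scale to the shorter image side are all assumptions of the sketch, and the function names, image size, and proposal box are hypothetical.

```python
import math

STRIDE = 16  # assumed total stride of the layers up to conv5
SCALES = (480, 576, 688, 864, 1200)
TARGET_AREA = 224 * 224


def pick_scale(box, image_size, scales=SCALES):
    """Pick the scale at which the resized proposal area is closest to 224x224.

    Assumes a scale s means "resize so that the shorter image side equals s".
    """
    x0, y0, x1, y1 = box
    short_side = min(image_size)

    def area_gap(s):
        factor = s / short_side
        return abs((x1 - x0) * (y1 - y0) * factor * factor - TARGET_AREA)

    return min(scales, key=area_gap)


def project_box(box, factor, stride=STRIDE):
    """Map an image-space box to half-open feature-map index ranges."""
    x0, y0, x1, y1 = (v * factor for v in box)
    left, top = int(x0 // stride), int(y0 // stride)
    right = max(left + 1, math.ceil(x1 / stride))   # keep at least one column
    bottom = max(top + 1, math.ceil(y1 / stride))   # keep at least one row
    return left, top, right, bottom


def crop(feature_map, window):
    """Cut the window (left, top, right, bottom) out of one H x W feature map."""
    left, top, right, bottom = window
    return [row[left:right] for row in feature_map[top:bottom]]


image_size = (640, 480)            # (width, height), hypothetical
proposal = (100, 100, 324, 324)    # (x0, y0, x1, y1), hypothetical
scale = pick_scale(proposal, image_size)
window = project_box(proposal, scale / min(image_size))

# Stand-in for one conv5 feature map of the resized image (30 rows, 40 columns).
feature_map = [[0.0] * 40 for _ in range(30)]
region = crop(feature_map, window)
print(scale, window, len(region), "x", len(region[0]))
```

The cropped region has whatever size the proposal dictates; a pyramid pooling step applied to it then yields the fixed-size feature for the fully connected layers.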

Because the whole image passes through the convolutional layers only once, this is much faster than the original R-CNN, and the accuracy is not bad either.

Summary

Strictly speaking, SPP-net is not designed as a detection model, but it played a great role in the evolution from R-CNN to Fast R-CNN and is well worth reading. The idea of SPP-net is very interesting: SPP (Spatial Pyramid Pooling) is an improvement to the network structure. Perhaps because the paper was written by Chinese authors, it reads very smoothly. Personally I feel it carries less weight than the R-CNN or DPM papers, but its experiments are very rich, demonstrating the effectiveness of SPP across various network structures on both classification and detection tasks.

This article is an English version of an article originally written in Chinese and published on aliyun.com.
