This article walks you through SPP-net:
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Motivation
The success of neural networks in computer vision is largely due to convolutional neural networks. However, many of the successful architectures require a fixed input size (such as 224x224 or 299x299), so before an image can be fed into the network it has to be cropped or stretched to that size.
Cropping loses information and stretching distorts the image, both of which make the visual task harder, so a model that can accept inputs of arbitrary size should make the task easier to accomplish.
What limits the size of the input
A deep convolutional network has two core components: the convolutional layers and the fully connected layers. A convolutional layer slides its filters over the image and multiplies each filter element-wise with a local patch; multiple filters produce multiple feature maps, which can then be downsampled further with pooling to get smaller feature maps. The convolutional layers do not actually care how large the feature maps are; feature maps from different images can have different sizes while sharing the same set of filters. The constraint comes from the task-specific layers that follow. In a classification task, for example, the softmax layer has to produce a fixed-size one-hot output, and for different inputs to share one set of weights the fully connected layer needs an input of fixed size. Pushing that requirement backwards forces the final feature map to have a fixed size as well. Since input images of different sizes passed through the same filters produce feature maps of different sizes, the images have to be cropped or stretched to a common size first.
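To make this constraint concrete, here is a toy PyTorch sketch (the layer sizes are made up for illustration, not taken from the paper): the convolution happily accepts either input size, while the fully connected layer only accepts the one it was built for.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # the same filters work for any input size
fc = nn.Linear(16 * 56 * 56, 10)                   # but this layer expects exactly 16*56*56 values

small = torch.randn(1, 3, 224, 224)
large = torch.randn(1, 3, 320, 256)
print(conv(small).shape)   # torch.Size([1, 16, 224, 224])
print(conv(large).shape)   # torch.Size([1, 16, 320, 256])  -> conv handles both sizes

pooled = nn.MaxPool2d(4)(conv(small))              # (1, 16, 56, 56): exactly what fc expects
print(fc(pooled.flatten(1)).shape)                 # torch.Size([1, 10])
# fc(nn.MaxPool2d(4)(conv(large)).flatten(1))      # would fail: (1, 16, 80, 64) flattens to the wrong length
```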
Solution
So there are two ways to break this limitation:
- Let the convolutional part produce a fixed-size output for inputs of different sizes (SPP)
- Let the head itself handle inputs of different sizes by replacing the fully connected layers with convolutions (fully convolutional networks)
The difference between a fully convolutional network and an ordinary one is that the final classification is done with convolutional layers instead of fully connected layers. Suppose we want to turn a 16x16 feature map into a 10-way one-hot classification: we can use ten 1x1 convolution kernels, one per class, which needs far fewer parameters. And yet... the experiments show it works quite well, and full convolution plus deconvolution later opened up a whole new line of work in image segmentation; interested readers can look that up separately.
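A minimal sketch of the 1x1-convolution idea in PyTorch (the 512 input channels and the global average at the end are my assumptions for illustration; the text above only fixes the 16x16 spatial size and the 10 classes):

```python
import torch
import torch.nn as nn

classifier = nn.Conv2d(512, 10, kernel_size=1)  # ten 1x1 kernels, one per class (512 channels assumed)

fmap = torch.randn(1, 512, 16, 16)              # the 16x16 feature map from the text
scores = classifier(fmap)                       # (1, 10, 16, 16): one score map per class
logits = scores.mean(dim=(2, 3))                # average each score map -> (1, 10)
print(logits.shape)                             # torch.Size([1, 10])

# Parameter count: the 1x1 conv needs 512*10 + 10 weights, whereas a fully connected layer
# on the flattened 512*16*16 input would need 512*16*16*10 + 10.
```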
Now let us take a closer look at SPP.
The SP (spatial pyramid) idea in SPP comes from SPM (Spatial Pyramid Matching), which you can read about in the original SPM work. As the paper's conclusion puts it: "our studies also show that many time-proven techniques/insights in computer vision can still play important roles in deep-networks-based recognition."
SPM partitions the image at several resolutions (scales), extracts features from each local cell, and assembles them into one final feature. This feature is multi-scale (a pyramid from coarse to fine) and preserves regional information (different cells contribute different parts), and the similarity between such features is then used to match images. As mentioned earlier, each filter yields one feature map, and these feature maps are exactly the input to SPP. SPP partitions each feature map at several scales: at scale l the map is divided into 2^l cells (in fact the number of cells per level is a free choice, it does not have to be 2^l), with l = 0 corresponding to the whole map as a single cell. Each cell is then pooled; the paper uses max pooling, but other pooling operations would also work. Unlike SPM there is no need to extract SIFT or other hand-crafted features here, because the feature maps have already been produced by the convolutional layers. Finally, the pooled values are concatenated, which gives a feature of fixed size.
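A minimal sketch of such an SPP layer in PyTorch (my own toy implementation, not the authors' code; each pyramid level is written as a (rows, cols) grid and pooled with adaptive max pooling):

```python
import torch
import torch.nn as nn

class SPPLayer(nn.Module):
    """Toy spatial pyramid pooling: pool the input onto several fixed grids and concatenate."""

    def __init__(self, levels=((1, 1), (1, 2), (2, 2), (2, 4))):
        # 1 + 2 + 4 + 8 = 15 cells in total, matching the example below
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(level) for level in levels])

    def forward(self, x):                                  # x: (N, C, H, W) with any H and W
        pooled = [pool(x).flatten(start_dim=1) for pool in self.pools]
        return torch.cat(pooled, dim=1)                    # (N, C * total number of cells)
```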
For example, take a convolutional layer with 256 filters, which outputs 256 feature maps. For a 640x320 image the feature maps might be 32x16; for a 640x640 image they might be 32x32. We partition each of the 256 feature maps at 4 scales: 1 cell at the coarsest scale, then 2 cells, then 4, then 8. Max pooling over each cell keeps its largest value and appends it to the final feature, so each feature map contributes a vector of length 1+2+4+8 = 15, and the 256 feature maps together give a feature of length 256*15. As you can see, the final feature size depends only on the convolutional architecture and the pyramid scales, not on the input image, which guarantees that images of different sizes produce outputs of the same size.
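Running the numbers from this example through the SPPLayer sketch above (the exact grid shape of each level is my assumption; only the cell counts 1, 2, 4 and 8 come from the example):

```python
spp = SPPLayer()                                   # levels with 1, 2, 4 and 8 cells
for h, w in [(32, 16), (32, 32)]:                  # the two feature-map sizes from the example
    feature = spp(torch.randn(1, 256, h, w))
    print(feature.shape)                           # torch.Size([1, 3840]) both times: 256 * 15
```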
In fact, looking at it this way, you may notice that what really maps different-sized inputs to a feature of the same size is the pooling operation itself: max pooling, sum pooling and the like aggregate many inputs into a single value, while the spatial pyramid merely gives the pooled features a better organization. Finding such an effective way of organizing features is of course valuable in itself. Still, there is something debatable here: max pooling does throw away some information, even if the multiple pyramid levels partly compensate for that.
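To see that point in isolation (again a toy snippet): a single adaptive max pool already maps any input size onto a fixed grid; the pyramid only stacks several such grids.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveMaxPool2d((4, 4))                # one pooling level, fixed 4x4 output grid
for h, w in [(13, 13), (32, 17), (100, 60)]:
    print(pool(torch.randn(1, 256, h, w)).shape)   # always torch.Size([1, 256, 4, 4])
```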
Experiment
The authors then apply this structure to various network architectures and tasks, with very good results (easy to say, but that means reproducing a pile of papers, modifying their source code and running a great many experiments, which must have been exhausting). The improvement over R-CNN on the detection task is the most interesting part. In R-CNN, every region proposal has to be warped to a fixed square size and pushed through the convolutional layers separately to be classified, so overlapping regions repeat the same convolution work many times.
In SPP-net's detection experiments:
- The whole image is passed through the convolutional layers only once, and the feature map of the full image is taken from conv5.
- For each region proposal, the corresponding part of that feature map is then cropped out (see the sketch after this list). Computing this mapping is not free, but it is far cheaper than redoing the convolutions. A region of the original image corresponds to a region of the feature map, yet that feature-map region actually covers a patch of the original image (its receptive field) that is larger than the region proposal, so in that sense some irrelevant information still leaks in; at least there is no cropping or deformation.
- Because region proposals have different shapes, the cropped feature maps have different sizes, and this is exactly where SPP shines: it turns feature maps of different sizes into a feature of uniform size, which is then passed to the fully connected layers for classification.
- The original image can also be rescaled, keeping its aspect ratio, to several scales (in the paper the width or height is scaled to one of the five sizes {480, 576, 688, 864, 1200}); a feature is computed at each scale and the features from different scales are combined for classification, which improves accuracy to some extent.
- There is also a small trick: scale the input so that its area is close to a reference size (224x224 in the paper) before feeding it to the network, which further improves accuracy. As for why... the paper does not really say; the hand-wavy interpretation is that when the input scales are closer together, the model is easier to train.
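A rough sketch of that detection pipeline, reusing the SPPLayer from earlier (the coordinate mapping below simply divides by the total convolutional stride, whereas the paper uses a slightly more careful padding-aware formula; the backbone and the linear head are toy stand-ins for the real conv stack and the fully connected layers/classifier):

```python
import torch
import torch.nn as nn

def detect(image, proposals, backbone, spp, head, stride=16):
    fmap = backbone(image)                        # run the convolutions once on the full image
    scores = []
    for x1, y1, x2, y2 in proposals:              # proposals given in image coordinates
        fx1, fy1 = x1 // stride, y1 // stride     # project the box onto the feature map
        fx2 = max(fx1 + 1, x2 // stride)
        fy2 = max(fy1 + 1, y2 // stride)
        region = fmap[:, :, fy1:fy2, fx1:fx2]     # crop conv features, no re-convolution needed
        scores.append(head(spp(region)))          # SPP gives a fixed-length vector per proposal
    return torch.stack(scores)

backbone = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.MaxPool2d(16))  # toy stand-in with total stride 16
spp = SPPLayer()                                  # the sketch from earlier: 256 * 15 outputs
head = nn.Linear(256 * 15, 21)                    # e.g. 20 object classes + background (assumed)
image = torch.randn(1, 3, 688, 920)
proposals = [(40, 60, 300, 420), (100, 100, 600, 500)]
print(detect(image, proposals, backbone, spp, head).shape)  # torch.Size([2, 1, 21])
```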
Since the whole image goes through the convolutions only once, this is much faster than the original R-CNN, and the accuracy is comparable.
Summary
Strictly speaking SPP-net is not a detection model, but it played a major role in the evolution from R-CNN to Fast R-CNN, so it is well worth reading. The idea behind SPP-net is very interesting: SPP (Spatial Pyramid Pooling) is an improvement to the network structure. Perhaps because it was written by Chinese authors, I found it very easy to read. Personally I feel the conceptual contribution is not as high as the R-CNN or DPM papers, but the experiments are very rich, demonstrating the effectiveness of SPP across various network architectures on both classification and detection tasks.
Paper Reading Series: Object Detection, SPP-net