Spatial Pyramid Pooling (SPP-net) Notes


Original from: http://blog.csdn.net/xzzppp/article/details/51377731

Reference paper: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

A brief introduction to the reference paper

1. Introduction

Spatial pyramid pooling converts a feature map of any size into a fixed-size feature vector, which can then be fed into the fully connected layers; this multi-scale extraction of a fixed-length feature vector is the significance of spatial pyramid pooling. The overall framework is: input image, convolutional layers extract features, spatial pyramid pooling extracts a fixed-size feature, fully connected layers.

(A detailed flowchart appears in the original post; the image is not reproduced here.)

2. The general process of the algorithm

First, selective search is run on the image to be detected, generating about 2000 candidate windows. This step is the same as in R-CNN.

Feature extraction phase. This is the biggest difference from R-CNN. Both methods use a convolutional neural network for feature extraction, but SPP-net adds pyramid pooling. The step works as follows: the whole test image is fed into the CNN once to produce feature maps; then the region corresponding to each candidate box is located on the feature maps, and spatial pyramid pooling is applied to that region to extract a fixed-length feature vector. R-CNN, by contrast, feeds each candidate box through the CNN separately. Because SPP-net extracts features from the entire image only once, it is much faster; the paper reports speedups of up to roughly 100x, since R-CNN effectively runs the CNN about 2000 times while SPP-net runs it once.
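The speed argument above can be made concrete with a toy cost model (the function names are hypothetical; each full convolutional forward pass is counted as one unit, and the cheap per-box pooling is ignored):

```python
def rcnn_cnn_runs(num_candidate_boxes):
    # R-CNN warps each candidate box and runs the full CNN on it separately
    return num_candidate_boxes

def sppnet_cnn_runs(num_candidate_boxes):
    # SPP-net runs the convolutional layers once on the whole image,
    # then pools each box's region from the shared feature maps
    return 1

boxes = 2000  # typical number of selective-search proposals
print(rcnn_cnn_runs(boxes), sppnet_cnn_runs(boxes))  # 2000 vs. 1 CNN runs
```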

Finally, an SVM classifier is applied to the feature vectors, as in R-CNN.

3. Key steps explained

3.1 How to find, on the feature maps, the region corresponding to a candidate box in the original image

The candidate boxes are detected on the whole original image, but the feature maps differ in size from the original image, because they are produced from it by a series of convolution and subsampling operations.

A direct formula is used. Suppose (x', y') is a coordinate point on the feature map and (x, y) is the corresponding point on the original input image; then they satisfy the following relation:

(x, y) = (S·x', S·y')

where S is the product of all the strides in the CNN (convolution and pooling layers). Conversely, solving for (x', y') from (x, y) gives, roughly, x' = x/S + 1.

Given a candidate window detected on the original image, take the four corners of the rectangle and map each one with:

Left, top: x' = ⌊x/S⌋ + 1

Right, bottom: x' = ⌈x/S⌉ − 1
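The corner-mapping rules above can be sketched as a small helper (the function name and the example stride S = 16 are illustrative assumptions, not from the original post):

```python
import math

def map_box_to_feature_map(box, S):
    """Map a candidate box from original-image coordinates to feature-map
    coordinates: left/top use floor(x/S) + 1, right/bottom use ceil(x/S) - 1.
    `box` is (left, top, right, bottom); S is the product of all strides."""
    left, top, right, bottom = box
    return (math.floor(left / S) + 1,
            math.floor(top / S) + 1,
            math.ceil(right / S) - 1,
            math.ceil(bottom / S) - 1)

# Example with a total stride of S = 16 and a box from (33, 65) to (256, 288):
print(map_box_to_feature_map((33, 65, 256, 288), 16))  # (3, 5, 15, 17)
```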

3.2 How spatial pyramid pooling extracts features and obtains a fixed-size feature vector

Let's assume a very simple two-layer network: the input is a picture of any size, say (w, h), and the output is 21 neurons. That is, given a feature map of any size, we want to extract exactly 21 features. The spatial pyramid feature extraction process is as follows:

As shown in the paper's schematic, the input picture is divided at several different scales. The schematic uses three scales (4×4, 2×2, 1×1), so the input is divided into 16 + 4 + 1 = 21 blocks in total; extracting one feature from each block yields exactly the 21-dimensional feature vector we want.

At the first scale, the complete picture is divided into 16 blocks, each of size (w/4, h/4);

At the second scale, it is divided into 4 blocks, each of size (w/2, h/2);

At the third scale, the whole picture is a single block of size (w, h).

The max-pooling step of the spatial pyramid then computes the maximum value within each of the 21 blocks, producing one output neuron per block. In this way a picture of any size is converted into a fixed 21-dimensional feature (you can, of course, design outputs of other dimensions by adding pyramid levels or changing the grid sizes). Each of the three scales of division is called a pyramid level, and the block size at each level is called the window size. If you want one pyramid level to output n×n features, use a pooling window of size roughly (w/n, h/n).
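The whole procedure can be sketched in NumPy. This is a minimal illustration, assuming a (channels, H, W) feature map; the bin boundaries are computed with evenly spaced indices, a common approximation of the paper's ceil/floor window rule, and the function name is hypothetical:

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool a (C, H, W) feature map over a pyramid of n x n grids,
    producing a fixed-length vector regardless of H and W."""
    C, H, W = feature_map.shape
    pooled = []
    for n in levels:
        # boundaries of the n x n grid along each axis
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                block = feature_map[:, ys[i]:ys[i+1], xs[j]:xs[j+1]]
                pooled.append(block.max(axis=(1, 2)))  # one value per channel
    return np.concatenate(pooled)  # length = C * (16 + 4 + 1) for (4, 2, 1)

# Feature maps of different sizes yield vectors of the same fixed length:
v1 = spatial_pyramid_pool(np.random.rand(256, 13, 13))
v2 = spatial_pyramid_pool(np.random.rand(256, 10, 7))
print(v1.shape, v2.shape)  # both (5376,), i.e. 256 channels * 21 bins
```

With one channel this reduces exactly to the 21-neuron example in the text; with C channels the fixed output length is 21·C, which is what feeds the fully connected layers.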



