One. R-CNN:
1. First, selective search is run on the image to be detected, generating about 2000 candidate windows (region proposals).
2. Each of the ~2000 candidate windows is warped to 227×227 and fed into a CNN, which extracts a feature vector for it; that is, the CNN is used as the feature extractor.
3. The feature vector of each candidate window is then classified with per-class SVMs.
You can see that R-CNN's computational cost is very high: each of the ~2000 candidate windows must be passed through the CNN separately for feature extraction, so the amount of computation is certainly large.
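The three steps above can be sketched as a toy pipeline. Everything here is a hypothetical stand-in: `selective_search` returns random boxes instead of running the real algorithm, and `cnn_features` / `svm_scores` return random values in place of a trained network and SVMs. The point is only the structure, in particular that the CNN is invoked once per proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_search(image, n=2000):
    """Hypothetical stand-in: returns n random (x0, y0, x1, y1) boxes."""
    h, w = image.shape[:2]
    x0 = rng.integers(0, w - 1, n); y0 = rng.integers(0, h - 1, n)
    x1 = rng.integers(x0 + 1, w);   y1 = rng.integers(y0 + 1, h)
    return np.stack([x0, y0, x1, y1], axis=1)

def cnn_features(warped_crop):
    """Stand-in for the CNN: maps a 227x227 crop to a 4096-d vector."""
    return rng.standard_normal(4096)

def svm_scores(feature_vector, n_classes=20):
    """Stand-in for the per-class linear SVMs."""
    return rng.standard_normal(n_classes)

image = np.zeros((500, 400, 3))
boxes = selective_search(image)              # step 1: ~2000 proposals
# step 2: each proposal is warped to 227x227 and run through the CNN,
# i.e. ~2000 separate forward passes -- this is the expensive part
feats = [cnn_features(box) for box in boxes]
# step 3: classify each feature vector with class-specific SVMs
scores = [svm_scores(f) for f in feats]
```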
Two. SPP-Net:
1. First, selective search is run on the image to be detected, generating about 2000 candidate windows. This step is the same as in R-CNN.
2. Feature extraction. This step is the biggest difference from R-CNN. Both use a convolutional neural network for feature extraction, but SPP-Net adds spatial pyramid pooling. It works as follows: the entire image to be detected is fed into the CNN once, producing feature maps; the region corresponding to each candidate box is then located on the feature maps, and spatial pyramid pooling is applied to that region to extract a fixed-length feature vector. In R-CNN, by contrast, each candidate box is cropped and passed through the CNN individually. Because SPP-Net runs the CNN over the whole image only once, it is much faster; reportedly up to about 100×, since R-CNN effectively runs the CNN about 2000 times while SPP-Net runs it only once.
3. The last step is the same as in R-CNN: the feature vectors are classified with SVMs.
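The key piece in step 2 is spatial pyramid pooling: however large the cropped feature-map region is, it is divided into a fixed grid of bins at each pyramid level and max-pooled per bin, so the output vector always has the same length. A minimal NumPy sketch (the pyramid levels `(1, 2, 4)` are an illustrative choice; bin boundaries here use simple integer division, assuming the region is at least as large as the finest grid):

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature-map region into a fixed-length vector.

    For each pyramid level n, the H x W plane is divided into an n x n
    grid of bins and each bin is max-pooled per channel, so the output
    length is C * sum(n*n for n in levels), independent of H and W.
    """
    c, h, w = feature_map.shape
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                y0, y1 = i * h // n, (i + 1) * h // n
                x0, x1 = j * w // n, (j + 1) * w // n
                out.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(out)

# Regions of different sizes yield vectors of the same fixed length:
v1 = spatial_pyramid_pool(np.random.rand(256, 13, 13))
v2 = spatial_pyramid_pool(np.random.rand(256, 10, 7))
# both have length 256 * (1 + 4 + 16) = 5376
```

This is exactly why SPP-Net can feed arbitrary-sized candidate regions to the fixed-size fully connected layers that follow.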
Three. A question:
How do we find the region on the feature maps that corresponds to a candidate box in the original image?
The candidate boxes are detected on the whole original image, and the feature maps differ in size from that image, since they are produced by a series of operations such as convolution and downsampling. So how do we locate the corresponding region on the feature maps? This is the problem of mapping a window to the feature maps. The authors give a very convenient formula: suppose (x', y') is a coordinate point on the feature map and (x, y) is the corresponding point on the original input image; then they are related by:
(x, y) = (S·x', S·y')
where S is the product of all the strides in the CNN. For example, for the ZF-5 network used in the paper:
S = 2 × 2 × 2 × 2 = 16
For Overfeat-5/7, S = 12; this can be checked against the per-layer stride table in the paper. Note that the strides include both pooling strides and convolution strides. You can verify for yourself that the product for Overfeat-5/7 (the first 5 layers) equals 12.
Conversely, given a point (x, y) on the original image, we want to solve for (x', y') on the feature map. Because of rounding, the paper uses slightly different formulas for the two opposite corners, so that the mapped region lies just inside the box. Given the corner coordinates of each rectangular candidate box detected on the original image, we map them as follows:
Left, top: x' = ⌊x/S⌋ + 1
Right, bottom: x' = ⌈x/S⌉ − 1
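The corner mapping is easy to check numerically. A small sketch using S = 16 (the ZF-5 value from the text; the specific example boxes below are illustrative):

```python
import math

S = 2 * 2 * 2 * 2  # product of all strides in ZF-5 -> 16

def window_to_feature_map(left, top, right, bottom, stride=S):
    """Map a window's corners from image coordinates to feature-map
    coordinates: left/top use floor(x/S) + 1, right/bottom use
    ceil(x/S) - 1, so the mapped region sits just inside the box."""
    x0 = math.floor(left / stride) + 1
    y0 = math.floor(top / stride) + 1
    x1 = math.ceil(right / stride) - 1
    y1 = math.ceil(bottom / stride) - 1
    return x0, y0, x1, y1

# A full 224x224 window maps onto a 13x13 region of the conv5 map:
full = window_to_feature_map(0, 0, 224, 224)   # (1, 1, 13, 13)
# An interior candidate box:
box = window_to_feature_map(32, 48, 160, 192)  # (3, 4, 9, 11)
```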
Four. Comparison of R-CNN and SPP-Net