Continuing from the last study note: Fast R-CNN comes after R-CNN, but before getting to Fast R-CNN, let's first look at a network architecture called SPPnet.
I. Introduction to SPP:
One fact should be clear first: the convolutional layers of a CNN do not require fixed-size images; it is the fully connected layers that require fixed-size input. That is why the SPP layer is placed after the last convolutional layer. SPPnet produces a fixed-length representation from an image of any size by pooling, as shown in the figure:
Advantages of SPP: 1) input of any size, output of fixed size; 2) multiple pyramid levels (multi-level spatial bins); 3) it can pool features extracted at any scale.
Pooling
The SPP layer is structured as follows: the pooling layer that immediately follows the last convolutional layer is replaced with an SPP layer, whose output serves as the input to the fully connected layers. When the input to a deep network is an image of arbitrary size, we can keep applying convolution and pooling as usual until the last few layers. Just before connecting to the fully connected layers, we apply pyramid pooling, which converts a feature map of arbitrary size into a fixed-size feature vector. This is the secret of spatial pyramid pooling: multi-scale features extracted into a fixed-size feature vector.
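The pooling step described above can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: it handles a single-channel feature map stored as a list of rows, the function name `spp_max_pool` is hypothetical, and the way bin boundaries are rounded (floor for the start, ceiling for the end) is an assumption.

```python
import math

def spp_max_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool a 2-D feature map (list of rows) into a fixed-length vector.

    For each pyramid level n, the map is divided into an n x n grid of bins
    and the maximum of each bin is kept, so the output length is
    sum(n * n for n in levels) regardless of the input height and width.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Assumed rounding: floor for the bin start, ceil for the bin end,
            # which guarantees that every bin contains at least one cell.
            top = (i * height) // n
            bottom = math.ceil((i + 1) * height / n)
            for j in range(n):
                left = (j * width) // n
                right = math.ceil((j + 1) * width / n)
                pooled.append(max(
                    feature_map[r][c]
                    for r in range(top, bottom)
                    for c in range(left, right)
                ))
    return pooled

# Two feature maps of different sizes, filled with made-up values
small = [[r * 5 + c for c in range(5)] for r in range(6)]      # 6 x 5
large = [[r * 13 + c for c in range(13)] for r in range(10)]   # 10 x 13
print(len(spp_max_pool(small)), len(spp_max_pool(large)))      # 21 21
```

Both inputs produce 16 + 4 + 1 = 21 values, which is what allows a fully connected layer to follow.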
Note:
BoW, SPM
(The idea of SPP comes from SPM, spatial pyramid matching, and the idea of SPM in turn derives from BoW, bag of words.)
For BoW and SPM, I found two related blog posts and will not go into detail here:
http://blog.csdn.net/v_JULY_v/article/details/6555899
http://blog.csdn.net/jwh_bupt/article/details/9625469
II. Using SPPnet for object detection:
R-CNN repeatedly applies the deep convolutional network to roughly 2,000 windows per image, so feature extraction is very time-consuming. SPPnet runs the expensive convolutional computation only once over the entire image, then uses SPP to pool the feature-map region of each window into a fixed-length representation.
----------------------------------------------------------------------------------------
Here is an aside on three terms commonly used in neural network training, and the difference between them: epoch, iteration, and batch size.
(1) Batch size: the number of samples in a batch. Deep learning is generally trained with SGD, which means each training step draws batchsize samples from the training set;
(2) Iteration: 1 iteration equals one training step on batchsize samples;
(3) Epoch: 1 epoch equals one training pass over all samples in the training set.
For example, if the training set has 1000 samples and batchsize=10,
then training on the entire sample set once takes 100 iterations, which is 1 epoch.
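The arithmetic in this example can be checked with a small helper. The function name `iterations_per_epoch` is made up for illustration, and rounding up for a final partial batch is an assumption about how leftover samples are handled.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    # One iteration consumes one batch. If the samples do not divide evenly,
    # the last (smaller) batch is assumed to count as an iteration too.
    return math.ceil(num_samples / batch_size)

print(iterations_per_epoch(1000, 10))  # 100
```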
------------------------------------------------------------------------------------------
Below is someone else's description of SPP that I think is very good. I am pasting it here first and will continue studying it tomorrow:
---------------------------------
First of all, I have to say that the idea behind the SPP network plays a pivotal role in Fast R-CNN and Faster R-CNN. The SPP network mainly addresses the limitation that a deep network requires a fixed input size, and the paper explains the benefits of not restricting the input size from several angles.
The paper begins with the drawbacks of deep networks at the time: if the network input is fixed, you must choose either a crop strategy or a warp strategy. Cropping cuts a patch of the network input size (for example 227x227) out of a larger image; warping resizes a bounding box to 227x227. Either strategy clearly has drawbacks that hurt training: a crop may contain only part of the object, so the class cannot be learned accurately, while a warp changes the object's normal aspect ratio and degrades training.
The paper then analyzes why a deep network needs a fixed input size: it is because of the fully connected layers. There was no FCN idea at that time, so how could the network be freed from the input-size restriction? The great Kaiming He came up with pooling at different scales to produce feature maps of fixed size, so that the input scale can change freely without being constrained by the fully connected layers. The following is the core idea of the SPP network:
The feature map is pooled at the corresponding scales to produce 4x4, 2x2, and 1x1 feature maps, which are then concatenated into a single column vector and fed to the next fully connected layer. This removes the effect of inconsistent input sizes. Training uses the conventional method, but because the network is no longer affected by scale, multi-scale training is possible: resize the images to a few fixed scales and train the SPP network on them.
Having said all that, what I really want to discuss is how SPP is used for detection. I think the key point of the paper is that it proposes a mechanism for mapping a region of the original image onto conv5. I am not entirely sure about the paper's mapping mechanism, so I will describe what I consider a reasonable mapping method. It also took me a long time to understand how the paper does the mapping.
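As a quick check on the size of the concatenated vector, the bin count can be multiplied by the number of channels. The helper name `spp_output_length` is hypothetical, and the 256 channels used for conv5 here is an assumed value for illustration.

```python
def spp_output_length(channels, levels=(4, 2, 1)):
    # Each level n contributes n * n bins, and every bin holds one value per channel
    bins = sum(n * n for n in levels)
    return bins * channels

# Assumed: conv5 has 256 channels. The spatial size of conv5 does not appear
# anywhere in the formula, which is why the input image size can vary.
print(spp_output_length(256))  # 5376
```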
First, I want to explain what a function is, though not through a rigorous definition. What is y = f(x)? I think that as long as there is an input x and a fixed set of operations f that produces a corresponding y, it counts as a function: each input has a corresponding output. In this sense, convolution is a function, and pooling is a function too. Of course, my aim is not really to explain what a function is; what I want to emphasize is this one-to-one correspondence.
As we all know, by default both convolution and pooling (with no stride, that is, stride 1) now add the appropriate padding so that the output has the same size as the input. One benefit is that the edges do not disappear after a single convolution. In this way, the original image and the convolved image are in one-to-one correspondence: each point of the original image (including the edges) is convolved to give one new point. As shown in the figure (I drew it myself, so it is ugly):
The green part is the image, and the purple part is the convolution kernel.
As can be seen, the blue area is the original region, the red area is the padding, and the purple is the convolution kernel. The region obtained after convolution is in one-to-one correspondence with the original region. Convolution or pooling with a stride greater than 1 is equivalent to convolving or pooling the original image at stride 1 and then subsampling, so the correspondence still holds. As a result, a region of the original image can be mapped to a region on conv5 by dividing its coordinates by the product of all the strides in the network.
To sum up: if you think of it in terms of a function's one-to-one correspondence, it is easy to see why dividing a region of the original image by all the strides maps it to the conv5 region. This way, some operations on the original image can be carried out on conv5 instead, which reduces the complexity of the task.
However, I do not fully agree with this mapping mechanism, because it is only a point-to-point relationship. I think that when a region R of the original image is mapped to a region r on conv5, r should be sensitive to R; in other words, the receptive field of r should intersect R. That gives the following situation:
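The divide-by-all-strides mapping can be written down as a short sketch. This is not necessarily the exact rule used in the paper: the function name `map_box_to_feature_map`, the example stride list, and the choice to round the box outward are all assumptions made for illustration.

```python
import math

def map_box_to_feature_map(box, strides):
    """Project an image-space box (x0, y0, x1, y1) onto a feature map.

    `strides` lists the stride of every conv/pool layer up to that feature
    map; their product is the total downsampling factor.
    """
    total_stride = math.prod(strides)
    x0, y0, x1, y1 = box
    # Assumed rounding: floor the top-left corner and ceil the bottom-right
    # corner, so the projected box still covers the original region.
    return (x0 // total_stride, y0 // total_stride,
            math.ceil(x1 / total_stride), math.ceil(y1 / total_stride))

# Hypothetical network with four stride-2 layers before conv5 (total stride 16)
print(map_box_to_feature_map((64, 32, 200, 170), strides=(2, 2, 2, 2)))
# (4, 2, 13, 11)
```

Rounding outward is one simple way to keep every feature cell whose position falls inside the region; a mapping based on receptive fields, as argued in the text, would select a different set of cells.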
The blue is the receptive field of a conv neuron, the red is a region of interest in the original image, and the black box is what I think the region mapped to conv5 should be.
To detect with SPP, a candidate proposal method (selective search) is chosen first. Unlike R-CNN, which feeds every candidate region through the deep network to extract features, SPPnet extracts features once for the whole image and then maps each candidate box onto conv5. Because the candidate boxes differ in size, their mapped regions on conv5 differ in size too, so the SPP layer is needed to extract features of the same dimension from each of them. Classification and regression follow, and the remaining ideas and methods are the same as in R-CNN.
This is much faster than the original R-CNN. As R-CNN itself pointed out, the deep network needs a very large receptive field, so each region of interest has to be enlarged to the network's input scale before it can be convolved through to conv5. That computation is very expensive, whereas SPP computes the features only once and everything else operates on the conv5 layer.
Of course, even such an elegant algorithm has its flaws. Perhaps Kaiming He was so absorbed in the effectiveness of SPP that he did not make the overall framework more complete. First, SPP's advantage is not exploited in training, which still uses the traditional training method, so the amount of computation remains very large. Second, classification and bounding-box regression could be learned jointly, which would make the overall framework more complete. Kaiming He overlooked these points, which is why a second master went on to create Fast R-CNN.
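The detection flow described above (compute the feature map once, project each proposal onto it, pool each projected region to the same length) can be sketched end to end. Everything here is illustrative: the feature map holds made-up values in place of real conv5 output, the total stride of 16, the two-level pyramid, and the names `pool_region` and `detect_features` are assumptions, and the classifier and regressor are left out.

```python
import math

def pool_region(feature_map, box, levels=(2, 1)):
    """Max-pool the cells inside box (x0, y0, x1, y1) into n x n bins per level."""
    x0, y0, x1, y1 = box
    height, width = y1 - y0, x1 - x0
    out = []
    for n in levels:
        for i in range(n):
            top = y0 + (i * height) // n
            bottom = y0 + math.ceil((i + 1) * height / n)
            for j in range(n):
                left = x0 + (j * width) // n
                right = x0 + math.ceil((j + 1) * width / n)
                out.append(max(feature_map[r][c]
                               for r in range(top, bottom)
                               for c in range(left, right)))
    return out

def detect_features(feature_map, proposals, total_stride=16):
    """Return one fixed-length vector per image-space proposal."""
    vectors = []
    for x0, y0, x1, y1 in proposals:
        # Project the proposal onto the shared feature map (rounding outward)
        fx0, fy0 = x0 // total_stride, y0 // total_stride
        fx1, fy1 = math.ceil(x1 / total_stride), math.ceil(y1 / total_stride)
        vectors.append(pool_region(feature_map, (fx0, fy0, fx1, fy1)))
    return vectors

# Stand-in for conv5 of a 320 x 320 image: a 20 x 20 grid of made-up values,
# computed once and shared by every proposal.
conv5 = [[(r * 7 + c * 3) % 11 for c in range(20)] for r in range(20)]
proposals = [(0, 0, 160, 96), (48, 80, 300, 310), (100, 100, 140, 260)]
vectors = detect_features(conv5, proposals)
print([len(v) for v in vectors])  # [5, 5, 5]
```

The three proposals have different sizes, yet each yields a vector of 4 + 1 = 5 values, and the expensive feature map is never recomputed.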
Another great leap for R-CNN: Object Detection 2 (including SPPnet, Fast R-CNN)