29. The Fast R-CNN object detection algorithm, explained in detail


Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.

Following R-CNN in 2014, Ross Girshick introduced Fast R-CNN in 2015, with a streamlined and compact pipeline that significantly increased the speed of object detection. The source code is available on GitHub.

Fast R-CNN was proposed mainly because R-CNN has the following problems:

    • Training is multi-stage. As covered in the previous post, R-CNN first fine-tunes a pre-trained network, then trains an SVM classifier for each class, and finally uses regressors to refine the bounding boxes; in addition, the region proposals have to be obtained separately via selective search. The pipeline is cumbersome.
    • Time and memory consumption are large. Training the SVMs and the regressors requires the features produced by the network as input, and saving those features to disk takes considerable time and space.
    • Testing is also slow: convolution is run on every region proposal of every image, so a great deal of computation is repeated.

Although the SPPnet algorithm was proposed before Fast R-CNN to solve R-CNN's repeated-convolution problem, SPPnet still shares many of R-CNN's shortcomings: the training pipeline has too many stages, SVM classifiers and a separate regressor must be trained, and features are still stored on disk. Fast R-CNN is therefore a comprehensive improvement over both earlier algorithms: the training stages are reduced, and features no longer need to be saved to disk.

Based on VGG16, Fast R-CNN trains nearly 9 times faster than R-CNN and about 3 times faster than SPPnet; at test time it is 213 times faster than R-CNN and 10 times faster than SPPnet. Its mAP on VOC2012 is around 66%.

I. The ideas behind Fast R-CNN

The Fast R-CNN method solves three problems of the R-CNN method:

Problem one: testing is slow

R-CNN extracts features from a large number of overlapping candidate boxes in each image, so the feature-extraction work is highly redundant. In this paper, the whole image is normalized and fed into the deep network directly; the candidate-box information is only introduced near the end, so only the last few layers process each candidate box individually.

Problem two: training is slow

The reason is the same as above. During training, this paper first feeds the whole image into the network and then feeds in the candidate regions extracted from that image; the computation of the first few layers is shared across these candidate regions and does not need to be repeated.

Problem three: training requires a lot of storage

The independent classifiers and regressors in R-CNN require a large number of features as training samples. In this paper, classification and localization are unified with the fine-tuning of the deep network, so no additional feature storage is needed.

II. A brief introduction to the algorithm

The backbone network of the algorithm is VGG16.

Here are the steps for training:

    • The input is a 224*224 image, followed by 5 convolution layers and 2 pooling (downsampling) layers (these two pooling layers follow the first and second convolution layers, respectively).
    • Next comes the ROI pooling layer. Its inputs are the output of the conv5 layer and the candidate region proposals, each encoded as 5 numbers (1 image index + 4 geometric coordinates; the index identifies which image of the mini-batch the proposal belongs to during training).
    • Then come two fully connected layers, each with 4096 outputs.
    • Then, in parallel, two sibling fully connected layers with 21 and 84 outputs respectively (they sit side by side, not one after the other): the former is the classification output, giving each region proposal's score for each of the 21 classes; the latter is the regression output, giving the four box coordinates of each region proposal for each class.
    • Finally come two loss layers: the classification loss is a SoftmaxWithLoss layer whose inputs are the label and the output of the classification layer, and the regression loss is a SmoothL1Loss layer whose inputs are the output of the regression layer, the target coordinates, and the weights.
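The layer sizes listed above can be traced with a small shape sketch (a hypothetical helper, not the paper's code; the 512-channel count assumes the VGG16 conv5 output, and num_classes=21 is the 20 VOC classes plus background):

```python
# Shape bookkeeping for the Fast R-CNN head described above.
# Assumption: 512 channels at conv5 (true for VGG16) and a 7x7 ROI pooling grid.

def fastrcnn_head_shapes(num_rois, num_classes=21):
    """Trace tensor shapes from the ROI pooling layer to the two sibling outputs."""
    roi_pool = (num_rois, 512, 7, 7)          # fixed-size output of ROI pooling
    flat = (num_rois, 512 * 7 * 7)            # flattened per-ROI feature
    fc6 = (num_rois, 4096)                    # first 4096-d fully connected layer
    fc7 = (num_rois, 4096)                    # second 4096-d fully connected layer
    cls_score = (num_rois, num_classes)       # 21 class scores per ROI
    bbox_pred = (num_rois, 4 * num_classes)   # 84 = 4 coordinates per class
    return {"roi_pool": roi_pool, "flat": flat, "fc6": fc6, "fc7": fc7,
            "cls_score": cls_score, "bbox_pred": bbox_pred}

shapes = fastrcnn_head_shapes(num_rois=128)   # one mini-batch of 128 ROIs
```

The 21/84 split of the two sibling layers falls out directly: 84 = 4 * 21.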

The process of testing:
Basically the same as training, except that the last two loss layers are replaced by a softmax layer whose input is the classification scores and whose output is the class probabilities. Finally, NMS (non-maximum suppression) is applied for each category.
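The per-class NMS step at the end can be sketched as follows (the standard greedy algorithm, not taken from the paper's released code; the 0.3 IoU threshold is a typical choice, an assumption here):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression, run separately for each class at test time.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes to keep, highest score first.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # process highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop every remaining box that overlaps the kept one too much
        order = order[1:][iou <= iou_thresh]
    return keep
```

Running it per class keeps the best-scoring detection in each cluster of overlapping boxes.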

III. Algorithm explanation

The flowchart of Fast R-CNN is as follows: the input of the network is the original image plus the candidate regions, and the output is the classification category and the bbox regression values. Each candidate box in the original image is, as in SPPnet, mapped to the corresponding region of the convolutional feature map (the "ROI projection" in the figure) and then fed into the ROI pooling layer to obtain a fixed-size feature map. This feature map passes through 2 fully connected layers to obtain the ROI feature vector, which then goes through two sibling fully connected layers: softmax gives the classification, and regression gives the bounding-box refinement. The backbone CNN can come from AlexNet or from VGGNet.

1. ROI Pooling Layer

The only thing that needs explaining here is the ROI pooling layer. Suppose an ROI on the feature map has size h×w (the number of channels is ignored here). The ROI is divided into an H×W grid of sub-windows, each of approximate size h/H × w/W, and max pooling is applied within each sub-window, so the size after pooling is H×W (for the VGG16 network the paper uses H=W=7; the figure draws it as 6×6). No matter how large the original ROI is, it is turned into a 7×7 feature map. Because every ROI is pooled to a 7×7 output, this layer is in fact a special case of SPP: an SPP pyramid with only this single level.
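A minimal single-channel sketch of this pooling (hypothetical helper; integer sub-window boundaries are one simple choice, real implementations differ in how they round):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one ROI of a 2-D feature map to a fixed out_h x out_w grid.

    roi = (x1, y1, x2, y2) in feature-map coordinates; channels would be
    handled independently in the same way.
    """
    x1, y1, x2, y2 = roi
    patch = feature_map[y1:y2, x1:x2]
    h, w = patch.shape
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # boundaries of the (i, j) sub-window, roughly h/out_h x w/out_w
            r0 = i * h // out_h
            r1 = max((i + 1) * h // out_h, r0 + 1)
            c0 = j * w // out_w
            c1 = max((j + 1) * w // out_w, c0 + 1)
            out[i, j] = patch[r0:r1, c0:c1].max()
    return out
```

Whatever the ROI size, the output is always out_h x out_w, which is what lets arbitrary proposals feed fixed-size fully connected layers.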

So the three main improvements of Fast R-CNN are:

    • Convolution is no longer performed on each region proposal, but directly on the entire image, which removes a lot of repeated computation. The original R-CNN runs convolution on every region proposal; since an image has about 2000 region proposals with a very high overlap rate between them, this duplicates a great deal of work.
    • Feature-size normalization is performed with ROI pooling: because the input of the fully connected layers must all have the same size, region proposals of varying sizes cannot be fed in directly.
    • The bounding-box regressors are trained inside the network (each class gets its own regressor), and the original SVM classifier is replaced with softmax.
2. Training

Training the network needs to be considered from the following directions: 1. what the training samples are; 2. what the loss function is; 3. if a new network structure is proposed, how back-propagation works through it. In addition, we can look at the choice of hyper-parameters, and see which of the author's ideas on hyper-parameter selection are worth borrowing.

3. Training samples

From the forward pass of the network you can see that the required inputs are an image and its candidate regions, and the outputs are a category and a bbox, so each candidate region of a training image needs to be labeled in advance with its category and bbox.

The author uses hierarchical sampling to select training images. Each mini-batch contains 128 region proposals (ROIs): 2 images are selected from the training set, and 64 ROIs are chosen from each image to form the 128. In this way, the convolution computation at the front of the network can be shared, reducing the cost of training. Of the 64 ROIs per image, 25% are object regions (IoU > 0.5, u ≥ 1) and the remaining 75% are background (IoU ∈ [0.1, 0.5), u = 0). Data augmentation uses horizontal flipping. At test time, roughly 2000 ROIs are used per image.
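The hierarchical sampling scheme can be sketched as below (hypothetical helper and data layout; the paper does not prescribe this exact interface, only the 2-image / 64-ROI / 25% foreground split):

```python
import random

def sample_rois(rois_per_image, fg_fraction=0.25, batch_images=2, batch_size=128):
    """Hierarchical sampling: pick `batch_images` images, then an equal number
    of ROIs from each, with a fixed foreground/background split.

    rois_per_image: {image_id: (fg_roi_list, bg_roi_list)}, where fg ROIs have
    IoU > 0.5 with a ground-truth box and bg ROIs have IoU in [0.1, 0.5).
    Returns a list of (image_id, roi, label) with label 1 for fg, 0 for bg.
    """
    per_image = batch_size // batch_images           # 64 ROIs per image
    n_fg = int(per_image * fg_fraction)              # 16 foreground ROIs
    n_bg = per_image - n_fg                          # 48 background ROIs
    images = random.sample(list(rois_per_image), batch_images)
    batch = []
    for img in images:
        fg, bg = rois_per_image[img]
        batch += [(img, r, 1) for r in random.sample(fg, n_fg)]   # class u >= 1
        batch += [(img, r, 0) for r in random.sample(bg, n_bg)]   # background u = 0
    return batch
```

Because all 64 ROIs of an image share one forward pass through the convolutional layers, this is far cheaper than sampling 128 ROIs from 128 different images.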

4. Loss function

The classification loss and the regression loss are combined into a single objective. The classification part uses the log loss, i.e. the negative log of the probability of the true class (p_u); the regression loss is basically the same as in R-CNN. The classification layer outputs K+1 dimensions, representing K classes plus one background class.
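Written out, the combined multi-task loss from the paper is (p is the predicted class distribution, u the true class, t^u the predicted box offsets for class u, and v the regression target):

```latex
L(p, u, t^u, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^u, v),
\qquad L_{\mathrm{cls}}(p, u) = -\log p_u
```

The Iverson bracket [u ≥ 1] switches the regression term off for background ROIs (u = 0), which have no bounding-box target.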

This is the regression loss, where t^u denotes the predicted result for class u, and v denotes the ground truth, i.e. the bounding-box regression target. The regression layer outputs a 4*K-dimensional array t, giving the translation and scaling parameters to apply for each of the K classes.
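The smooth L1 function used inside L_loc is simple enough to state directly (the function itself is from the paper; the helper names here are ours):

```python
def smooth_l1(x):
    """Smooth L1 from the Fast R-CNN paper: 0.5 x^2 if |x| < 1, else |x| - 0.5.
    Quadratic near zero, linear for large errors, so it is less sensitive to
    outliers than the L2 loss used for regression in R-CNN."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def loc_loss(t_u, v):
    """L_loc: sum of smooth L1 over the 4 box offsets (tx, ty, tw, th)."""
    return sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v))
```

The linear tail also keeps gradients bounded, which avoids the exploding updates an L2 loss can produce when the regression targets are unbounded.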

Reference articles: "Object Detection: Fast R-CNN"; "Fast R-CNN Algorithm Explained".

