Section 28: The R-CNN Object Detection Algorithm


Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

The full name of R-CNN is Region-CNN, and it can be considered the first algorithm to successfully apply deep learning to object detection. Fast R-CNN and Faster R-CNN are both built on R-CNN.

Most traditional object detection algorithms are built on image classification. In general, an exhaustive search or a sliding window is used to enumerate all the regions of the image where an object might appear, features are extracted from each of these candidate boxes and classified with an image recognition method, and non-maximum suppression is finally applied to the successfully classified regions to produce the output.

1. The idea of R-CNN

R-CNN follows the traditional object detection pipeline of four steps: extract candidate boxes, extract features from each box, classify with an image classifier, and apply non-maximum suppression. It improves on the traditional approach in two places:

    • The classical object detection algorithm uses a sliding window to examine every possible region in turn. Here, instead, a set of candidate regions that are more likely to contain objects is extracted first, and features are then computed only within these candidate regions.
    • Hand-crafted features (such as SIFT and HOG) are replaced by features extracted with a deep convolutional network.

Two datasets are used during training:
A larger classification dataset (ImageNet ILSVRC 2012): each image is labeled with the category of the object it contains. About 1.2 million images, 1,000 classes.
A smaller detection dataset (PASCAL VOC 2007): each image is annotated with the category and position of every object. About 10,000 images, 20 classes.
The classification dataset is used for pre-training, the detection dataset is then used to fine-tune the parameters, and the final evaluation is carried out on the detection test set.

2. A brief overview of the algorithm

The dataset is PASCAL VOC, which contains 20 object categories. First, the Selective Search method extracts about 2,000 region proposals on each image; these are the regions where objects are likely to appear. Training and test samples are then built from these region proposals. Note that the region proposals vary in size, and that the number of sample categories is 21 (the 20 object classes plus the background).
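
As an aside, Selective Search is available today in opencv-contrib-python; the sketch below is only a stand-in for the original implementation used in the paper, and the file name and the cutoff at 2,000 boxes are illustrative:

```python
import cv2  # requires the opencv-contrib-python package (ximgproc module)

img = cv2.imread("example.jpg")  # hypothetical input image

# Selective Search as implemented in OpenCV's contrib module.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()      # "fast" mode trades some recall for speed
rects = ss.process()                  # array of (x, y, w, h) candidate boxes

proposals = rects[:2000]              # keep roughly 2,000 proposals per image
print(len(proposals), "region proposals")
```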

Next comes pre-training: AlexNet is trained on the ImageNet dataset. The network is then fine-tuned on our dataset with its structure unchanged, except that the output of the last layer is changed from 1000 to 21. The input consists of the region proposals from the previous step, warped to a uniform size of 227*227, and the F7 outputs are kept as features, giving a 2000*4096 matrix per image.
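
A minimal sketch of this warping and feature-extraction step, using torchvision's pretrained AlexNet (torchvision >= 0.13 API) as a stand-in for the Caffe AlexNet used in the paper; the preprocessing values below are torchvision's defaults, not ones taken from the article:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# AlexNet pretrained on ImageNet; drop the final 1000-way layer so the
# network outputs the 4096-dimensional F7 (fc7) feature instead.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.classifier = nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.eval()

# Warp every proposal to 227*227 regardless of its aspect ratio.
warp = transforms.Compose([
    transforms.Resize((227, 227)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def f7_features(image: Image.Image, proposals):
    """Return an (N, 4096) matrix of F7 features for N (x, y, w, h) proposals."""
    crops = [warp(image.crop((x, y, x + w, y + h))) for (x, y, w, h) in proposals]
    with torch.no_grad():
        return alexnet(torch.stack(crops))   # e.g. 2000 x 4096 for one image
```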

An SVM classifier is trained for each of the 20 categories, taking the F7 output as input, so the trained SVM weights form a 4096*20 matrix and scoring one image at test time yields a 2000*20 score matrix. Non-maximum suppression (NMS) is then applied to this score matrix at test time; put simply, NMS removes duplicate boxes. In addition, a regressor is trained for each of the 20 categories; its input is the pool5 feature together with the coordinates, width and height of each sample pair.
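
As an illustration of the SVM stage, here is a rough sketch using scikit-learn's LinearSVC as a stand-in for the SVM solver used in the paper. The feature matrices and labels are random placeholders; only the shapes (4096-d features, a 4096*20 weight matrix, a 2000*20 score matrix) mirror the description above:

```python
import numpy as np
from sklearn.svm import LinearSVC

NUM_CLASSES = 20
FEATURE_DIM = 4096

# Placeholder training data: F7 features and one binary label column per class.
train_feats = np.random.randn(5000, FEATURE_DIM)
train_labels = np.random.randint(0, 2, size=(5000, NUM_CLASSES))  # 1 = positive

# One binary linear SVM per class; stack the learned weights into 4096 x 20.
svms = [LinearSVC().fit(train_feats, train_labels[:, c]) for c in range(NUM_CLASSES)]
W = np.stack([clf.coef_.ravel() for clf in svms], axis=1)   # shape (4096, 20)
b = np.array([clf.intercept_[0] for clf in svms])           # shape (20,)

# Scoring the ~2000 proposals of one image is then a single matrix product.
test_feats = np.random.randn(2000, FEATURE_DIM)             # F7 features at test time
scores = test_feats @ W + b                                 # shape (2000, 20)
```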

3. Training and testing steps

R-CNN training can be divided into the following steps:

  • Prepare region proposals. The Selective Search method is applied to every image in the training set, and each image yields about 2,000 region proposals.
  • Prepare positive and negative samples. If the largest IoU between a region proposal and any ground truth (annotation) box on the current image is greater than or equal to 0.5, that region proposal is a positive sample for the corresponding ground truth category; otherwise it is a negative sample. The ground truth boxes themselves are also included as samples. Since VOC has 20 categories, the region proposals here fall into 20+1=21 classes, where the extra class is the background. Briefly, IoU measures how much two rectangles A and B coincide: IoU = (A∩B)/(A∪B); the larger the IoU, the more similar the two boxes (see the IoU sketch after this list).
  • Pre-training. This step is needed because the detection problem has only a small amount of labeled data, which is not enough to train a large network from scratch. Krizhevsky's famous 2012 network, AlexNet, is used to learn features; its feature layers consist of 5 convolutional layers and 2 fully connected layers. It is pre-trained in the Caffe framework on the ILSVRC 2012 dataset; in effect, a classifier is first trained on a large dataset. ILSVRC 2012 is the well-known ImageNet competition dataset, a color image classification dataset.
  • Fine-tuning. The samples obtained in step 2 are transformed to a common size, because the region proposals from step 2 vary in size; each one is converted to 227*227. In this paper, every region proposal, regardless of its size and aspect ratio, is stretched (warped) directly to this fixed size. The warped proposals are then fed as input to the pre-trained network from step 3 and training continues; this continued training is in fact transfer learning. In addition, since ILSVRC 2012 has 1,000 classes while the dataset here has 21 classes (the 20 VOC categories plus a background category), the only change made for the transfer is that the output of the last fully connected layer is changed from 1000 to 21; the rest of the structure is unchanged. After training finishes, the F7 features are kept.
  • Train a binary SVM classifier for each category. The input is the F7 feature (the F7 output for one image is 2000*4096), the output is whether the proposal belongs to the category, and the training result is the SVM weight matrix W, of dimension 4096*20. The choice of negative samples here differs from the previous step: the IoU threshold is lowered from 0.5 to 0.3, i.e. proposals with IoU < 0.3 are negative samples, and the positive samples are the ground truth boxes themselves. The thresholds differ from those used in fine-tuning mainly because fine-tuning needs a large number of samples, so the looser threshold of 0.5 is used there, while SVMs are well suited to small sample sets, so the stricter 0.3 is used here.
  • Regression. The pool5 feature (6*6*256) of each proposal and the ground truth bounding boxes are used to train the regressors; a separate regressor is trained for each class. The input is the pool5 feature together with the coordinates, width and height of each proposal/ground-truth pair. Only proposals whose IoU with a ground truth box exceeds a certain threshold, and which have the largest IoU with that ground truth, are used for regression; the other region proposals do not participate. In detail: for a region proposal R with corresponding ground truth G, we want the predicted box P to be as close to G as possible. A transformation f(x) is obtained by applying a linear function to the pool5 feature x of R, and this transformation acts on the coordinates of R to carry out the regression (a translation of x, y and a scaling of w, h). The loss function therefore makes the predicted transformation of R as close as possible to the true transformation from R to G (the regression targets are sketched below, after the IoU example).
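
The IoU measure used above for splitting positive and negative samples can be computed directly from two boxes. A minimal sketch, assuming boxes are given in (x, y, w, h) form as produced by Selective Search:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes: |A ∩ B| / |A ∪ B|."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Intersection rectangle (zero width/height if the boxes do not overlap).
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Example: 25 pixels of overlap out of 175 in total, so IoU ≈ 0.14 (< 0.5, negative).
print(iou((0, 0, 10, 10), (5, 5, 10, 10)))
```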

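The bounding-box regression targets follow the formulation in the paper: the center of the proposal is shifted, and its width and height are scaled in log space. A small sketch of the targets and their inverse (boxes here use center coordinates plus width and height, as in the paper's bounding-box regression appendix):

```python
import numpy as np

def regression_targets(p, g):
    """R-CNN bounding-box regression targets for proposal p and ground truth g.

    Boxes are (center_x, center_y, width, height). The per-class regressor,
    a linear function of the pool5 feature, is trained to predict these values.
    """
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    tx = (gx - px) / pw      # horizontal shift of the center, normalized by width
    ty = (gy - py) / ph      # vertical shift of the center, normalized by height
    tw = np.log(gw / pw)     # log-space width scaling
    th = np.log(gh / ph)     # log-space height scaling
    return np.array([tx, ty, tw, th])

def apply_regression(p, t):
    """Apply predicted offsets t to proposal p, producing the corrected box."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty, pw * np.exp(tw), ph * np.exp(th))
```
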
The R-CNN testing procedure can be divided into the following steps:

    • Input an image and use Selective Search to obtain about 2,000 region proposals.

    • Warp all region proposals to the fixed size and feed them into the trained CNN; the F7 layer produces a 4096-dimensional feature per proposal, i.e. a 2000*4096 output for the image.

    • For each category, score the extracted features with that category's trained SVM classifier, so the SVM weight matrix is 4096*N, where N is the number of categories; here there are 20 SVMs, so N = 20 (note that it is not 21, because no SVM is trained for the background). The resulting score matrix is 2000*20, giving the score of each region proposal for each category.

    • Apply non-maximum suppression (NMS) to each column of the score matrix, i.e. remove region proposals with a high mutual overlap and keep the highest-scoring ones in that column. NMS works as follows: for one column of the 2000*20 score matrix, find the region proposal with the highest score, then remove every other region proposal whose IoU with it exceeds a certain threshold. After this round of pruning, find the highest-scoring proposal among those remaining, again remove the others whose IoU with it exceeds the threshold, and repeat until no region proposal is left to examine. Doing this for every column leaves each column (that is, each category) with a small set of region proposals (see the NMS sketch after this list).

    • Use the N = 20 regressors to refine the region proposals of the 20 categories kept in the previous step, based on their pool5 features. The regression weights on the pool5 features were learned in the training phase and are used directly at test time. This finally yields the corrected bounding box for each category.
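
The per-class NMS in the fourth step is a simple greedy loop. A minimal sketch, reusing the iou() helper from the training section above; the overlap threshold is only a placeholder, since the article does not give its value:

```python
def nms(boxes, scores, iou_threshold=0.3):
    """Greedy per-class non-maximum suppression.

    boxes:  list of (x, y, w, h) region proposals for one class;
    scores: their SVM scores.  Returns the indices of the surviving boxes.
    Reuses the iou() helper defined in the training section.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)        # highest-scoring remaining proposal
        keep.append(best)
        # Discard every remaining proposal that overlaps it too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```
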
4. Advantages and disadvantages

Although R-CNN's detection framework is not very different from the traditional one, its results are far better, thanks to the excellent feature extraction capability of the CNN. For example, on the VOC 2007 dataset the best traditional methods reach a mean average precision (mAP) of roughly 40%, while R-CNN reaches 58.5%.

The main disadvantage of R-CNN is its computational cost. The R-CNN pipeline includes generating region proposals, training the convolutional network (softmax classifier, log loss), training the SVMs (hinge loss), and training the regressors (squared loss), which makes training very slow (84 hours) and consumes a large amount of disk space. Moreover, the convolution is computed separately for every region proposal, which repeats a great deal of unnecessary work: a single image yields about 2,000 region proposals, most of which overlap, so per-proposal convolution is very expensive. This is the main problem that Fast R-CNN solves.

References:
  • R-CNN algorithm explained
  • Fast R-CNN slides: http://www.robots.ox.ac.uk/~tvg/publications/talks/fast-rcnn-slides.pdf
  • R-CNN paper explained in detail (recommended)
  • Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation"
