Progress of Deep Convolutional Neural Networks in Object Detection

Last Update:2016-11-03 Source: Internet

Author: User

Travelsea
Links: https://zhuanlan.zhihu.com/p/22045213
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.

In recent years, deep convolutional neural networks (DCNNs) have brought significant improvements in image classification and recognition. In the little more than two years from 2014 to 2016, a series of increasingly fast and accurate object detection methods has appeared: R-CNN, Fast R-CNN, Faster R-CNN, ION, HyperNet, SDP-CRC, YOLO, G-CNN, and SSD.

1. Methods based on region proposals

The basic idea of this family of methods is to first obtain candidate regions and then perform classification and bounding-box regression on each candidate region.

1.1 R-CNN [1]

R-CNN is one of the earliest methods to use a DCNN for object detection. The main idea is to extract features from image regions with a DCNN and classify them with an SVM; the classification result is a preliminary detection. The DCNN features are then used again, together with a separate bounding-box regression model, to obtain a more precise bounding box.

Candidate regions are obtained with the widely used selective search method. A single image yields about 2000 candidate regions of different sizes and classes, and each must be warped to the same size to match the input size the CNN accepts (227×227).

The CNN used in the paper is AlexNet, pre-trained on the 1000-class ImageNet classification task and then fine-tuned so that the network fits the paper's 21-class classification task.

This method achieves 71.8% detection accuracy on the VOC test set. Its disadvantages are: (1) training and testing are split into several stages (candidate region generation, DCNN feature extraction, SVM classification, bounding-box regression), so training is very time-consuming; (2) the DCNN features must be saved during training, which consumes a lot of storage space; (3) at test time, features are extracted once for every candidate region, and because these regions overlap considerably while the features of each region are computed independently, efficiency is low and testing a single image is very slow.

1.2 Fast R-CNN [2]

Building on R-CNN, Ross Girshick proposed Fast R-CNN to speed up both training and testing. With the VGG16 network, it trains 9 times faster and tests 213 times faster than R-CNN. The main ideas are: (1) convolve the entire image once to obtain a feature map, instead of running the convolutions separately for each candidate region; (2) merge candidate region classification and bounding-box regression into a single step rather than performing them separately.

[Figure: Fast R-CNN schematic diagram]

The paper uses an ROI pooling layer to convert the features of candidate regions of different sizes into fixed-size feature maps. Suppose the ROI of a candidate region has size $h \times w$ and the desired output has size $H \times W$. The ROI is divided into an $H \times W$ grid, each cell of size roughly $(h/H) \times (w/W)$, and max pooling is applied within each cell to produce a feature map of the target size.
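ROI max pooling can be illustrated with a minimal sketch in plain Python. It assumes a single-channel feature map stored as nested lists and an ROI given as `(top, left, h, w)` in feature-map coordinates; the function name `roi_max_pool` and this ROI layout are hypothetical choices for illustration, not the paper's implementation.

```python
from math import ceil, floor


def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool one ROI of a 2-D feature map into an out_h x out_w grid.

    feature: list of rows (single channel).
    roi: (top, left, h, w) in feature-map cells (assumed layout).
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Grid row i covers ROI rows [floor(i*h/H), ceil((i+1)*h/H)).
        r0 = top + floor(i * h / out_h)
        r1 = top + ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = left + floor(j * w / out_w)
            c1 = left + ceil((j + 1) * w / out_w)
            # Max over all feature values that fall inside this grid cell.
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 6x6 feature map whose value at (r, c) is 6*r + c.
feature = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = roi_max_pool(feature, roi=(1, 1, 4, 4), out_h=2, out_w=2)
print(pooled)  # [[14, 16], [26, 28]]
```

Whatever the size of the ROI, the output is always `out_h` by `out_w`, which is what lets the fully connected layers that follow accept candidate regions of arbitrary size.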

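Candidate region classification and bounding-box regression are combined through a two-task network structure: two fully connected output layers perform class prediction and bounding-box prediction respectively, and the two tasks are trained simultaneously with a joint cost function:

$$L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda [u \geq 1] L_{\text{loc}}(t^u, v)$$

Here $p$ is the predicted class distribution, $u$ is the true class (with $u = 0$ for background), $t^u$ is the predicted box for class $u$, $v$ is the regression target, and $\lambda$ balances the two terms.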

The two terms in the formula are the classification loss and the regression loss, respectively. This method is much faster than R-CNN. In particular, when testing a new image, near real-time detection can be achieved if the time to generate candidate regions is not counted. However, the selective search algorithm that generates the candidate regions takes about 2 s per image and thus becomes the bottleneck of this approach.
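A joint cost of this kind, a classification term plus a regression term, can be sketched as follows. This is a sketch under assumptions: the classification term is taken to be the negative log-probability of the true class, the regression term a smooth L1 penalty summed over four box coordinates, class 0 is treated as background (no regression term), and the names `smooth_l1`, `multitask_loss`, and `lam` are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification loss plus (for non-background ROIs) regression loss."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background ROIs have no ground-truth box, so no regression term.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


# One foreground ROI: 50% probability on the true class, two coordinates off.
loss = multitask_loss(
    class_probs=[0.5, 0.5],
    true_class=1,
    pred_box=(0.0, 0.0, 0.0, 0.0),
    target_box=(0.5, 2.0, 0.0, 0.0),
)
print(round(loss, 4))
```

Because both terms are computed from the same shared features, a single backward pass updates the network for both tasks at once, which is what removes the separate training stages of R-CNN.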

1.3 Faster R-CNN [3]

Both of the methods above depend on selective search to generate candidate regions, which is very time-consuming. Can a convolutional neural network be used to obtain the candidate regions directly? If so, candidate regions could be obtained at almost no additional time cost.

Shaoqing Ren et al. proposed Faster R-CNN to realize this idea. Suppose there are two convolutional neural networks: one is a region proposal network that produces the candidate regions of an image, and the other is the network that classifies the candidate regions and regresses their bounding boxes. The first few layers of both networks compute convolutions. If the two networks share parameters in these layers and only implement their task-specific functions in the last few layers, then a single forward pass through the shared convolutional layers is enough to obtain both the candidate regions and the class and bounding box of each candidate region.

The region proposal network (RPN) first obtains a feature map by applying a series of convolutions to the input image and then generates candidate regions on this feature map. It slides a $3 \times 3$ window over the feature map, converts each local feature patch into a low-dimensional feature, and predicts for $k$ regions whether each is a candidate region (cls layer, $2k$ outputs) and its corresponding bounding box (reg layer, $4k$ outputs). These $k$ regions are called anchors; they are rectangular boxes of different sizes and aspect ratios that share the same center as the sliding window. If the feature map after convolution has size $W \times H$, there are $WHk$ anchors in total. This way of extracting features and generating candidate regions is translation invariant.
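Anchor generation can be sketched in a few lines: at every feature-map location, place boxes of several sizes and aspect ratios around the same center. The stride, scales, and ratios below are illustrative assumptions, and `make_anchors` is a hypothetical helper name.

```python
def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) in input-image pixels.

    Every feature-map location gets len(scales) * len(ratios) anchors that
    share one center; `ratios` are height/width and each anchor keeps an
    area of scale * scale. All default values are assumptions.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this feature-map cell, mapped back to the image.
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / r ** 0.5
                    h = s * r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_w=4, feat_h=3)
print(len(anchors))  # 108 = 4 * 3 * 9
```

With 9 anchors per location, a feature map 4 cells wide and 3 cells high gives 108 anchors. Since the same set of shapes is reused at every location, a shifted object meets the same anchor shapes at its new position.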

After the RPN produces the candidate regions, classification and bounding-box regression of those regions still use Fast R-CNN, and the two networks share the same convolutional layers. Because Fast R-CNN training requires a fixed method of generating candidate regions, the RPN and Fast R-CNN cannot be trained simultaneously with backpropagation. The paper completes training in four steps: (1) train the RPN alone; (2) train Fast R-CNN using the candidate regions produced by the RPN from step 1; (3) use the network from step 2 to initialize training of the RPN; (4) train Fast R-CNN again to fine-tune the parameters.

The accuracy of Faster R-CNN is similar to that of Fast R-CNN, but both training time and test time are reduced by a factor of about 10.

1.4 ION: Inside-Outside Net [4]

ION is also based on region proposals. Given the candidate regions, and in order to further improve the prediction accuracy for each region of interest (ROI), ION considers information both inside and outside the ROI. It has two innovations: one is to use spatial recurrent neural networks to incorporate contextual features rather than only the local features within the ROI; the other is to concatenate features from different convolutional layers and use them as multi-scale features for prediction.

ION runs RNNs independently in four directions (up, down, left, and right) and combines their outputs into one feature map. After two passes of this process, the result serves as the contextual feature, which is then concatenated with the output features of the earlier convolutional layers. The final feature therefore contains both contextual information and multi-scale information.

1.5 HyperNet [5]

HyperNet builds on Faster R-CNN and improves on it by obtaining better candidate regions than the RPN used in Faster R-CNN. The idea is to combine feature maps from different convolutional layers to produce better region proposals and higher detection accuracy.

The paper combines the outputs of different convolutional layers into a Hyper Feature. The outputs of different convolutional layers differ in size. Feature maps from shallow layers have higher resolution, which helps localize bounding boxes accurately, but objects in those boxes are easily misclassified. Feature maps from deep layers have very low resolution, so the bounding boxes of small objects are imprecise, but their features are more abstract and make object classification more accurate. Combining the two therefore improves both detection accuracy and localization precision.
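One way to picture combining layers of different resolution is to bring them to a common size and stack them per location. The sketch below is only an illustration under assumptions: it uses 2x max pooling to shrink the shallow map and nearest-neighbour upsampling to enlarge the deep map (a stand-in for a learned upsampling layer), one channel per layer, and hypothetical helper names.

```python
def max_pool_2x(fm):
    """Halve the resolution of a 2-D map with 2x2 max pooling."""
    return [[max(fm[2 * i][2 * j], fm[2 * i][2 * j + 1],
                 fm[2 * i + 1][2 * j], fm[2 * i + 1][2 * j + 1])
             for j in range(len(fm[0]) // 2)]
            for i in range(len(fm) // 2)]


def upsample_2x(fm):
    """Double the resolution by repeating each value (nearest neighbour)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def hyper_feature(shallow, middle, deep):
    """Stack three layers at the middle layer's resolution."""
    a = max_pool_2x(shallow)
    c = upsample_2x(deep)
    if not (len(a) == len(middle) == len(c)
            and len(a[0]) == len(middle[0]) == len(c[0])):
        raise ValueError("layers do not align after resizing")
    # Each location now holds one value per source layer.
    return [[(a[i][j], middle[i][j], c[i][j])
             for j in range(len(middle[0]))]
            for i in range(len(middle))]


shallow = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
middle = [[0.1, 0.2], [0.3, 0.4]]
deep = [[9]]
hyper = hyper_feature(shallow, middle, deep)
print(hyper[0][0])  # (6, 0.1, 9)
```

Each location of the combined map carries one entry from the fine-grained shallow layer and one from the abstract deep layer, which is the property the paragraph above relies on.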

1.6 SDP-CRC [6]

SDP-CRC proposes two strategies, one to handle objects of different scales and one to improve the computational efficiency of processing candidate regions. The first strategy is pooling that depends on the scale of the candidate region, namely scale-dependent pooling (SDP). In a CNN, the input image passes through many convolutions, so on the output of the last convolutional layer the features of small objects no longer describe the object well. Features from an earlier layer describe small objects better, while features from a later layer describe larger objects better.

The idea of SDP is therefore to select the convolutional layer that matches the size of the object. For example, if the height of a candidate region is between 0 and 64 pixels, features from the third convolutional block (for example, conv3 in VGG) are pooled as the input features for the classifier and bounding-box regressor; if the candidate region is taller than 128 pixels, the last convolutional block (for example, conv5 in VGG) is used for classification and regression.
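The layer-selection rule amounts to a lookup on the height of the candidate region. In the sketch below, the 64- and 128-pixel thresholds and the conv3/conv5 choices come from the text; mapping the range in between to conv4 is an assumption made here to complete the example, and `select_pooling_layer` is a hypothetical name.

```python
def select_pooling_layer(roi_height):
    """Pick the VGG block whose features are pooled for this candidate."""
    if roi_height < 64:
        return "conv3"   # small objects: earlier, higher-resolution features
    if roi_height < 128:
        return "conv4"   # assumed middle branch, not stated in the text
    return "conv5"       # large objects: last convolutional block


for height in (40, 100, 300):
    print(height, select_pooling_layer(height))
```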

The second strategy is a cascade of classifiers that discard negative samples, namely cascaded rejection classifiers (CRC). One bottleneck of Fast R-CNN is the large number of candidate regions: running the full classification and regression computation for thousands of candidate regions is time-consuming. CRC quickly rejects candidate regions that clearly do not contain an object, so that the full computation is spent only on the regions most likely to contain one. The paper uses AdaBoost: it uses the features of each convolutional layer in turn and rejects negative samples with a cascade of weak classifiers. The remaining candidate regions are then classified and regressed on the feature map of the last convolutional layer.
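The saving from a rejection cascade comes from evaluating later stages only on the survivors of earlier ones. The sketch below shows that control flow only: the stage scores are made-up numbers standing in for the outputs of trained weak classifiers, and `cascade_reject` and the 0.5 thresholds are hypothetical.

```python
def cascade_reject(candidates, stages):
    """Run candidates through (score_fn, threshold) stages in order.

    Returns the surviving candidates and the number of stage evaluations,
    so the saving over scoring everything at every stage is visible.
    """
    survivors = list(candidates)
    evaluations = 0
    for score_fn, threshold in stages:
        evaluations += len(survivors)
        # Drop candidates that this stage considers clear negatives.
        survivors = [c for c in survivors if score_fn(c) >= threshold]
    return survivors, evaluations


# Made-up per-stage scores for four candidate regions.
candidates = [
    {"id": 0, "scores": [0.9, 0.8, 0.7]},
    {"id": 1, "scores": [0.2, 0.9, 0.9]},
    {"id": 2, "scores": [0.8, 0.3, 0.9]},
    {"id": 3, "scores": [0.7, 0.6, 0.1]},
]
stages = [(lambda c, i=i: c["scores"][i], 0.5) for i in range(3)]

survivors, evaluations = cascade_reject(candidates, stages)
print([c["id"] for c in survivors], evaluations)  # [0] 9
```

Here 9 stage evaluations are needed instead of the 12 that scoring every candidate at every stage would take, and only one region is left for the full classification and regression step.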

The accuracy of SDP-CRC is much higher than that of Fast R-CNN, and the detection time is reduced to 471 ms per frame.

2. Methods that predict bounding boxes directly without region proposals

2.1 YOLO [7]

The idea of YOLO is to drop the intermediate step of generating candidate regions and to regress bounding boxes and predict the corresponding class probabilities directly with a single convolutional neural network. The method divides the input image into an $S \times S$ grid. Each grid cell predicts $B$ bounding boxes and a confidence for each of them, with five predictions per box: the center of the bounding box relative to the grid cell, the width and height of the bounding box relative to the entire image, and the confidence of the bounding box (based on its IOU with the ground truth). Each cell also predicts $C$ class probabilities, so the output of the whole network is a tensor of size $S \times S \times (B \times 5 + C)$. In the experiments, $S = 7$, $B = 2$, and $C = 20$, so the output size is $7 \times 7 \times 30$.
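The size of the output follows from simple counting: every cell carries five numbers per box plus one probability per class. The sketch below works this out for a 7 by 7 grid with 2 boxes per cell and 20 classes; these values and the helper names are assumptions of the example.

```python
def yolo_output_shape(grid, boxes_per_cell, num_classes):
    """Shape of the prediction tensor: grid x grid x (5 * boxes + classes)."""
    return (grid, grid, boxes_per_cell * 5 + num_classes)


def yolo_num_boxes(grid, boxes_per_cell):
    """Total number of bounding boxes predicted for one image."""
    return grid * grid * boxes_per_cell


shape = yolo_output_shape(grid=7, boxes_per_cell=2, num_classes=20)
print(shape)                 # (7, 7, 30)
print(yolo_num_boxes(7, 2))  # 98
```

The 98 boxes per image computed here match the figure quoted for YOLO in the comparison with SSD later in the article.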

At test time, the class probabilities of a cell are multiplied by the confidence of each of the cell's B bounding boxes, which gives every bounding box a confidence score for each object class.
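This test-time multiplication can be written out directly. The sketch assumes one cell with three classes and two boxes; the numbers are made up, and `class_specific_scores` is a hypothetical name.

```python
def class_specific_scores(class_probs, box_confidences):
    """Per-box, per-class scores: cell class probability times box confidence."""
    return [[conf * p for p in class_probs] for conf in box_confidences]


# One cell: class probabilities are shared by both of its boxes.
scores = class_specific_scores(
    class_probs=[0.5, 0.25, 0.25],
    box_confidences=[0.8, 0.2],
)
for box_scores in scores:
    print([round(s, 3) for s in box_scores])
```

Since the class probabilities are predicted once per cell, both boxes of a cell always rank the classes in the same order and differ only in scale, which is one way to see why a cell can represent only a single class.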

The advantage of YOLO is speed: the 24-layer convolutional network used in the paper reaches 45 frames per second on test images, and a simplified network structure reaches 155 frames per second. The disadvantages of this method are: (1) bounding-box prediction is subject to strong spatial constraints, for example each cell predicts only two bounding boxes and only one class; (2) the method does not detect groups of small objects well, such as a flock of birds; (3) if the aspect ratio of an object is absent or rare in the training data, the model generalizes poorly to it.

2.2 G-CNN [8]

G-CNN treats object detection as the problem of gradually moving detection boxes from a fixed grid to the true bounding boxes of the objects. This is an iterative process in which the boxes are updated over several steps.

The initial detection boxes are grids over the whole image at different scales. The image is convolved to obtain a feature map; the features corresponding to each initial box are converted into a fixed-size feature map with the Fast R-CNN method; and regression then yields a more accurate box. This new box is used as the initial box for another iteration, and the boxes obtained after several iterations are the output.

G-CNN uses about 180 initial boxes. After 5 iterations, the detection speed is around 3 fps, and the accuracy is better than that of Fast R-CNN.
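The iterative scheme, start from grid boxes and repeatedly apply a regression step, can be sketched as a loop. Everything specific here is an assumption for illustration: the grid is a single non-overlapping 2 by 2 layout, and `toy_step` is a stand-in that moves a box halfway toward a known target, whereas in G-CNN the update is predicted by the network from pooled features.

```python
def grid_boxes(width, height, n):
    """Split the image into an n x n grid of boxes (x1, y1, x2, y2)."""
    cw, ch = width / n, height / n
    return [(x * cw, y * ch, (x + 1) * cw, (y + 1) * ch)
            for y in range(n) for x in range(n)]


def refine(boxes, regress_step, iterations=5):
    """Feed each refined box back in as the next iteration's initial box."""
    for _ in range(iterations):
        boxes = [regress_step(b) for b in boxes]
    return boxes


def toy_step(box, target=(40.0, 30.0, 120.0, 90.0), rate=0.5):
    """Stand-in regressor: move each coordinate halfway to a fixed target."""
    return tuple(b + rate * (t - b) for b, t in zip(box, target))


initial = grid_boxes(200, 200, n=2)
refined = refine(initial, toy_step, iterations=5)
print(refined[0])  # (38.75, 29.0625, 119.375, 90.3125)
```

Each pass only needs to make a modest correction, so the boxes approach the target step by step instead of in one large regression.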

2.3 SSD [9]

SSD also uses a single convolutional neural network to convolve the image, and at each location of the feature map it predicts a set of bounding boxes of different scales and aspect ratios. At test time, the network predicts, for each bounding box, the likelihood that it contains each class of object, and adjusts the bounding box to fit the shape of the object.

During training, SSD needs only the input image and the ground-truth bounding boxes of the objects in it. Different convolutional layers output feature maps of different scales (e.g., $8 \times 8$ and $4 \times 4$). At each location on the feature maps of several layers, and for each of several (e.g., 4) default bounding boxes, the network computes a confidence for every object class and the offset of the object's true bounding box from the default bounding box. For a feature map of size $m \times n$, with $k$ default boxes per location and $c$ classes, a total of $(c + 4)kmn$ outputs is produced. This is somewhat similar to the anchors in Faster R-CNN, except that here the concept is applied to feature maps of different resolutions.

[Figure: comparison of the SSD and YOLO network structures]
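The number of outputs per feature map is again a counting exercise: every default box at every location needs one confidence per class plus four box offsets. The sketch below assumes a feature map of m by n locations, k default boxes per location, and c classes; the example values (8 by 8 and 4 by 4 maps, 4 boxes, 21 classes) and the helper name are illustrative assumptions.

```python
def ssd_outputs(m, n, k, c):
    """Outputs for one m x n feature map: (classes + 4 offsets) per default box."""
    return (c + 4) * k * m * n


# Two feature maps of different resolution, 4 default boxes, 21 classes.
for size in (8, 4):
    boxes = 4 * size * size
    print(size, boxes, ssd_outputs(size, size, k=4, c=21))
# 8 256 6400
# 4 64 1600
```

Summing the box counts over all prediction layers gives the total number of boxes the network scores per image, which is why SSD evaluates far more boxes than a single-grid design.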

On the VOC 2007 test set, with a 300×300 input image, SSD achieves 72.1% mAP at 58 frames per second and predicts more than 7k bounding boxes, whereas YOLO predicts only 98.

[Table: performance comparison of the algorithms above]

References

[1] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

[2] Girshick, Ross. "Fast R-CNN." ICCV 2015.

[3] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.

[4] Bell, Sean, et al. "Inside-Outside Net: Detecting objects in context with skip pooling and recurrent neural networks." arXiv preprint arXiv:1512.04143 (2015).

[5] Kong, Tao, et al. "HyperNet: Towards accurate region proposal generation and joint object detection." arXiv preprint arXiv:1604.00600 (2016).

[6] Yang, Fan, Wongun Choi, and Yuanqing Lin. "Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers." CVPR 2016.

[7] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).

[8] Najibi, Mahyar, Mohammad Rastegari, and Larry S. Davis. "G-CNN: An iterative grid based object detector." arXiv preprint arXiv:1512.07729 (2015).

[9] Liu, Wei, et al. "SSD: Single shot multibox detector." arXiv preprint arXiv:1512.02325 (2015).
