Object detection is a simple task for people, but a computer sees only arrays of values between 0 and 255, so it is hard to derive a high-level semantic concept such as a person or a cat directly from the image, let alone determine which region of the image the object occupies. The object may appear at any position in the image, its shape can vary widely, and image backgrounds differ greatly. Together these factors make object detection far from an easy problem to solve.
Thanks to deep learning, chiefly convolutional neural networks (CNNs) and candidate-region algorithms, object detection has made great breakthroughs since 2014. This article analyzes and summarizes deep-learning-based object detection algorithms. It is divided into four parts: the first part introduces the pipeline of traditional object detection; the second part introduces the detection frameworks represented by R-CNN, which combine region proposals with CNN classification (R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN); the third part introduces the frameworks represented by YOLO, which cast object detection as a regression problem (YOLO, SSD); the fourth part introduces some techniques and methods that can further improve detection performance.
1、 Traditional object detection methods
The traditional object detection pipeline is generally divided into three stages: first, select candidate regions on the given image; then extract features from these regions; and finally classify the regions with trained classifiers. We now introduce the three stages in turn.
1) Region selection
This step locates the object. Because the object may appear at any position in the image, and its size and aspect ratio are also unknown, a sliding-window strategy is initially used to traverse the whole image, with multiple scales and aspect ratios. Although this exhaustive strategy covers all possible object positions, its drawbacks are obvious: the time complexity is very high and most windows are redundant, which severely hurts the speed and performance of the subsequent feature extraction and classification. (In practice, because of the time complexity, the aspect ratio of the sliding window is usually fixed, so for multi-class detection where aspect ratios vary widely, even an exhaustive sliding-window traversal cannot yield good regions; see the sketch below.)
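A minimal sketch of this exhaustive strategy; the scales, aspect ratios, and stride are illustrative choices, not values prescribed by any particular detector:

```python
# A minimal sketch of exhaustive sliding-window region selection.
import itertools

def sliding_windows(img_w, img_h, scales=(64, 128, 256),
                    aspect_ratios=(0.5, 1.0, 2.0), stride=16):
    """Yield (x, y, w, h) boxes covering the image at several scales/ratios."""
    for scale, ratio in itertools.product(scales, aspect_ratios):
        w = int(scale * ratio ** 0.5)   # ratio = w / h, area ≈ scale²
        h = int(scale / ratio ** 0.5)
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)

# Even for a modest 640x480 image this enumerates tens of thousands of
# windows, which is exactly the redundancy described above.
print(sum(1 for _ in sliding_windows(640, 480)))
```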
2) Feature extraction
Designing a robust feature is not easy because of the diversity of object shapes, illumination changes, and backgrounds, yet the quality of the extracted features directly affects classification accuracy. (Common features at this stage include SIFT, HOG, etc.)
3) Classifier
Commonly used classifiers include SVM, AdaBoost, etc.
Summary: traditional object detection has two main problems: first, the sliding-window region selection strategy is untargeted, with high time complexity and heavy window redundancy; second, hand-crafted features are not robust to the diversity of appearance changes.
2、 Deep-learning object detection algorithms based on region proposals
How can we solve the two main problems of traditional object detection?
For the sliding-window problem, region proposals provide a good solution. A region proposal method finds, in advance, the positions where objects are likely to appear in the image. Because it exploits texture, edge, color, and other image cues, it can guarantee a high recall while selecting relatively few windows (a few thousand or even a few hundred). This greatly reduces the time complexity of the subsequent stages, and the candidate windows obtained are of higher quality than sliding windows (whose aspect ratios are fixed). Commonly used region proposal algorithms include Selective Search and EdgeBoxes. If you want to know more about region proposals, see the PAMI 2015 paper "What makes for effective detection proposals?". A sketch of extracting proposals follows.
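For illustration, a hedged sketch of extracting Selective Search proposals, assuming the opencv-contrib-python package (which provides cv2.ximgproc) is installed; the image path is a placeholder:

```python
# A minimal sketch of extracting region proposals with Selective Search.
import cv2

img = cv2.imread("input.jpg")  # placeholder path; supply your own image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # the faster, lower-recall mode
rects = ss.process()               # array of (x, y, w, h) candidate boxes

# Keep only the first ~2000 proposals, as R-CNN does.
proposals = rects[:2000]
print(len(proposals))
```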
With the candidate regions in hand, the remaining work is essentially image classification of those regions (feature extraction + classification). Speaking of image classification, one has to mention the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where Professor Geoffrey Hinton, a towering figure in machine learning, led his student Krizhevsky to reduce the top-5 error of the ILSVRC classification task to 15.3% using a convolutional neural network, while the runner-up using traditional methods had a top-5 error as high as 26.2%. Since then, convolutional neural networks have dominated image classification; the top-5 errors of Microsoft's latest ResNet and Google's Inception-v4 models have dropped to around 4% or below, surpassing human ability on this particular task. So using a CNN to classify the candidate regions once they are extracted is a natural choice.
In 2014, RBG (Ross B. Girshick) replaced the sliding windows and hand-crafted features of traditional detection with region proposals + CNN, designed the R-CNN framework, achieved a great breakthrough in object detection, and set off the wave of deep-learning-based object detection.
1)R-CNN(CVPR2014,TPAMI2015)
(Region-based Convolutional Networks for Accurate Object Detection and Segmentation)
The framework diagram above clearly shows R-CNN's detection process:
(1) Input a test image.
(2) Extract about 2000 region proposals from the image using the Selective Search algorithm.
(3) Warp each region proposal to 227 × 227 and feed it into the CNN, taking the output of the CNN's fc7 layer as the feature.
(4) Feed the CNN features extracted from each region proposal into SVMs for classification. (A sketch of steps (3) and (4) follows below.)
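A hedged sketch of steps (3) and (4): torchvision's ImageNet-pretrained AlexNet stands in for the fine-tuned model, and the resulting 4096-dimensional features would then be scored by per-class SVMs (assumed trained separately):

```python
# A hedged sketch of R-CNN feature extraction: warp one proposal to
# 227x227 and approximate the fc7 feature with torchvision's AlexNet.
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()

def extract_feature(image, box):
    """image: (3, H, W) tensor; box: (x, y, w, h) proposal."""
    x, y, w, h = box
    # Warp (ignore aspect ratio) to the fixed CNN input size.
    crop = TF.resized_crop(image, y, x, h, w, [227, 227])
    with torch.no_grad():
        feat = alexnet.avgpool(alexnet.features(crop.unsqueeze(0)))
        feat = torch.flatten(feat, 1)
        # Run all classifier layers except the final 1000-way fc,
        # approximating the fc7 output used by R-CNN.
        for layer in list(alexnet.classifier)[:-1]:
            feat = layer(feat)
    return feat.squeeze(0)  # 4096-d vector, to be scored by the SVMs
```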
A few notes on the framework above:
*The diagram above shows the test pipeline. Before testing, we first need to train the CNN used for feature extraction and the SVMs used for classification: the CNN feature extractor is obtained by fine-tuning a model pre-trained on ImageNet (AlexNet/VGG16), and the SVMs are then trained on the CNN features extracted from the training set.
*Each region proposal is warped to a fixed size because the input to the CNN's fully connected layers must have a fixed dimension.
*The diagram above omits one step: bounding-box regression on the SVM-classified region proposals. Bounding-box regression is a linear regression algorithm that refines a region proposal so that the extracted window matches the target more closely, because the windows extracted by region proposal cannot be as accurate as manual annotations. If a region proposal deviates too far from the target, then even if the classification is correct, the ratio of the intersection to the union (IoU) of the region proposal and the ground truth will be less than 0.5, and the object is still counted as undetected. (An IoU helper is sketched below.)
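A minimal sketch of the IoU criterion mentioned above, for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
# A minimal sketch of IoU (intersection over union): a detection only
# counts if its IoU with the ground-truth box exceeds 0.5.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ≈ 0.14: a miss at the 0.5 threshold
```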
Summary: R-CNN raised the detection result on PASCAL VOC 2007 from DPM-HSC's 34.3% directly to 66% (mAP). Such a big improvement shows the huge advantage of region proposals + CNN.
However, the R-CNN framework still has many problems:
(1) Training is divided into multiple stages with cumbersome steps: fine-tuning the network + training SVMs + training bounding-box regressors.
(2) Training is expensive in time and disk space: 5000 images produce feature files of several hundred GB.
(3) Detection is slow: with a GPU and the VGG16 model, processing one image takes 47 s.
For the speed problem, SPP-Net gives a good solution.
2)SPP-NET(ECCV2014,TPAMI2015)
(Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition)
First, let's see why R-CNN is so slow at detection. One image takes 47 s! Looking closely at the R-CNN framework, we find that after the region proposals are extracted (about 2000 per image), each proposal is treated as a separate image for the subsequent processing (CNN feature extraction + SVM classification). In other words, feature extraction and classification are performed about 2000 times for a single image!
Is there any way to speed this up? It seems there is: these 2000 region proposals are all parts of one image, so we can extract the convolutional features of the whole image once, and then feed the convolutional features of each region proposal into the fully connected layers for the subsequent steps. (For a CNN, most of the computation is spent in the convolution operations, so this saves a great deal of time.) The problem now is that the region proposals have different scales; they cannot be fed directly into the fully connected layers, because the input of a fully connected layer must have a fixed length. SPP-Net solves exactly this problem:
The figure above shows the network structure of SPP-Net. Feed an arbitrary image into the CNN, and we obtain the feature maps after the convolutions (for example, the last convolutional layer of VGG16 is conv5_3, which produces 512 feature maps). The "window" in the figure is the region of the feature map corresponding to one region proposal in the original image. We only need to map the features of these windows of different sizes to the same dimension and use them as the fully connected input, which guarantees that the convolutional features are extracted only once for the whole image. SPP-Net does this with spatial pyramid pooling: each window is divided into 4 × 4, 2 × 2, and 1 × 1 blocks, and max pooling is applied within each block, so that after passing through the SPP layer each window yields a feature vector of length (4 × 4 + 2 × 2 + 1 × 1) × 512 = 21 × 512, which is used as the input to the fully connected layers. A minimal sketch follows.
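A minimal sketch of such an SPP layer, assuming PyTorch; whatever the spatial size of the window, the output length is fixed at 21 × C:

```python
# A minimal sketch of spatial pyramid pooling: pool any H x W window
# to 4x4, 2x2, and 1x1 grids and concatenate, giving (16 + 4 + 1) * C.
import torch
import torch.nn.functional as F

def spp(window_feats, levels=(4, 2, 1)):
    """window_feats: (N, C, H, W) features cropped for one window."""
    n = window_feats.shape[0]
    pooled = [F.adaptive_max_pool2d(window_feats, out).view(n, -1)
              for out in levels]
    return torch.cat(pooled, dim=1)  # (N, 21 * C)

feats = torch.randn(1, 512, 13, 17)   # any spatial size works
print(spp(feats).shape)               # torch.Size([1, 10752]) = 21 * 512
```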
Summary: compared with R-CNN, SPP-Net greatly speeds up object detection, but problems remain:
(1) Training is divided into multiple stages with cumbersome steps: fine-tuning the network + training SVMs + training bounding-box regressors.
(2) SPP-Net freezes the convolutional layers when fine-tuning the network and only fine-tunes the fully connected layers, yet for a new task it is usually necessary to fine-tune the convolutional layers as well. (The features extracted by a classification model focus on high-level semantics, whereas the detection task also needs the precise position of the target.)
To solve these two problems, RBG proposed Fast R-CNN, a simple and fast object detection framework.
3)Fast R-CNN(ICCV2015)
With R-CNN and SPP-Net introduced above, we can look directly at the framework diagram of Fast R-CNN:
Compared with the R-CNN diagram, there are two main differences: first, an ROI pooling layer is added after the last convolutional layer; second, the loss function is a multi-task loss, which brings bounding-box regression directly into the CNN for training.
(1) The ROI pooling layer is actually a simplified version of SPP-Net. SPP-Net pools each proposal with pyramids of several sizes, while the ROI pooling layer only pools down to a single 7 × 7 feature map. For the VGG16 network's conv5_3 there are 512 feature maps, so each region proposal yields a fixed 7 × 7 × 512-dimensional feature vector as the input to the fully connected layers. (Sketched below.)
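A hedged sketch using torchvision's off-the-shelf roi_pool op; the feature map, boxes, and stride here are illustrative:

```python
# A minimal sketch of ROI pooling: each proposal, whatever its size,
# is pooled to a fixed 7x7 grid.
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 512, 40, 60)   # conv5_3-style feature map
# Boxes in image coordinates: (batch_idx, x1, y1, x2, y2).
rois = torch.tensor([[0, 16.0, 16.0, 320.0, 240.0],
                     [0, 100.0, 50.0, 400.0, 300.0]])
# spatial_scale maps image coords to feature-map coords (stride 16 for VGG16).
out = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(out.shape)                     # torch.Size([2, 512, 7, 7])
```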
(2) R-CNN's training is divided into multiple stages, while Fast R-CNN replaces the SVMs with a softmax classifier and brings bounding-box regression into the network through the multi-task loss, so the whole network can be trained end to end (excluding the region proposal extraction stage). (A loss sketch follows.)
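A hedged sketch of such a multi-task loss; for brevity it uses class-agnostic box regression, whereas Fast R-CNN actually predicts per-class box deltas, and the weight lam is illustrative:

```python
# A hedged sketch of a Fast R-CNN-style multi-task loss: softmax
# classification loss plus a smooth-L1 box regression loss applied
# only to foreground (non-background) ROIs.
import torch
import torch.nn.functional as F

def multi_task_loss(cls_scores, bbox_deltas, labels, bbox_targets, lam=1.0):
    """cls_scores: (N, K+1); bbox_deltas/bbox_targets: (N, 4); labels: (N,)."""
    loss_cls = F.cross_entropy(cls_scores, labels)
    fg = labels > 0                      # class 0 is background
    if fg.any():
        loss_box = F.smooth_l1_loss(bbox_deltas[fg], bbox_targets[fg])
    else:
        loss_box = cls_scores.new_zeros(())
    return loss_cls + lam * loss_box
```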
(3) During network fine-tuning, Fast R-CNN also fine-tunes part of the convolutional layers, achieving better detection results.
Summary: Fast R-CNN combines the essence of R-CNN and SPP-Net and introduces the multi-task loss function, making training and testing of the whole network very convenient. Trained on the PASCAL VOC 2007 training set, it reaches 66.9% (mAP) on the VOC 2007 test set; trained on VOC 2007 + 2012, it reaches 70% (enlarging the dataset can substantially improve detection performance). With VGG16, each image takes about 3 s overall.
Shortcomings: the region proposals are extracted with Selective Search, and most of the detection time is spent on them (2–3 s per image for region proposals versus only 0.32 s for feature extraction and classification), which cannot satisfy real-time applications; and the pipeline is not truly end-to-end trainable and testable, since the region proposals are pre-extracted with Selective Search. So is it possible to use a CNN directly both to generate the region proposals and to classify them? The Faster R-CNN framework is exactly such a detector.
4)Faster R-CNN(NIPS2015)
(Faster R-CNN:Towards Real-Time Object Detection with Region Proposal Networks)
In a region proposal + CNN classification framework, the quality of the region proposals directly affects the accuracy of the object detector. If a method could extract only a few hundred or fewer high-quality candidate windows while keeping recall high, it would both speed up detection and improve its performance (fewer false positives). The RPN (Region Proposal Network) was born for exactly this.
The core idea of RPN is to generate region proposals directly with a convolutional neural network; essentially this is still a sliding window. RPN's design is ingenious: it only needs to slide once over the last convolutional layer, because region proposals of multiple scales and aspect ratios can be obtained through the anchor mechanism and bounding-box regression.
Let's look directly at the RPN network structure diagram above (using the ZF model). Given an input image (say 600 × 1000), convolution gives us the feature map of the last convolutional layer (about 40 × 60). A 3 × 3 convolution on this feature map produces a 256-dimensional feature vector at each position, followed by a cls layer and a reg layer for classification and bounding-box regression respectively (similar to Fast R-CNN, except that here there are only two classes: object and background). Each position of the 3 × 3 sliding window simultaneously predicts region proposals at three scales (128, 256, 512) and three aspect ratios (1:1, 1:2, 2:1) of the input image; this mapping mechanism is called an anchor. So for this 40 × 60 feature map there are about 20000 anchors in total (40 × 60 × 9), i.e., about 20000 region proposals are predicted. A minimal anchor-generation sketch follows.
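A minimal sketch of anchor generation under these settings (stride 16, three scales, three aspect ratios):

```python
# A minimal sketch of anchor generation: at every feature-map cell,
# place 9 anchors (3 scales x 3 aspect ratios) centered on the
# corresponding input-image position.
import numpy as np

def make_anchors(feat_h=40, feat_w=60, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for cy in (np.arange(feat_h) + 0.5) * stride:
        for cx in (np.arange(feat_w) + 0.5) * stride:
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)  # (feat_h * feat_w * 9, 4) boxes

print(make_anchors().shape)   # (21600, 4): the ~20000 anchors in the text
```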
What are the benefits of this design? Although it is still a sliding-window strategy, the sliding window now operates on the convolutional feature map, whose resolution is 16 × 16 times lower than that of the original image (after four 2 × 2 pooling operations in between); multi-scale coverage uses 9 anchors per position, corresponding to three scales and three aspect ratios, with bounding-box regression attached afterwards, so even windows whose shapes fall outside the 9 anchors can still yield region proposals close to the target.
The NIPS 2015 version of Faster R-CNN uses a detection framework in which the RPN network and the Fast R-CNN network are separate. The overall flow is the same as Fast R-CNN, except that the region proposals are now extracted by the RPN (instead of Selective Search). To make the RPN and the Fast R-CNN network share weights at the convolutional layers, the authors trained RPN and Fast R-CNN with a four-stage training scheme:
(1) Initialize the network parameters with a model pre-trained on ImageNet and fine-tune the RPN;
(2) Train a Fast R-CNN network using the region proposals extracted by the RPN in (1);
(3) Re-initialize the RPN with the Fast R-CNN network from (2), fix the shared convolutional layers, and fine-tune the RPN;
(4) Fix the convolutional layers of the Fast R-CNN from (2) and fine-tune the rest of the network using the region proposals extracted by the RPN in (3).
After the weights are shared, both RPN and Fast R-CNN improve in object detection accuracy.
With a trained RPN, given a test image we can directly obtain the region proposals after bounding-box regression. Sorting them by the RPN's class scores and taking the top 300 windows as the input to Fast R-CNN for detection, and training on the VOC 07+12 training set, the mAP on the VOC 2007 test set reaches 73.2% (versus 70% for Selective Search + Fast R-CNN), and detection runs at 5 frames per second (versus 2–3 s per image for Selective Search + Fast R-CNN).
Note that the latest version has merged the RPN and the Fast R-CNN network: the proposals produced by the RPN are fed directly into the ROI pooling layer. This is a truly end-to-end object detection framework built entirely on a CNN.
Summary: Faster R-CNN finally unifies region proposal generation and CNN classification, which had long been separate, and detects objects with an end-to-end network. However, it still obtains the region proposals first and then classifies each proposal, so the computation remains substantial. Fortunately, the appearance of object detection methods like YOLO makes real-time detection possible.
Overall, along the path from R-CNN through SPP-Net and Fast R-CNN to Faster R-CNN, deep-learning-based object detection has become ever more streamlined, more accurate, and faster. The region-proposal-based R-CNN family is arguably the most important branch of object detection today.
3、 Deep-learning object detection algorithms based on regression
Faster R-CNN is currently the mainstream object detection method, but its speed still falls short of real-time requirements, so methods such as YOLO have gradually gained importance. These methods use the idea of regression: given the input image, they directly regress the bounding box and object class at multiple positions of the image.
1)YOLO(CVPR2016,oral)
(You Only Look Once: Unified, Real-Time Object Detection)
Let's take a look at YOLO's detection pipeline above:
(1) Given an input image, first divide it into a 7 × 7 grid.
(2) For each grid cell, predict 2 bounding boxes (including, for each box, the confidence that it contains an object, and, for the cell, the probabilities over the classes).
(3) From the previous step, 7 × 7 × 2 candidate windows are predicted; windows with low confidence are then removed by thresholding, and finally redundant windows are removed with NMS (sketched below).
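A minimal sketch of non-maximum suppression (NMS), reusing the iou() helper from the earlier sketch:

```python
# A minimal sketch of NMS: repeatedly keep the highest-scoring box and
# discard remaining boxes that overlap it too much.
def nms(boxes, scores, iou_thresh=0.5):
    """boxes: list of (x1, y1, x2, y2); scores: matching confidences."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```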
As you can see, the whole pipeline is very simple: location and class are obtained by direct regression, without an intermediate region proposal step to find candidate objects.
So how can the location and class of a target be regressed directly on grid cells at different positions? The figure above shows YOLO's network structure. The front of the network resembles the GoogLeNet model; the key part is the last two layers: the convolutional layers are followed by a 4096-dimensional fully connected layer, which is then fully connected to a 7 × 7 × 30-dimensional tensor. Here 7 × 7 is the number of grid cells, and on each cell we predict 2 possible object boxes together with their confidences and the cell's class probabilities; that is, each cell predicts two targets. The information per box consists of 4 coordinate values (center coordinates plus width and height) and 1 objectness confidence, and the cell additionally predicts 20 class probabilities (the 20 VOC classes), giving a (4 + 1) × 2 + 20 = 30-dimensional vector in total. In this way the information needed for detection (box plus class) is regressed directly on each grid cell from the preceding 4096-dimensional whole-image feature. A decoding sketch follows.
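A hedged sketch of decoding such a 7 × 7 × 30 output tensor; the layout (4 coordinates + 1 confidence per box, then 20 class scores) follows the text, though real implementations differ in normalization details:

```python
# A hedged sketch of decoding a YOLO-style S x S x (B*5 + C) output.
import numpy as np

S, B, C = 7, 2, 20
out = np.random.rand(S, S, B * 5 + C)  # stand-in for the network output

detections = []
for row in range(S):
    for col in range(S):
        cell = out[row, col]
        class_probs = cell[B * 5:]            # 20 class probabilities
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            # class-specific confidence = objectness * class probability
            scores = conf * class_probs
            cls = int(np.argmax(scores))
            detections.append((row, col, (x, y, w, h), cls, float(scores[cls])))

print(len(detections))  # 7 * 7 * 2 = 98 candidate windows, as in the text
```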
Summary: YOLO recasts object detection as a regression problem, which greatly speeds up detection and lets YOLO process 45 images per second. Moreover, because each grid cell predicts object windows using whole-image features, the proportion of false positives drops significantly (ample contextual information). But YOLO also has problems: without a region proposal mechanism, regressing only on the 7 × 7 grid makes localization rather coarse, which is also why YOLO's detection accuracy is not high.
2) SSD
(SSD:Single Shot MultiBox Detector)
From the above analysis of YOLO's weaknesses, regressing from whole-image features on a coarse 7 × 7 grid cannot localize objects very precisely. Would combining some of the region proposal thinking achieve finer localization? SSD does exactly this by combining YOLO's regression idea with Faster R-CNN's anchor mechanism.
The figure above is the SSD framework diagram. Like YOLO, SSD obtains the object location and class via regression; but unlike YOLO, which predicts a location from whole-image features, SSD predicts each location from the features around it (which feels more reasonable). How is the correspondence between a location and its features established? As you may have guessed, with Faster R-CNN's anchor mechanism, as shown in SSD's framework diagram: if the size of some layer's feature map (figure b) is 8 × 8, a 3 × 3 sliding window extracts the feature at each position, and this feature is regressed to the coordinate offsets and class scores (figure c). A sketch of such a head follows.
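A hedged sketch of an SSD-style prediction head, assuming PyTorch; the channel count and anchor count are illustrative:

```python
# A hedged sketch of an SSD-style head: a 3x3 convolution over a
# feature map predicts, at every position and for each of
# `num_anchors` default boxes, 4 offsets and per-class scores.
import torch
import torch.nn as nn

num_anchors, num_classes = 6, 21  # illustrative (20 classes + background)

loc_head = nn.Conv2d(512, num_anchors * 4, kernel_size=3, padding=1)
cls_head = nn.Conv2d(512, num_anchors * num_classes, kernel_size=3, padding=1)

feat = torch.randn(1, 512, 8, 8)   # the 8x8 feature map from the text
loc = loc_head(feat)                # (1, 24, 8, 8) box offsets
cls = cls_head(feat)                # (1, 126, 8, 8) class scores
print(loc.shape, cls.shape)
# The same kind of head is applied to feature maps of several layers,
# which is how SSD obtains multi-scale predictions.
```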
Unlike Faster R-CNN, these anchors are placed on multiple feature maps, so multi-layer features are exploited and multi-scale detection comes naturally (the 3 × 3 sliding windows on feature maps of different layers have different receptive fields).
Summary: SSD combines YOLO's regression idea with Faster R-CNN's anchor mechanism and regresses from multi-scale local features at every position of the image, which both preserves YOLO's speed and makes the predictions about as accurate as Faster R-CNN's. SSD reaches 72.1% mAP on VOC 2007, and runs at 58 FPS on a GPU.
Summary: YOLO brought a new line of thinking to object detection, and SSD's performance shows that real-time object detection in practical applications is genuinely possible.
4、 Methods for improving object detection
The R-CNN series and the YOLO series give us two basic frameworks for object detection. On top of these frameworks, researchers have proposed a series of methods to improve detection performance.
(1) Hard negative mining
R-CNN uses the idea of hard negative mining when training its SVM classifiers, but Fast R-CNN and Faster R-CNN do not, because of their end-to-end training strategy (they only set the ratio of positive to negative samples and sample randomly).
The CVPR 2016 oral paper "Training Region-based Object Detectors with Online Hard Example Mining"
embeds the hard example mining mechanism into the SGD algorithm, so that during training Fast R-CNN automatically selects suitable region proposals as positive and negative training examples according to their losses. Experiments show that Fast R-CNN with the OHEM (online hard example mining) mechanism improves mAP on VOC 2007 and VOC 2012 by about 4%. (A sketch of the selection rule follows.)
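A minimal sketch of the OHEM selection rule, assuming PyTorch; the batch size and the stand-in losses are illustrative:

```python
# A minimal sketch of OHEM selection: compute per-ROI losses, keep the
# B hardest ROIs, and backpropagate only through those.
import torch

def select_hard_rois(per_roi_loss, batch_size=128):
    """per_roi_loss: (N,) loss of every region proposal in the image."""
    k = min(batch_size, per_roi_loss.numel())
    return torch.topk(per_roi_loss, k=k).indices  # train only on these

losses = torch.rand(2000)              # stand-in losses for ~2000 proposals
print(select_hard_rois(losses).shape)  # torch.Size([128])
```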
(2) Multi-layer feature fusion
Fast R-CNN and Faster R-CNN both detect objects using only the features of the last convolutional layer. However, high-level convolutional features have lost many details (due to pooling operations), so their localization is not very precise. Methods such as HyperNet fuse CNN features from multiple layers for detection, exploiting both the semantic information of high-level features and the detailed texture information of low-level features, so detection and localization are more accurate. (A minimal fusion sketch follows.)
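A minimal sketch of fusing a high-level feature map with a low-level one by upsampling and concatenation (HyperNet's actual fusion differs in detail):

```python
# A minimal sketch of multi-layer feature fusion: upsample a coarse
# high-level map to the resolution of a finer low-level map and
# concatenate along the channel axis.
import torch
import torch.nn.functional as F

low = torch.randn(1, 256, 80, 120)    # earlier layer: fine details
high = torch.randn(1, 512, 20, 30)    # last conv layer: strong semantics

high_up = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                        align_corners=False)
fused = torch.cat([low, high_up], dim=1)   # (1, 768, 80, 120)
print(fused.shape)
```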
(3) Using contextual information
When extracting features from a region proposal for detection, combining the context around the proposal often gives better detection results. (Contextual information is used, for example, in the papers "Object detection via a multi-region & semantic segmentation-aware CNN model" and "Inside-Outside Net".)