The previous sections covered what target detection is and how targets are detected, along with the concepts of sliding windows, bounding boxes, IoU, and non-maximum suppression.
This section summarizes current research on target detection and surveys several classic detection algorithms. It focuses on target detection based on deep learning; the following sections discuss each method in detail.
What did traditional target detection look like before deep learning took over?
Traditional target detection typically uses a sliding-window framework, which consists of three steps:
- Use sliding windows of different sizes to frame parts of the image as candidate regions;
- Extract visual features from each candidate region, for example Haar features, which are common in face detection, or HOG features, which are common in pedestrian detection and general target detection;
- Classify each region with a classifier, such as the commonly used SVM model.
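The three steps above can be sketched as follows. This is a minimal illustration rather than a real detector: the feature extractor and classifier are toy stand-ins (mean intensity instead of Haar/HOG, a pass-through score instead of an SVM), and all function names and parameters are assumptions made for the example.

```python
from typing import Callable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_windows(img_w: int, img_h: int,
                    sizes: List[Tuple[int, int]],
                    stride: int) -> Iterator[Box]:
    """Step 1: yield every window of the given sizes that fits in the image."""
    for win_w, win_h in sizes:
        for y in range(0, img_h - win_h + 1, stride):
            for x in range(0, img_w - win_w + 1, stride):
                yield (x, y, win_w, win_h)


def detect(image: List[List[float]],
           extract_features: Callable[[List[List[float]]], List[float]],
           classify: Callable[[List[float]], float],
           sizes: List[Tuple[int, int]],
           stride: int = 4,
           threshold: float = 0.5) -> List[Tuple[Box, float]]:
    """Candidate window -> features -> classifier score, keeping high scores."""
    img_h, img_w = len(image), len(image[0])
    detections = []
    for (x, y, w, h) in sliding_windows(img_w, img_h, sizes, stride):
        patch = [row[x:x + w] for row in image[y:y + h]]
        score = classify(extract_features(patch))  # steps 2 and 3
        if score >= threshold:
            detections.append(((x, y, w, h), score))
    return detections


# Toy stand-ins (assumptions): mean intensity plays the role of the feature,
# and the feature value itself plays the role of the classifier score.
def mean_intensity(patch: List[List[float]]) -> List[float]:
    values = [v for row in patch for v in row]
    return [sum(values) / len(values)]


def toy_classifier(features: List[float]) -> float:
    return features[0]


# 8x8 image with a bright 4x4 square in the lower-right corner.
image = [[1.0 if (x >= 4 and y >= 4) else 0.0 for x in range(8)]
         for y in range(8)]
hits = detect(image, mean_intensity, toy_classifier, sizes=[(4, 4)], stride=4)
print(hits)
```

In a real system the stand-ins would be replaced by a HOG or Haar feature extractor and a trained SVM, and overlapping detections would then be filtered with non-maximum suppression.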
Among traditional detectors, the multi-scale deformable part model, DPM (Deformable Part Models) [13], stands out: it won the VOC (Visual Object Classes) detection challenge every year from 2007 to 2009, and in 2010 its author Pedro Felzenszwalb was awarded the "Lifetime Achievement Award" by VOC. DPM treats an object as a set of parts (for a face, the nose, the mouth, and so on) and describes the object through the relations between those parts, which matches the non-rigid nature of many real-world objects. DPM can be seen as an extension of HOG+SVM that inherits the advantages of both. It achieved good results in face detection, pedestrian detection and other tasks, but it is relatively complex and slow at detection time, which prompted many improved variants. While researchers were busy improving DPM, target detection based on deep learning emerged and quickly overtook it, and many researchers who had studied traditional detection algorithms turned to deep learning.
Even after deep-learning-based target detection took off, results were at first hard to push further. For example, the algorithm in [6] reached only a little over 30% mAP on the VOC 2007 test set, and OverFeat [7] reached only 24.3% mAP on the ILSVRC 2013 test set. R-CNN appeared in 2013 and raised mAP on the VOC 2007 test set to 48%; in 2014, changes to the network structure pushed it to 66%, and mAP on the ILSVRC 2013 test set rose to 31.4%.
R-CNN stands for Region-based Convolutional Neural Networks, a target detection method that combines region proposals with convolutional neural networks (CNNs). Ross Girshick's 2013 work "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation" [1] laid the groundwork for this subfield; a later version of the paper was published at CVPR 2014 [2], and the journal version appeared in PAMI 2015 [3].
In fact, many researchers had tried deep learning for target detection before R-CNN, OverFeat [7] among them, but R-CNN was the first solution that was truly fit for industrial use. This mirrors the development of deep learning itself: neural networks and convolutional networks are not new concepts, but they only became practical in this century, and once they were viable, their rapid growth was no surprise.
Research in the area opened up by R-CNN is currently very active, and has produced R-CNN [1,2,3,18], SPP-net [4,19], Fast R-CNN [14,20], Faster R-CNN [5,21], R-FCN [16,24], YOLO [15,22], SSD [17,23] and other work. Ross Girshick is the leading figure of this line of research: R-CNN, Fast R-CNN, Faster R-CNN and YOLO are all connected to him. These innovations often combine traditional vision techniques with deep learning, such as selective search and image pyramids.
Deep-learning-based target detection methods can be broadly divided into two camps:
- Region-proposal-based methods, such as R-CNN, SPP-net, Fast R-CNN, Faster R-CNN and R-FCN;
- End-to-end methods that need no region proposals, such as YOLO and SSD.
At present, region-proposal-based methods still have the upper hand, but end-to-end methods have a clear speed advantage, and how the field develops remains to be seen.
I. Related work
Since this is a review of target detection, we first look at selective search, a region proposal method widely used in target detection, and at OverFeat, an early work on target detection with deep learning.
1. Selective search
The first step in target detection is region proposal: finding possible regions of interest (ROIs). Region proposal resembles segmentation in optical character recognition (OCR). OCR segmentation commonly relies on over-segmentation: cut the image into connected components that are as small as possible (small strokes, for example), then merge adjacent pieces according to their morphological features. But the objects in target detection differ greatly from those in OCR: their shapes are irregular and their sizes vary, so region proposal can be considered a harder problem than OCR segmentation.
Possible region proposal approaches include:
- Sliding window. A sliding window is essentially exhaustive search: windows of different scales and aspect ratios enumerate every possible block, large and small, and each block is sent to a recognizer, keeping those recognized with high probability. The complexity of this approach is clearly too high, and it produces a large number of redundant candidate regions, so it is not feasible in practice.
- Regular blocks. This prunes the exhaustive approach by choosing only fixed sizes and aspect ratios. It works very well in some specific scenarios, such as Chinese character detection in the photo-search app Xiaoyuan Souti: Chinese characters are square and their aspect ratios are mostly uniform, so regular blocks are a suitable choice for region proposal there. For general target detection, however, regular blocks still have to visit a great many positions, and the complexity remains high.
- Selective search. In machine learning terms, the previous methods have good recall but unsatisfactory precision, so the crux is how to remove redundant candidate regions effectively. Most redundant candidate regions overlap, and selective search exploits this by merging adjacent overlapping regions bottom-up, which reduces redundancy.
These three are not the only region proposal methods. The area is very flexible and has many variants; interested readers may wish to consult [12].
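To make the complexity argument concrete, the sketch below counts candidate windows for the exhaustive sliding-window approach and for a single fixed-size regular block. The image size, stride, scales and aspect ratios are assumptions chosen only for illustration.

```python
def count_windows(img_w: int, img_h: int,
                  win_w: int, win_h: int, stride: int) -> int:
    """Number of positions at which a win_w x win_h window fits in the image."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


# Assumed setup: a 500x375 image scanned with a stride of 8 pixels.
IMG_W, IMG_H, STRIDE = 500, 375, 8

# Exhaustive sliding window: several scales, each with several aspect ratios.
scales = [32, 64, 128, 256]   # side length of the square window, in pixels
ratios = [0.5, 1.0, 2.0]      # width / height
exhaustive = 0
for s in scales:
    for r in ratios:
        win_w = int(round(s * r ** 0.5))
        win_h = int(round(s / r ** 0.5))
        exhaustive += count_windows(IMG_W, IMG_H, win_w, win_h, STRIDE)

# Regular block: one fixed size and aspect ratio only.
regular = count_windows(IMG_W, IMG_H, 64, 64, STRIDE)

print("exhaustive sliding window:", exhaustive)
print("single regular block     :", regular)
```

Even the single regular block visits a couple of thousand positions under these assumptions, and every added scale or aspect ratio multiplies the work, which is why pruning the candidate set matters so much.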
The details of selective search [8] are shown in Algorithm 1. In short, selective search is an iterative process that merges candidate regions bottom-up.
Algorithm 1: selective search
Input: an image
Output: set of candidate target locations L
1: Obtain the initial candidate regions R = {r1, r2, ..., rn} by over-segmentation
2: Initialize the similarity set S = ∅
3: foreach pair of neighboring regions (ri, rj) do
4:     Compute the similarity s(ri, rj)
5:     S = S ∪ s(ri, rj)
6: while S ≠ ∅ do
7:     Get the highest similarity s(ri, rj) = max(S)
8:     Merge the corresponding regions rt = ri ∪ rj
9:     Remove all similarities involving ri: S = S \ s(ri, r*)
10:    Remove all similarities involving rj: S = S \ s(r*, rj)
11:    Compute the similarity set St between rt and its neighbors
12:    S = S ∪ St
13:    R = R ∪ rt
14: L = bounding boxes of all regions in R
The algorithm makes clear that regions in R are merged, which removes a lot of redundancy and in effect raises precision. Recall still has to be preserved, though, so the similarity strategy in Algorithm 1 is critical. With a single simple strategy, dissimilar regions are easily merged by mistake; for example, if only contours are considered, regions of different colors can be merged wrongly. Selective search diversifies its strategies to increase the number of candidate regions and so protect recall: it works in several color spaces (RGB, grayscale, HSV and its variants), and its similarity measure takes into account color similarity as well as texture, size and overlap.
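The merging loop of Algorithm 1 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published implementation: each region carries a single scalar color in [0, 1] instead of color and texture histograms, two regions count as neighbors when their bounding boxes touch or overlap, and the similarity is just a color term plus a size term. All names are hypothetical.

```python
from itertools import combinations


def bbox_union(a, b):
    """Smallest box (x1, y1, x2, y2) that contains both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def are_neighbors(a, b):
    """Assumption: regions are neighbors if their boxes touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def similarity(r1, r2, image_area):
    color_sim = 1.0 - abs(r1["color"] - r2["color"])
    # The size term favors merging small regions first.
    size_sim = 1.0 - (r1["size"] + r2["size"]) / image_area
    return color_sim + size_sim


def merge(r1, r2):
    size = r1["size"] + r2["size"]
    color = (r1["color"] * r1["size"] + r2["color"] * r2["size"]) / size
    return {"box": bbox_union(r1["box"], r2["box"]), "size": size, "color": color}


def selective_search(initial_regions, image_area):
    regions = dict(enumerate(initial_regions))  # R: every region ever created
    active = set(regions)                       # regions not yet merged away
    sims = {}                                   # S: (i, j) -> similarity
    for i, j in combinations(sorted(active), 2):
        if are_neighbors(regions[i]["box"], regions[j]["box"]):
            sims[(i, j)] = similarity(regions[i], regions[j], image_area)
    next_id = len(regions)
    while sims:
        i, j = max(sims, key=sims.get)          # most similar pair
        regions[next_id] = merge(regions[i], regions[j])
        # Remove every similarity that involves ri or rj.
        sims = {pair: s for pair, s in sims.items()
                if i not in pair and j not in pair}
        active -= {i, j}
        # Similarities between the merged region and its neighbors.
        for k in active:
            if are_neighbors(regions[k]["box"], regions[next_id]["box"]):
                sims[(k, next_id)] = similarity(
                    regions[k], regions[next_id], image_area)
        active.add(next_id)
        next_id += 1
    return [r["box"] for r in regions.values()]  # L: boxes of all regions in R


# Three side-by-side regions in an 8x4 image: two dark ones and a bright one.
initial = [
    {"box": (0, 0, 2, 4), "size": 8, "color": 0.10},
    {"box": (2, 0, 4, 4), "size": 8, "color": 0.15},
    {"box": (4, 0, 8, 4), "size": 16, "color": 0.90},
]
boxes = selective_search(initial, image_area=32)
print(boxes)
```

In this toy run the two dark regions merge first because their colors are close, and the result then merges with the bright region, so the output contains the three initial boxes plus the two merged ones. Keeping the intermediate regions in the output is what lets the method propose candidates at several scales.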
Overall, selective search is a fairly simple region proposal method. It was widely used by early deep-learning-based target detection methods (R-CNN among them) but has been abandoned by newer methods.
2. OverFeat
OverFeat [7][9] is a classic work that uses a CNN for classification, localization and detection in a unified way. It comes from the New York University team of Yann LeCun, one of the leading figures in deep learning [10]. OverFeat also won Task 3 (classification + localization) of ILSVRC 2013.
OverFeat has three core ideas:
- Region proposal: multi-scale sliding windows, which combine the sliding window and regular block approaches;
- Classification and localization: a single CNN is used both to classify and to predict bounding box positions. The model is similar to AlexNet [12]: layers 1-5 are feature extraction layers that convert the image into a fixed-dimension feature vector, and layers 6-9 are classification layers (task-specific). The different tasks (classification, localization, detection) share the feature extraction layers (1-5) and replace only layers 6-9;
- Accumulation: because sliding windows are used, the same object is covered at multiple positions, that is, from multiple views; because multiple scales are used, the same object is covered by blocks of different sizes. The classification confidences over these positions and block sizes are accumulated, which makes the decision more accurate.
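The accumulation idea can be illustrated with a small sketch. It simply sums the per-class confidences over all windows, across positions and scales, and picks the class with the largest total. This is one simple way to accumulate evidence and is not claimed to be OverFeat's exact procedure; the scores and names below are made up for the example.

```python
from collections import defaultdict


def accumulate_confidence(window_scores):
    """Sum per-class confidences over all windows (positions and scales)."""
    totals = defaultdict(float)
    for _scale, _position, scores in window_scores:
        for label, confidence in scores.items():
            totals[label] += confidence
    best = max(totals, key=totals.get)
    return best, dict(totals)


# (scale, window position, class confidences) for four hypothetical windows.
window_scores = [
    (1.0, (0, 0), {"cat": 0.6, "dog": 0.4}),
    (1.0, (16, 0), {"cat": 0.3, "dog": 0.7}),
    (1.5, (0, 0), {"cat": 0.8, "dog": 0.2}),
    (1.5, (24, 8), {"cat": 0.7, "dog": 0.3}),
]
label, totals = accumulate_confidence(window_scores)
print(label, totals)
```

One window on its own favors "dog", but the evidence pooled over all positions and scales favors "cat", which is the sense in which accumulation makes the decision more robust.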
The key steps of OverFeat are as follows:
- Use sliding windows to make region proposals at multiple scales, then classify each region with the CNN model to obtain a category and a confidence. The number and types of objects detected differ considerably from scale to scale;
- Use multi-scale sliding windows to increase the number of detections and improve classification;
- Use a regression model to predict the position of each object; the larger the image, the more bounding boxes are predicted.
OverFeat is an early work that applies CNNs to target detection. Its main idea is to perform classification, localization and detection with multi-scale sliding windows; although these are separate tasks, they reuse the front layers of the model, and this idea of model reuse became a classic practice that the later R-CNN series kept and improved.
OverFeat also has many shortcomings, of course: both its speed and its accuracy leave a lot of room for improvement, and the later R-CNN series made large gains in both.
II. Methods based on region proposals
This section covers region-proposal-based methods, including R-CNN, SPP-net, Fast R-CNN, Faster R-CNN and R-FCN.
1. R-CNN
As mentioned earlier, early target detection mostly proposed windows by sliding a window over the image, which is essentially exhaustive search. R-CNN [1,2,3] uses selective search instead.
The main steps of R-CNN are:
- Region proposal: extract about 2000 candidate boxes from the original image with selective search;
- Region size normalization: scale every candidate box to a fixed size (227×227; the original paper uses AlexNet);
- Feature extraction: extract features with the CNN;
- Classification and regression: add two fully connected layers on top of the feature layers, then classify with SVMs and fine-tune the box position and size with linear regression, training one box regressor per category.
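Two of these steps are easy to show in isolation: warping a candidate box to a fixed size, and applying predicted regression offsets to a box. The sketch below is illustrative only. It uses nearest-neighbor resampling on a plain 2-D list, and it applies offsets in the commonly used center/size form (center shifts scaled by box size, width and height scaled by an exponential), which is an assumption here since the text above only says that linear regression fine-tunes position and size. The function names are hypothetical.

```python
import math


def crop_and_warp(image, box, out_size=227):
    """Crop box (x1, y1, x2, y2) from a 2-D list image and resize it to
    out_size x out_size by nearest-neighbor sampling, ignoring aspect ratio."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return [
        [image[y1 + (r * h) // out_size][x1 + (c * w) // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]


def apply_box_regression(box, deltas):
    """Refine a box with predicted offsets (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    new_cx, new_cy = cx + dx * w, cy + dy * h            # shift the center
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)    # rescale the size
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# A 10x8 toy image whose pixel value encodes its position (value = 10*y + x).
image = [[10 * y + x for x in range(10)] for y in range(8)]
patch = crop_and_warp(image, (2, 1, 6, 5))
print(len(patch), len(patch[0]), patch[0][0], patch[-1][-1])

# Shift a 10x20 box to the right by 10% of its width, keeping its size.
refined = apply_box_regression((0, 0, 10, 20), (0.1, 0.0, 0.0, 0.0))
print(refined)
```

Every candidate box goes through the warp and then a full CNN forward pass, so the cost of the pipeline grows linearly with the number of proposals.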
The structure of the target detection system is shown in the figure. Note that step 2 in the figure corresponds to steps 1 and 2 above, that is, it covers both region proposal and region size normalization.
OverFeat can be seen as a special case of R-CNN: replace selective search with multi-scale sliding windows, replace per-category box regression with a single shared box regression, and replace the SVM with a multi-layer network. OverFeat is nevertheless 9 times faster than R-CNN, mainly because it shares computation across convolutions.
R-CNN in fact has many drawbacks:
- Repeated computation: R-CNN is no longer exhaustive, but it still has about 2000 candidate boxes, each of which needs a CNN forward pass. The amount of computation is still very large, and much of it is repeated;
- SVM model: the SVM is a linear model, which is clearly not the best choice when labeled data is not scarce;
- Multi-stage training and testing: region proposal, feature extraction, classification and regression are trained as disconnected stages, and intermediate data has to be saved separately;
- High space and time cost of training: the convolutional features have to be stored on disk, and they take up hundreds of gigabytes;
- Slow: the drawbacks above make R-CNN remarkably slow; processing one image takes 13 seconds on a GPU and 53 seconds on a CPU [2].
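A rough back-of-the-envelope calculation shows how much repeated work is involved. The proposal count, warp size and GPU time come from the text above; the 500×375 image size is an assumption, and counting input pixels is only a crude proxy for the true compute cost.

```python
NUM_PROPOSALS = 2000          # candidate boxes per image (from the text)
WARP_SIZE = 227               # each box is warped to 227x227 (from the text)
SECONDS_PER_IMAGE_GPU = 13    # reported GPU time per image (from the text)
IMG_W, IMG_H = 500, 375       # assumed image size, for illustration only

# Pixels fed through the CNN when every proposal is processed separately,
# compared with the pixels in the image itself.
warped_pixels = NUM_PROPOSALS * WARP_SIZE * WARP_SIZE
image_pixels = IMG_W * IMG_H
redundancy = warped_pixels / image_pixels

images_per_hour = 3600 / SECONDS_PER_IMAGE_GPU

print(f"pixels through the CNN: {warped_pixels:,}")
print(f"pixels in the image   : {image_pixels:,}")
print(f"ratio                 : {redundancy:.0f}x")
print(f"throughput on a GPU   : {images_per_hour:.0f} images per hour")
```

Under these assumptions the network sees several hundred times more pixels than the image contains, which is the kind of redundancy that sharing convolutional computation across proposals is meant to remove.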
Of course, R-CNN was aimed at accuracy this time: mAP on the ILSVRC 2013 dataset rose from OverFeat's 24.3% to 31.4%, the first qualitative leap.