Transferred from: http://lanbing510.info/2017/08/28/YOLO-SSD.html
Before the emergence of deep learning, traditional object detection was roughly divided into three parts: region selection (sliding window), feature extraction (SIFT, HOG, etc.), and classification (SVM, AdaBoost, etc.). It had two main problems: on one hand, the sliding-window selection strategy is untargeted, has high time complexity, and produces redundant windows; on the other hand, hand-crafted features are not robust. Since the emergence of deep learning, object detection has made huge breakthroughs, most notably in two directions: 1. region-proposal-based deep learning detection algorithms, represented by R-CNN (R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, etc.); 2. regression-based deep learning detection algorithms, represented by YOLO (YOLO, SSD, etc.). The previous article introduced the region-proposal-based algorithms; this article introduces the regression-based methods (YOLO, SSD). The former do not meet real-time speed requirements; the latter use the idea of regression (given an input image, directly regress the object bounding box and object category at multiple positions of the image), which greatly accelerates detection.

YOLO
Algorithm Features
1 Object detection is solved as a regression problem. A single end-to-end network goes from the raw image straight to object positions and categories: run inference once on the input image, and you obtain the positions of all objects in the image, their categories, and the corresponding confidence scores.
2 The YOLO network uses the GoogLeNet classification network structure. The difference is that YOLO does not use the Inception module; instead it makes a simple substitution of 1*1 convolutional layers (the 1*1 convolutions exist to fuse information across channels) followed by 3*3 convolutional layers (a sketch follows this list).
3 Fast YOLO uses 9 convolutional layers instead of YOLO's 24. The network is lighter, and speed rises from YOLO's 45 fps to 155 fps, though some detection accuracy is lost.
4 The full image is used as context information, so background errors (mistaking background for an object) are fewer.
5 Strong generalization ability: a model trained well on natural images still performs well on artwork.
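The 1*1 + 3*3 substitution from point 2 can be written down directly. A minimal PyTorch sketch of one such block; the channel counts here are illustrative, not the paper's exact configuration:

```python
import torch.nn as nn

# One reduction block of the YOLO backbone: a 1*1 convolution fuses and
# compresses information across channels, then a 3*3 convolution extracts
# spatial features. The 512/256 channel counts are illustrative.
block = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1),             # 1*1: cross-channel fusion
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size=3, padding=1),  # 3*3: spatial features
    nn.LeakyReLU(0.1),
)
```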
Network Structure

A Few Key Points
First, the overall process
1 Given an input image, first divide the image into a 7*7 grid.
2 For each grid cell, predict 2 bounding boxes (including, for each box, a confidence that the box contains an object, and the probabilities of the box region over the multiple categories).
3 Based on the previous step, 7*7*2 target windows are predicted. Then remove the low-probability target windows with a threshold, and finally let NMS remove the redundant windows (a NumPy sketch of this post-processing follows).
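A minimal NumPy sketch of this threshold-plus-NMS post-processing, assuming each candidate comes as a corner-form box (x1, y1, x2, y2) with a single score (function names and thresholds here are illustrative, not from the paper):

```python
import numpy as np

def iou(box, boxes):
    """IOU between one (4,) box and an (N, 4) array, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def postprocess(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Step 1: drop low-probability windows. Step 2: greedy NMS on the rest."""
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(scores)[::-1]          # highest-scoring box first
    picked = []
    while order.size > 0:
        best = order[0]
        picked.append(best)
        rest = order[1:]
        # suppress the remaining windows that overlap the chosen one too much
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return boxes[picked], scores[picked]
```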
Second, training
1 Pre-train the classification network: pre-train a classification network on the ImageNet 1000-class competition dataset. This network consists of the first 20 convolutional layers of the network structure above, plus an average-pooling layer and a fully connected layer (at this point the network input is 224*224).
2 Train the detection network: the literature [6] mentions that adding both convolutional and fully connected layers to a pre-trained network can improve performance. YOLO adds 4 convolutional layers and 2 fully connected layers with randomly initialized weights. Detection requires fine-grained visual information, so the network input is also raised from 224*224 to 448*448.
(1) A picture is divided into a 7*7 grid; the cell in which an object's center falls is responsible for predicting that object. Each cell predicts two bounding boxes. The cell is responsible for the category information, and each bounding box is responsible for its coordinate information (4 coordinates plus one confidence), so the final layer's output is 7*7*(2*(4+1)+20) = 7*7*30 dimensional.
(2) The coordinates of the bounding box are normalized to 0-1 using the size of the image. Confidence is computed as $\Pr(\text{Object}) \ast \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$, where the first factor indicates whether an object falls in the cell and the second is the IOU between the predicted box and the ground-truth box.
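A small NumPy sketch of pulling one cell apart from the 7*7*30 output tensor; the within-cell ordering (the 2 boxes' (x, y, w, h, confidence) first, then the 20 class probabilities) is an assumed layout for illustration:

```python
import numpy as np

S, B, C = 7, 2, 20                         # grid size, boxes per cell, classes

def decode_cell(output, row, col):
    """Split one cell of an (S, S, 30) output into boxes and class probabilities."""
    cell = output[row, col]                # shape (30,)
    boxes = cell[:B * 5].reshape(B, 5)     # each row: x, y, w, h, confidence
    class_probs = cell[B * 5:]             # Pr(Class_i | Object), shape (20,)
    return boxes, class_probs

output = np.random.rand(S, S, B * 5 + C)   # stand-in for a real network output
boxes, class_probs = decode_cell(output, 3, 3)
```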
3 Determining the loss function: the design goal of the loss function is to strike a good balance among the three aspects of coordinates, confidence, and category. Simply using sum-squared error for everything has the following deficiencies: ① treating the 8-dimensional localization error and the 20-dimensional classification error as equally important is clearly unreasonable; ② if a cell contains no object (and most cells in an image do not), the confidences of its boxes are pushed to 0, which overpowers the far fewer cells that do contain an object and can make the network unstable or even diverge. The solutions, keyed to the terms of the loss reproduced below, are as follows.
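For reference, the loss from the YOLO paper, whose five terms the points below refer to (two coordinate terms, an object-confidence term, a no-object-confidence term, and a classification term; $\mathbb{1}_{ij}^{obj}$ indicates that predictor $j$ in cell $i$ is responsible for an object):

$$
\begin{aligned}
\text{loss} ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in\text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$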
(1) Pay more attention to the 8-dimensional coordinate prediction by giving these losses a larger weight, denoted $\lambda_{coord}$; it is set to 5 for Pascal VOC training (the first two terms of the loss above).
(2) Give the confidence loss of boxes that contain no object a small weight, denoted $\lambda_{noobj}$; it is set to 0.5 for Pascal VOC training (the fourth term above).
(3) The confidence loss of boxes that do contain an object (the third term above) and the category loss (the last term above) take the normal weight of 1.
(4) For boxes of different sizes, the same coordinate offset is far more serious for a small box than for a large one, yet sum-squared error penalizes the same offset identically. To alleviate this problem, the width and height of the box are replaced by their square roots: on the square-root curve, a small box sits at a smaller horizontal-axis value, so the same offset produces a larger change (a larger loss) than for a big box. For example, a 5-pixel offset changes $\sqrt{w}$ by $\sqrt{15}-\sqrt{10}\approx 0.71$ for a 10-pixel-wide box, but only by $\sqrt{305}-\sqrt{300}\approx 0.14$ for a 300-pixel-wide box.
(5) Each cell predicts multiple boxes, but during training we want only one box predictor to be responsible for each object (ground-truth box), i.e. one object, one box. The specific practice is that the predictor whose box has the largest IOU with the ground-truth box is responsible for predicting it. This practice is called specialization of the box predictors: each predictor becomes better at predicting particular sizes, aspect ratios, or classes of object.
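A minimal sketch of this assignment rule, reusing the iou() helper from the post-processing sketch above (names are illustrative):

```python
import numpy as np

def responsible_predictor(gt_box, cell_boxes):
    """gt_box: (4,) ground-truth box; cell_boxes: (B, 4) boxes from one cell.
    The predictor whose box currently has the largest IOU with the ground
    truth is "responsible" and alone incurs coordinate/confidence loss."""
    return int(np.argmax(iou(gt_box, cell_boxes)))
```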
Third, testing
1 Compute a class-specific confidence score for each box: multiply the class information predicted by each cell, $\Pr(\text{Class}_i \mid \text{Object})$, by the confidence predicted for each box, $\Pr(\text{Object}) \ast \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$, giving $\Pr(\text{Class}_i) \ast \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$: the probability that class $i$ appears in the box, combined with how well the box fits the object.
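A small NumPy sketch of this product via broadcasting (the array names and the stand-in random inputs are illustrative):

```python
import numpy as np

S, B, C = 7, 2, 20
class_probs = np.random.rand(S, S, C)   # Pr(Class_i | Object) per cell (stand-in)
box_conf = np.random.rand(S, S, B)      # Pr(Object) * IOU per box (stand-in)

# Broadcast to an (S, S, B, C) tensor: each of the 7*7*2 = 98 boxes gets a
# class-specific confidence score for each of the 20 classes.
class_scores = box_conf[..., None] * class_probs[:, :, None, :]
print(class_scores.shape)               # (7, 7, 2, 20)
```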