YOLO: You Only Look Once: Unified, Real-Time Object Detection
This paper is not long and its core idea is relatively simple; what follows is essentially a translated summary of the paper.
YOLO is a convolutional neural network that predicts multiple box positions and class probabilities at once, enabling end-to-end detection and recognition of objects; its greatest advantage is speed. In essence, YOLO treats object detection as a regression problem, so the CNN that implements it does not need a complicated design pipeline. Rather than training on sliding windows or extracted proposals, YOLO trains directly on whole images. The benefit is that the model can better distinguish objects from background regions; by contrast, proposal-based methods such as Fast R-CNN often mistake background patches for specific objects. Of course, YOLO sacrifices some precision in exchange for detection speed. The YOLO detection pipeline is: 1. resize the image to 448*448; 2. run the CNN; 3. refine the detection results with non-maximum suppression. Interested readers can follow the instructions at http://pjreddie.com/darknet/install/ to install and try out YOLO's scoring process; it is very easy to get started. Next, we focus on how YOLO works.
5.1 Unified Detection
YOLO's design philosophy follows end-to-end training and real-time detection. YOLO divides the input image into an s*s grid; if the center of an object falls within a grid cell, that cell is responsible for detecting the object. During training and testing, each grid cell predicts B bounding boxes, and each bounding box corresponds to 5 predicted parameters: the center coordinates (x, y), the width and height (w, h), and a confidence score. The confidence score, Pr(Object) * IOU(pred, truth), jointly reflects the probability that the box contains an object, Pr(Object), and how accurately the predicted box localizes it, IOU(pred, truth). If there is no object in the bounding box, Pr(Object) = 0; if there is one, the IOU is computed between the predicted and ground-truth bounding boxes. In addition, each cell predicts the posterior probability that the contained object belongs to each class, Pr(Class_i | Object). Assuming there are C classes in total, each grid cell predicts only one set of conditional class probabilities Pr(Class_i | Object), i = 1, 2, ..., C, while predicting the positions of B bounding boxes; that is, the B bounding boxes share a single set of conditional class probabilities. Based on the Pr(Class_i | Object) values, a class-specific confidence can be computed for each bounding box at test time: Pr(Class_i | Object) * Pr(Object) * IOU(pred, truth) = Pr(Class_i) * IOU(pred, truth). If the input image is divided into a 7*7 grid (s=7), each cell predicts 2 bounding boxes (B=2), and there are 20 object classes (C=20), then the final prediction is a vector of length s*s*(B*5+C) = 7*7*30, which completes the detection + recognition task. The whole process can be understood through the following diagram.
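To make the output layout concrete, here is a minimal NumPy sketch of decoding the s*s*(B*5+C) prediction into class-specific confidences. The per-cell memory layout assumed here (B boxes of 5 values first, then C class probabilities) is chosen for illustration and is not necessarily the exact layout of the authors' code.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

# Stand-in for the network output, reshaped to (S, S, B*5 + C).
pred = np.random.rand(S, S, B * 5 + C)

# Assumed layout per cell: B boxes of (x, y, w, h, confidence), then C class probs.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # (S, S, B, 5)
class_probs = pred[..., B * 5:]                 # (S, S, C) = Pr(Class_i | Object)
box_conf = boxes[..., 4]                        # (S, S, B) = Pr(Object) * IOU

# Class-specific confidence per box:
# Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
scores = class_probs[:, :, None, :] * box_conf[..., None]  # (S, S, B, C)
print(scores.shape)  # (7, 7, 2, 20) -> 98 boxes, each with 20 class scores
```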
5.1.1 Network Design
YOLO's network design follows the GoogLeNet idea but differs from it in places. YOLO uses 24 cascaded convolutional (conv) layers and 2 fully connected (FC) layers, where the conv layers use both 3*3 and 1*1 kernels, and the last FC layer is the network output, with length s*s*(B*5+C) = 7*7*30. In addition, the authors designed a simplified version of the network (called Fast YOLO in the paper) with 9 cascaded conv layers and 2 FC layers; because it has far fewer conv layers, it runs much faster than YOLO. The figure below shows the architecture of the YOLO network.
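To complement the figure, here is a rough PyTorch sketch of just the output head, not the authors' implementation: the 7*7*1024 feature-map size matches the paper's architecture figure, but the layer details are simplified for illustration.

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

# Sketch of the final FC layers only; the full model has 24 conv layers
# before this point, producing a 7x7x1024 feature map.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 1024, 4096),
    nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (B * 5 + C)),  # length 7*7*30 = 1470
)

features = torch.randn(1, 1024, 7, 7)           # stand-in conv features
out = head(features).view(-1, S, S, B * 5 + C)  # (1, 7, 7, 30)
print(out.shape)
```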
5.1.2 Training
The authors train the YOLO network in stages. First, they take the first 20 conv layers from the network above, append an average pooling layer and an FC layer, and pretrain on the 1000-class ImageNet data. Training with 224*224 images on ImageNet 2012 yields a top-5 accuracy of 88%. The authors then add 4 new conv layers and 2 FC layers after the 20 pretrained conv layers, initialize the newly added layers with random parameters, and fine-tune the network on 448*448 images. The last FC layer predicts both the class probabilities and the bounding box coordinates x, y, w, h. The bounding box width and height are normalized by the image width and height, and the center coordinates are expressed relative to the responsible grid cell, so x, y, w, h all lie between 0 and 1.
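A small sketch of this coordinate encoding, with illustrative function and variable names:

```python
def encode_box(box, img_w, img_h, S=7):
    """Convert an absolute box (cx, cy, w, h) in pixels into YOLO targets.

    w and h are normalized by the image size; x and y are the offsets of
    the center within its grid cell, so all four values lie in [0, 1].
    """
    cx, cy, w, h = box
    col = min(int(cx / img_w * S), S - 1)  # cell responsible for the object
    row = min(int(cy / img_h * S), S - 1)
    x = cx / img_w * S - col               # center offset within the cell
    y = cy / img_h * S - row
    return row, col, (x, y, w / img_w, h / img_h)

# Example: a 100*200 box centered at (224, 224) in a 448*448 image falls
# in cell (3, 3) with center offsets (0.5, 0.5).
print(encode_box((224, 224, 100, 200), 448, 448))
```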
When designing the loss function, there are two main problems: 1. For the 7*7*30 prediction from the last layer, the usual choice for the prediction loss is the sum-squared error, but this loss weights localization error and classification error equally (1:1). 2. The whole image has a 7*7 grid, and most grid cells do not actually contain objects (a cell counts as containing an object only when the object's center falls inside it); if one only computes Pr(Class_i), the classification probabilities of many cells are 0. The per-cell losses then exhibit sparse-matrix characteristics, the loss converges poorly, and the model is unstable. To solve these problems, the authors adopt a series of measures:
1. Increase the loss weight of the bounding box coordinate predictions, and decrease the loss weight of the confidence predictions for boxes that do not contain objects. The corresponding weights are λcoord = 5 and λnoobj = 0.5, respectively.
2. The sum-squared error weights errors in large and small bounding boxes equally. To reduce the influence of box size on the width and height loss, the square-root form is used for the width and height predictions, i.e., sqrt(w) and sqrt(h); see the small example below.
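A quick numeric illustration of why the square root helps: the same 10-pixel width error is penalized far less for a large box than for a small one.

```python
# Effect of the square-root parameterization on the width loss term.
w_true, w_pred = 100.0, 110.0                 # large box, 10 px off
print((w_true**0.5 - w_pred**0.5) ** 2)       # ~0.24
w_true, w_pred = 10.0, 20.0                   # small box, also 10 px off
print((w_true**0.5 - w_pred**0.5) ** 2)       # ~1.72
```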
The full training loss combines these coordinate, confidence, and classification terms and is somewhat involved; it is reproduced below for reference, and interested readers can consult the author's original text for a fuller discussion.
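The loss as given in the original paper, where \mathbb{1}_{ij}^{obj} indicates that the j-th box predictor in cell i is responsible for a ground-truth object, \mathbb{1}_{i}^{obj} that cell i contains an object, and hats mark predicted values:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}
      \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}}
      \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```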
5.1.3 Testing
The authors train and test the YOLO network on PASCAL VOC images; for each image the network predicts 98 (7*7*2) bounding boxes and the corresponding class probabilities. Usually one cell can directly predict the bounding box of an object, but for objects that are large or close to the boundary between cells, multiple cells may produce predictions, and these are merged by non-maximum suppression. Although YOLO depends on non-maximum suppression less than R-CNN and DPM do, non-maximum suppression does in fact add 2 to 3 points of mAP.
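For completeness, a minimal NumPy sketch of greedy non-maximum suppression as commonly implemented; the 0.5 IOU threshold is illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns indices of the boxes that survive suppression.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IOU of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate box 1 is suppressed
```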
5.2 Method Comparison
The authors compare the YOLO detection and recognition method with several other classical schemes:
DPM (Deformable Parts Models): DPM is a sliding-window approach to object detection whose pipeline consists of several independent stages: feature extraction, region classification, and bounding box prediction for high-scoring regions. YOLO instead uses an end-to-end training approach that connects feature extraction, candidate box prediction, non-maximum suppression, and object recognition, yielding a faster and more accurate detection model.
R-CNN: The R-CNN scheme must first extract proposals with the Selective Search method, then extract features with a CNN, and finally train an SVM classifier; such a pipeline is quite cumbersome. YOLO is similar in essence, but proposals and object recognition share convolutional features. In addition, YOLO imposes spatial constraints on the proposals through the grid, avoiding repeated proposal extraction in the same regions: compared with the 2000 Selective Search proposals used for R-CNN training, YOLO needs only 98, so both training and testing are much faster.
Fast R-CNN, Faster R-CNN, fast DPM variants: Fast R-CNN and Faster R-CNN replaced the SVM training and the Selective Search proposal extraction, respectively, which sped up training and testing to some extent, but their speed still cannot compare with YOLO's. Similarly, DPM optimized to run on the GPU still falls short of YOLO.
5.3 Experiments
5.3.1 Real-time detection and recognition system comparison
5.3.2 VOC2007 Accuracy Comparison
5.3.3 Fast R-CNN and YOLO Error Analysis
As shown in the figure, the different regions represent different categories of predictions:
Correct: correctly detected and recognized, i.e., correct class and IOU > 0.5
Localization: correct class, but 0.1 < IOU < 0.5
Similar: similar class, IOU > 0.1
Other: wrong class, IOU > 0.1
Background: IOU < 0.1 for any object
As the figure shows, YOLO is less accurate than Fast R-CNN at localizing objects: localization errors account for the largest share of YOLO's errors, about 10 points higher than for Fast R-CNN. However, YOLO misclassifies background less often, and it is evident that Fast R-CNN's false positive rate is high (Background = 13.6%, meaning a box is predicted as an object but actually contains none).
5.3.4 VOC2012 Accuracy Comparison
Because YOLO's errors differ in character from Fast R-CNN's, the authors designed a combined Fast R-CNN + YOLO detection and recognition scheme: first extract a set of bounding boxes with Fast R-CNN, then process the same image with YOLO to obtain another set of bounding boxes. Where the two sets of boxes are basically consistent, i.e., YOLO predicts the same class for an overlapping box, the final bounding box region is taken as the intersection of the two regions (see the sketch below). The maximum accuracy of Fast R-CNN alone reaches 71.8%, while Fast R-CNN + YOLO raises the accuracy to 75%. This improvement stems from YOLO making different kinds of errors at test time than Fast R-CNN. Although Fast R-CNN + YOLO improves accuracy, the corresponding detection speed drops sharply, so real-time detection is no longer possible.
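A hedged sketch of this combination procedure as described above; the exact matching rule, the 0.5 IOU threshold, and all names here are assumptions for illustration, not the authors' implementation.

```python
def box_iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def combine(rcnn_dets, yolo_dets, iou_thresh=0.5):
    """Keep a Fast R-CNN detection (box, class) only when YOLO predicts an
    overlapping box of the same class; the final region is the intersection
    of the two boxes, following the description above."""
    out = []
    for r_box, r_cls in rcnn_dets:
        for y_box, y_cls in yolo_dets:
            if r_cls == y_cls and box_iou(r_box, y_box) >= iou_thresh:
                out.append(((max(r_box[0], y_box[0]), max(r_box[1], y_box[1]),
                             min(r_box[2], y_box[2]), min(r_box[3], y_box[3])),
                            r_cls))
                break
    return out
```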
Testing different algorithms on VOC2012, YOLO reaches a mean average precision of mAP = 57.9%, comparable to the VGG16-based R-CNN detection algorithm. Examining the results across object sizes, the authors found that YOLO's accuracy on small objects is about 8~10% lower than R-CNN's, while on large objects its accuracy is higher than R-CNN's. Fast R-CNN + YOLO achieves the highest accuracy, 2.3% higher than Fast R-CNN alone.
5.4 Summary
YOLO is a convolutional neural network that supports end-to-end training and testing, and it can detect and recognize multiple objects in an image while maintaining reasonable accuracy.