Reference Link:
http://blog.csdn.net/tangwei2014
Following R-CNN, Fast R-CNN, and Faster R-CNN, this is another major work with rbg (Ross Girshick) among its authors, and it has a playful name: YOLO.
Although the current version still has some flaws, it tackles a major pain point of current deep-learning-based detection: speed.
Its enhanced version runs at 45 fps on a GPU, and the simplified version at 155 fps.
Paper Download: http://arxiv.org/abs/1506.02640
Code Download: https://github.com/pjreddie/darknet
This blog post focuses on the method. The experimental results and the rest will be covered another time.
1. YOLO's core idea
YOLO's core idea is to take the entire image as the network's input and to regress the bounding box positions and the classes they belong to directly at the output layer.
Recall that Faster R-CNN also takes the entire image as input, but as a whole it still follows R-CNN's proposal + classifier idea; it merely implements the proposal-extraction step with a CNN.
2. How YOLO is implemented
- Divide the image into an S x S grid of cells (grid cells). If the center of an object falls in a grid cell, that cell is responsible for predicting the object.
Each grid cell predicts B bounding boxes, and each bounding box predicts a confidence value in addition to its position.
This confidence carries two pieces of information: how confident the model is that the predicted box contains an object, and how accurate the box prediction is. Its value is calculated as follows:
$$\text{confidence} = \Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}}$$
If an object falls in the grid cell, the first term takes 1; otherwise it takes 0. The second term is the IOU between the predicted bounding box and the actual ground truth.
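The two factors of the confidence can be sketched as follows. This is a minimal illustration that assumes boxes are given as corner coordinates (x1, y1, x2, y2); the function names `iou` and `confidence_target` are made up for this example and do not come from the paper or from darknet.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, pred_box, truth_box):
    """Pr(Object) * IOU: 0 for a cell without an object, else the IOU."""
    if not has_object:
        return 0.0
    return iou(pred_box, truth_box)
```

For two 2x2 boxes offset by one unit in each direction, the overlap is 1 and the union is 7, so the confidence target would be 1/7.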
Each bounding box predicts (x, y, w, h) and a confidence, 5 values in total, and each grid cell also predicts class information for C classes. So with an S x S grid, where each cell predicts B bounding boxes and C class values, the output is a tensor of size S x S x (5*B + C).
Note: The class information is per grid cell, and the confidence information is per bounding box.
For example: in Pascal VOC, the input image is 448x448; taking S=7 and B=2 with 20 classes (C=20), the output is a 7x7x30 tensor.
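The arithmetic behind the output size can be checked with a few lines. The helper name `yolo_output_shape` is only illustrative.

```python
def yolo_output_shape(S, B, C):
    """Prediction tensor shape: S x S cells, each holding 5*B + C values."""
    return (S, S, 5 * B + C)


# Pascal VOC setting from the text: S=7, B=2, C=20.
shape = yolo_output_shape(S=7, B=2, C=20)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)  # (7, 7, 30) 1470
```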
The entire network structure looks like this:
At test time, the class information predicted by each grid cell is multiplied by the confidence predicted by each bounding box to get the class-specific confidence score of each bounding box:
$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}} = \Pr(\text{Class}_i) \times \text{IOU}_{\text{pred}}^{\text{truth}}$$
The first term on the left of the equation is the class information predicted by each grid cell, and the second and third terms are the confidence predicted by each bounding box. The product encodes both the probability that the predicted box belongs to a certain class and how accurate the box is.
After getting each box's class-specific confidence score, set a threshold, filter out the low-scoring boxes, and apply NMS (non-maximum suppression) to the remaining boxes to get the final detection results.
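The test-time steps above can be sketched as follows. This is a plain greedy NMS for a single class, not darknet's implementation; the threshold values 0.2 and 0.5 and the names `class_scores` and `nms` are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_scores(class_probs, box_confidences):
    """Per cell: score[b][c] = Pr(class c | object) * confidence of box b."""
    return [[p * conf for p in class_probs] for conf in box_confidences]


def nms(boxes, scores, score_threshold=0.2, iou_threshold=0.5):
    """Greedy NMS for one class; returns the indices of the boxes kept."""
    # Drop low-scoring boxes, then visit the rest from highest score down.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In practice NMS would be run once per class over the class-specific scores of all S x S x B boxes.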
3. Implementation details of YOLO
Each grid cell has 30 dimensions: 8 are the regressed box coordinates, 2 are the box confidences, and 20 are the classes.
The coordinates x and y are expressed as offsets relative to the corresponding grid cell and normalized to 0-1; w and h are normalized to 0-1 by the image width and height.
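One way to read this normalization, together with the rule that the cell containing the object's center is responsible for it, is the sketch below. It assumes the ground-truth box is given by its center and size in pixels; `encode_box` is a hypothetical helper, not a darknet function.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a ground-truth box (center cx, cy and size w, h, in pixels) to
    the responsible cell (row, col) and normalized targets (x, y, w, h)."""
    cell_w = img_w / S
    cell_h = img_h / S
    # The cell containing the box center is responsible for the object.
    col = min(int(cx / cell_w), S - 1)
    row = min(int(cy / cell_h), S - 1)
    # x, y: offset of the center inside that cell, in units of one cell.
    x = cx / cell_w - col
    y = cy / cell_h - row
    # w, h: normalized by the full image size.
    return row, col, (x, y, w / img_w, h / img_h)


print(encode_box(224, 224, 112, 224, 448, 448))
# (3, 3, (0.5, 0.5, 0.25, 0.5))
```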
In the implementation, the main question is how to design the loss function so that these three aspects (coordinates, confidence, and class) are well balanced. The author simply and bluntly uses sum-squared error loss for all of it.
This approach has the following problems:
First, treating the 8-dimensional localization error and the 20-dimensional classification error as equally important is clearly unreasonable;
second, if a grid cell contains no object (and an image has many such cells), the confidence of the boxes in those cells is pushed toward 0. Compared with the few cells that do contain an object, this effect is overpowering, which can make the network unstable or even cause it to diverge.
Workaround:
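- Put more emphasis on the 8-dimensional coordinate predictions by giving these losses a larger loss weight, set to 5 in Pascal VOC training.
- Give the confidence loss of boxes that contain no object a smaller loss weight, set to 0.5 in Pascal VOC training.
- Keep the loss weight at 1 for the confidence loss of boxes that contain an object and for the classification loss.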
For boxes of different sizes, a small box tolerates a given offset far less than a large box does, yet sum-squared error loss assigns the same loss to the same offset.
To alleviate this problem, the author uses a somewhat tricky method: the loss uses the square roots of the box width and height instead of the raw width and height. This is easy to see from a plot of the square-root curve: a small box sits at a small value on the horizontal axis, where the curve is steep, so the same shift produces a larger change on the y-axis than it does for a large box.
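A small numeric check makes the effect concrete. The widths 0.1 and 0.8 and the shift 0.05 are arbitrary values picked for illustration (widths normalized to 0-1).

```python
import math


def squared_error(pred, truth):
    """Plain squared error on the raw width."""
    return (pred - truth) ** 2


def sqrt_squared_error(pred, truth):
    """Squared error on the square roots of the widths."""
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2


shift = 0.05
small_w, large_w = 0.1, 0.8

# Plain squared error: the same shift costs the same for both boxes.
plain_small = squared_error(small_w + shift, small_w)
plain_large = squared_error(large_w + shift, large_w)

# Square-root version: the small box is penalized more for the same shift.
root_small = sqrt_squared_error(small_w + shift, small_w)
root_large = sqrt_squared_error(large_w + shift, large_w)

print(plain_small, plain_large)
print(root_small, root_large)
```

With these values the square-root loss for the small box comes out several times larger than for the large box, while the plain squared errors are equal up to floating-point rounding.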
Each grid cell predicts multiple boxes, and the hope is that each box predictor specializes in predicting one object. Concretely, whichever predicted box currently has the largest IOU with the ground truth box is made responsible for that ground truth box. This practice is called specialization of the box predictors.
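The assignment rule amounts to an argmax over the IOUs of a cell's B predicted boxes. The sketch below assumes those IOUs with the ground truth box have already been computed; `responsible_predictor` is an illustrative name.

```python
def responsible_predictor(ious):
    """Index of the box predictor whose prediction has the largest IOU
    with the ground truth box (the first one wins on ties)."""
    return max(range(len(ious)), key=lambda j: ious[j])


# With B=2 predictors whose boxes overlap the ground truth by 0.33 and 0.80,
# the second predictor (index 1) is responsible for this object.
print(responsible_predictor([0.33, 0.80]))  # 1
```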
- Finally the entire loss function is as follows:
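For reference, the loss as written in the YOLO paper is reproduced below. The notation is the paper's and is not defined elsewhere in this post: $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when the $j$-th box predictor in cell $i$ is responsible for an object, $\mathbb{1}_{i}^{\text{obj}}$ is 1 when an object appears in cell $i$, $C_i$ here denotes the box confidence (not the number of classes), $p_i(c)$ is the class probability, hats mark the other member of each prediction/target pair, and the paper sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
  &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two lines are the coordinate terms, the next two are the confidence terms for boxes with and without an object, and the last line is the classification term.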
In this loss function:
- Classification error is penalized only when there is an object in the grid cell.
- The coordinate error of a box is penalized only when that box predictor is responsible for a ground truth box; which ground truth box it is responsible for depends on whether the IOU between its prediction and the ground truth box is the largest among all boxes in that cell.
- Other details, such as using leaky ReLU as the activation function and pre-training the model on ImageNet, are not covered here.
4. Disadvantages of YOLO
YOLO performs poorly on objects that are close to each other and on groups of very small objects, because each grid cell predicts only two boxes and belongs to only one class.
Its generalization is weak when objects of a class appear in test images with new or uncommon aspect ratios or in other unusual configurations.
Because of the loss function, localization error is the main factor hurting detection performance. In particular, the handling of large versus small objects still needs improvement.
R-CNN Learning Notes (6): You Only Look Once (YOLO): Unified, Real-Time Object Detection