Understanding the YOLO object detection method (II)

Last Update:2018-07-26 Source: Internet

Author: User


This article is reproduced from:

http://blog.csdn.net/u011534057/article/details/51244354

Reference Link:
http://blog.csdn.NET/tangwei2014

After R-CNN, Fast R-CNN and Faster R-CNN, this is another masterpiece from rbg (Ross Girshick), and it has a very entertaining name: YOLO.
The current version still has some shortcomings, but it tackles a major pain point of current deep-learning-based detection: speed.
On a GPU its enhanced version runs at 45 fps, and the simplified version reaches 155 fps.

Paper Download: http://arxiv.org/abs/1506.02640
Code Download: https://github.com/pjreddie/darknet

This blog post focuses on the method. The experimental results and the rest will be covered in a later post.
1. YOLO's core idea

YOLO's core idea is to use the entire image as the network input and to regress the bounding box positions and the classes they belong to directly at the output layer.

Recall that Faster R-CNN also takes the entire image as input, but overall it still follows the R-CNN idea of proposal + classifier; it merely moves the proposal extraction step into a CNN.

2. YOLO's implementation method

The image is divided into an S x S grid of cells. If the center of an object falls in a grid cell, that cell is responsible for predicting the object.
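The assignment of an object to a grid cell can be sketched in a few lines. This is only an illustration: the function name `responsible_cell` and the pixel-coordinate convention (origin at the top-left corner) are assumptions, not something taken from the paper or from darknet.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object center.

    cx, cy are the center coordinates in pixels (hypothetical convention).
    """
    # Scale the center into grid units and truncate to the cell index.
    # min() keeps a center lying exactly on the right/bottom edge inside the grid.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```

With a 448x448 image and S=7 each cell covers 64x64 pixels, so a center at (224, 100) lands in column 3, row 1.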

Each grid cell predicts B bounding boxes, and for each bounding box it predicts a confidence value in addition to the position.
This confidence reflects both how confident the model is that the predicted box contains an object and how accurate the predicted box is. It is calculated as follows:

$$\text{confidence} = \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

If an object falls in the grid cell, the first term is 1; otherwise it is 0. The second term is the IOU between the predicted bounding box and the ground truth.
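As a rough sketch of how this confidence target could be computed: the corner format `(x_min, y_min, x_max, y_max)` and the helper names `iou` and `confidence_target` are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, pred_box=None, truth_box=None):
    """First term (1 or 0) times the IOU between prediction and ground truth."""
    if not has_object:
        return 0.0
    return 1.0 * iou(pred_box, truth_box)
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IOU is 1 / (4 + 4 - 1) = 1/7.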

Each bounding box predicts 5 values, (x, y, w, h) and the confidence, and each grid cell also predicts class information for C classes. So with an S x S grid in which each cell predicts B bounding boxes and C class probabilities, the output is a tensor of size S x S x (5*B + C).
Note: the class information is per grid cell, while the confidence information is per bounding box.

For example, on Pascal VOC the input image is 448x448, with S=7, B=2 and 20 classes (C=20), so the output is a 7x7x30 tensor.
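The arithmetic behind the 7x7x30 figure can be checked directly. The helper name `output_shape` is made up for this example.

```python
def output_shape(S, B, C):
    """Shape of the output tensor: S x S cells, each with 5*B box values and C class values."""
    return (S, S, 5 * B + C)


shape = output_shape(S=7, B=2, C=20)
print(shape)                           # (7, 7, 30)
print(shape[0] * shape[1] * shape[2])  # 1470 values in total
```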
The entire network structure is shown in the following diagram:

At test time, the class information predicted by each grid cell is multiplied by the confidence predicted by each bounding box to get a class-specific confidence score for each bounding box:

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}}$$

The first term on the left of the equation is the class information predicted by each grid cell, and the second and third terms are the confidence predicted by each bounding box. The product encodes both the probability that the predicted box belongs to a certain class and how accurate the box is.
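For a single grid cell, the multiplication could look like the sketch below. The function name `class_specific_scores` and the list-based layout are assumptions for illustration.

```python
def class_specific_scores(class_probs, box_confidences):
    """Class-specific confidence scores for one grid cell.

    class_probs:      C per-cell class probabilities
    box_confidences:  B per-box confidences
    Returns a B x C nested list: scores[b][c] = class_probs[c] * box_confidences[b].
    """
    return [[p * conf for p in class_probs] for conf in box_confidences]


scores = class_specific_scores([0.5, 0.25], [1.0, 0.5])
print(scores)  # [[0.5, 0.25], [0.25, 0.125]]
```

With S=7 and B=2 this gives 7*7*2 = 98 boxes, each carrying one score per class.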

After getting the class-specific confidence score of each box, set a threshold to filter out the low-scoring boxes, then apply NMS (non-maximum suppression) to the remaining boxes to get the final detection results.
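The filtering and NMS step for one class might be sketched as a simple greedy procedure. The threshold values 0.2 and 0.5, the corner box format and the function names are assumptions chosen for the example, not values taken from the article.

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Return indices of the boxes kept for one class, highest score first."""
    # Step 1: drop boxes whose class-specific score is below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    # Step 2: visit the rest in descending score order and keep a box only
    # if it does not overlap too much with a box that was already kept.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    for i in candidates:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(filter_and_nms(boxes, scores))  # [0, 2]
```

Box 1 is suppressed because its IOU with box 0 is 81/119, about 0.68, and box 3 is removed by the score threshold.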

3. YOLO implementation details

Each grid cell has 30 dimensions: 8 are the regressed box coordinates, 2 are the box confidences, and 20 are the classes.
The x, y coordinates are normalized to 0-1 as offsets relative to the corresponding grid cell, and w, h are normalized to 0-1 by the image width and height.
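The normalization described above can be sketched as an encoding function. The name `encode_box` and the pixel-based input format (center, width, height) are assumptions for illustration.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Encode a box given in pixels as (row, col, (x, y, w, h)) regression targets.

    x, y: offset of the center inside its grid cell, in the range 0-1
    w, h: box size divided by the image width and height, in the range 0-1
    """
    gx = cx / img_w * S            # center position in grid units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)      # cell that contains the center
    row = min(int(gy), S - 1)
    return row, col, (gx - col, gy - row, w / img_w, h / img_h)


print(encode_box(224, 112, 112, 224, 448, 448))  # (1, 3, (0.5, 0.75, 0.25, 0.5))
```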

In the implementation, the most important question is how to design the loss function so that these three aspects (coordinates, confidence and classes) are well balanced. The author simply and crudely uses a sum-squared error loss for all of them.
There are several issues with this approach:
First, treating the 8-dimensional localization error and the 20-dimensional classification error as equally important is obviously unreasonable;
Second, if a grid cell contains no object (and there are many such cells in an image), the confidence of the boxes in these cells is pushed to 0, and this overpowers the contribution of the fewer cells that do contain objects. This can lead to network instability and even divergence.
Solution: put more weight on the 8-dimensional coordinate prediction by giving these losses a larger loss weight, denoted λ_coord, which is set to 5 in Pascal VOC training. Give the confidence loss of boxes without an object a small loss weight, denoted λ_noobj, which is set to 0.5 in Pascal VOC training. The confidence loss of boxes with an object and the classification loss keep the normal loss weight of 1.

For boxes of different sizes, the same deviation is definitely less tolerable in a small box than in a big box. Yet under sum-squared error loss the same offset produces the same loss.
To alleviate this problem, the author used a rather tricky approach: predict the square roots of the box width and height instead of the raw width and height. This is easy to understand from a plot of y = sqrt(x): a small box has a small x-axis value, so when an offset occurs, the change on the y-axis is larger than for a large box.
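A small numeric check makes the effect of the square root visible. The widths below are made-up values chosen so the square roots come out cleanly.

```python
import math


def squared_error(pred, truth):
    """Plain sum-squared error on the raw width (or height)."""
    return (pred - truth) ** 2


def sqrt_squared_error(pred, truth):
    """Squared error on the square roots, as used for w and h."""
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2


# The same offset of 9 units on a small box (width 16) and a large box (width 400).
small_plain = squared_error(25, 16)
large_plain = squared_error(409, 400)
small_sqrt = sqrt_squared_error(25, 16)
large_sqrt = sqrt_squared_error(409, 400)

print(small_plain, large_plain)  # identical on the raw widths
print(small_sqrt, large_sqrt)    # the small box is penalized far more
```

On the raw widths both errors are 81. After taking square roots the small box gives (5 - 4)^2 = 1, while the large box gives roughly 0.05.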

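One grid cell predicts multiple boxes, and the hope is that each box predictor specializes in predicting certain objects. Concretely, whichever predicted box currently has the largest IOU with the ground truth box is responsible for that ground truth box. This practice is called the specialization of the box predictors. Finally, the entire loss function is as follows:

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
$$

In this loss function, C_i is the box confidence (not the number of classes), p_i(c) is the class probability, 1_i^obj indicates that an object appears in cell i, and 1_ij^obj indicates that the j-th box predictor in cell i is responsible for that object.
The classification error is penalized only when there is an object in a grid cell. The coordinate error of a box is penalized only when that box predictor is responsible for a ground truth box, that is, when its predicted box has the largest IOU with the ground truth box among all boxes in that cell. Other details, such as using leaky ReLU as the activation function and pre-training the model on ImageNet, are not covered here.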
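To make the rules above concrete, here is a minimal single-cell sketch of the loss. It assumes boxes are given as (x, y, w, h) with x, y as offsets inside the cell and w, h as non-negative fractions of the image, that the confidence target of the responsible box is its IOU with the ground truth, and that non-responsible boxes are treated as "no object" boxes. The function names are hypothetical, and this is not the darknet code.

```python
import math


def box_iou(a, b, S=7):
    """IOU of two (x, y, w, h) boxes that belong to the same grid cell."""
    def corners(box):
        x, y, w, h = box
        cx, cy = x / S, y / S  # center in image units, relative to the cell corner
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax0, ay0, ax1, ay1 = corners(a)
    bx0, by0, bx1, by1 = corners(b)
    inter = max(0.0, min(ax1, bx1) - max(ax0, bx0)) * max(0.0, min(ay1, by1) - max(ay0, by0))
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def cell_loss(pred_boxes, pred_confs, pred_classes, truth_box=None, truth_classes=None,
              lambda_coord=5.0, lambda_noobj=0.5, S=7):
    """Sum-squared loss for one grid cell; truth_box is None for an empty cell."""
    resp, target_conf = None, 0.0
    if truth_box is not None:
        # The predictor with the largest IOU is responsible for the ground truth box.
        ious = [box_iou(p, truth_box, S) for p in pred_boxes]
        resp = max(range(len(ious)), key=lambda j: ious[j])
        target_conf = ious[resp]

    loss = 0.0
    for j, ((x, y, w, h), conf) in enumerate(zip(pred_boxes, pred_confs)):
        if j == resp:
            tx, ty, tw, th = truth_box
            loss += lambda_coord * ((x - tx) ** 2 + (y - ty) ** 2)
            loss += lambda_coord * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)
            loss += (conf - target_conf) ** 2
        else:
            # Boxes that are not responsible for any object: confidence target is 0.
            loss += lambda_noobj * conf ** 2

    if truth_box is not None:
        # The classification error counts only when the cell contains an object.
        loss += sum((p - t) ** 2 for p, t in zip(pred_classes, truth_classes))
    return loss
```

For an empty cell with predicted confidences 0.2 and 0.4, only the down-weighted confidence term remains: 0.5 * (0.04 + 0.16) = 0.1.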

4. Disadvantages of YOLO

YOLO performs poorly on objects that are close to each other and on groups of very small objects, because each grid cell predicts only two boxes and they can belong to only one class.

When objects of a known class appear in a test image with new or uncommon aspect ratios or in other unusual configurations, the generalization ability is weak.

Due to the design of the loss function, localization error is the main factor hurting detection performance. In particular, the handling of large versus small objects still needs to be improved.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
