"CV paper reading" YOLO: Unified, Real-Time Object Detection

Source: Internet
Author: User

One of YOLO's main strengths is speed: it can run fully in real time. The reason is that the whole detection method is very simple. It treats detection as a regression problem and predicts object classes and locations directly from the original image.

Multi-Task detection:

The network unifies object classification and localization in a single deep network and detects multiple objects in the original image simultaneously. The steps are as follows:

(1) Divide the image into an S*S grid. If the center of an object falls in a grid cell, that cell is responsible for the object. The center here should mean the center of the object's ground-truth box.

(2) For each grid cell, predict B bounding boxes and a confidence for each. As for how the boxes are chosen, compare Faster R-CNN: there the network predicts boxes from preset aspect ratios and sizes, which is not needed here, because box selection in Faster R-CNN is really part of region proposal, while YOLO regresses its boxes directly. The confidence is the product of two parts, Pr(Object) * IOU(pred, truth): Pr(Object) = 0 when the cell contains no object and 1 otherwise, and the IOU is measured between the predicted box and the ground-truth box. The confidence therefore captures both whether an object is present and how accurate the predicted box is. Each bounding box also has four coordinates: x, y, w, h.

(3) For each grid cell that contains an object, predict C class probabilities. Since each cell also predicts B boxes, each with 4 coordinates and a confidence, the final output is an S*S*(B*5+C) tensor.

Steps (1) to (3) describe how the training targets are defined from the ground-truth boxes.
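Steps (1) to (3) can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names are hypothetical, box centers are assumed to be given in pixels, and S=7, B=2, C=20 are the PASCAL VOC settings reported in the YOLO paper.

```python
def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the grid cell that contains the box center (cx, cy)."""
    # Scale the pixel center to grid units, then truncate to a cell index.
    # The min() clamp keeps a center lying exactly on the right/bottom edge inside the grid.
    col = min(int(cx * S / img_w), S - 1)
    row = min(int(cy * S / img_h), S - 1)
    return row, col


def output_size(S, B, C):
    """Number of predicted values: S*S cells, each with B boxes of
    (x, y, w, h, confidence) plus C class probabilities."""
    return S * S * (B * 5 + C)


S, B, C = 7, 2, 20  # assumed example values (PASCAL VOC settings from the paper)
print(responsible_cell(224, 100, 448, 448, S))  # (1, 3)
print(output_size(S, B, C))                     # 1470
```

With these settings each image yields a 7*7*30 tensor, and an object centered at pixel (224, 100) in a 448*448 image is assigned to the cell in row 1, column 3.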

(4) At test time, the network predicts the S*S*(B*5+C) tensor. Multiplying each cell's class probabilities by each box's confidence gives the class-specific confidence score for every bounding box.

(5) Filter the boxes by applying a threshold to these scores, then run non-maximum suppression (NMS) to obtain the final detections.
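Steps (4) and (5) can be sketched as below. This is an illustration under stated assumptions rather than the paper's implementation: boxes are assumed to be already decoded into (x1, y1, x2, y2) corners, NMS is done greedily per class, and the function names and both threshold values are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect(boxes, confidences, class_probs, score_thresh=0.2, iou_thresh=0.5):
    """Return kept detections as (score, class_id, box), highest score first.

    class_probs[k] holds the class probabilities of the cell that predicted box k.
    """
    # Step (4): class-specific score = class probability * box confidence.
    candidates = []
    for box, conf, probs in zip(boxes, confidences, class_probs):
        for cls, p in enumerate(probs):
            score = p * conf
            if score >= score_thresh:  # Step (5), part 1: threshold filter
                candidates.append((score, cls, box))

    # Step (5), part 2: greedy NMS. Visit candidates from highest score down and
    # drop any box that overlaps an already kept box of the same class too much.
    candidates.sort(key=lambda t: t[0], reverse=True)
    kept = []
    for score, cls, box in candidates:
        if all(k_cls != cls or iou(box, k_box) <= iou_thresh
               for _, k_cls, k_box in kept):
            kept.append((score, cls, box))
    return kept


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
confidences = [0.9, 0.8, 0.7]
class_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
for score, cls, box in detect(boxes, confidences, class_probs):
    print(round(score, 2), cls, box)
```

In this example the second box nearly coincides with the first and has the same class, so NMS discards it; the first and third boxes remain.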

Network structure:

The network structure closely resembles GoogLeNet. It uses 1*1 convolutions to compress information and to build more nonlinear, abstract features, since a 1*1 convolution acts like a multi-layer perceptron applied at each position. Judging by the channel counts, the architecture shown in the paper appears to leave out some intermediate convolutional layers.

Some details:

Pre-training: the network is pre-trained on ImageNet, using the first 20 convolutional layers followed by an average-pooling layer and a fully connected layer.

Prediction: because detection needs fine-grained visual information, the input resolution is increased to 448*448, and four convolutional layers and two fully connected layers are added on top of the pre-trained layers. The last layer predicts both the class probabilities and the bounding boxes, with the bounding box coordinates normalized to the range 0 to 1.
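The normalization of the box coordinates can be sketched as follows. This assumes the parameterization described in the YOLO paper: x and y are offsets of the box center within its grid cell, and w and h are fractions of the image size. The function name and the pixel-based input format are hypothetical.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S):
    """Convert a pixel box (center cx, cy; size w, h) into normalized targets.

    Returns (row, col, (x_off, y_off, w_norm, h_norm)), all four values in [0, 1].
    """
    gx = cx * S / img_w  # center in grid units along x
    gy = cy * S / img_h  # center in grid units along y
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    x_off = gx - col     # offset of the center inside its cell
    y_off = gy - row
    return row, col, (x_off, y_off, w / img_w, h / img_h)


# A 112*56 box centered at (224, 100) in a 448*448 image, with S = 7.
print(encode_box(224, 100, 112, 56, 448, 448, 7))
# (1, 3, (0.5, 0.5625, 0.25, 0.125))
```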

Activation function: the last layer uses a linear activation, while all other layers use the leaky ReLU, phi(x) = x if x > 0 and 0.1x otherwise.

Error propagation: the error is computed with a simple sum-of-squared-error function. However, the network structure shows that the predicted class probabilities have more dimensions than the bounding box coordinates, so treating all terms equally under-weights localization. In addition, most grid cells in an image contain no object, which pushes their confidence toward 0. Their combined contribution is large enough to overwhelm the error from the cells that do contain objects, so the network may fail to converge.

The paper therefore weights the terms differently: the bounding box coordinate error gets a larger weight (lambda_coord = 5), and the confidence error of grid cells without objects gets a smaller one (lambda_noobj = 0.5). Also, a small deviation matters less in a large box than in a small box, so the loss uses the square roots of w and h rather than w and h directly, because the square root curve flattens as its argument grows.
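The effect of the square root can be checked numerically. The sketch below compares the same absolute width error on a large and a small box; the pixel values are hypothetical and chosen only for illustration.

```python
import math


def squared_error(pred, truth):
    """Plain squared error on the raw width."""
    return (pred - truth) ** 2


def sqrt_squared_error(pred, truth):
    """Squared error on the square roots of the widths, as in the loss."""
    return (math.sqrt(pred) - math.sqrt(truth)) ** 2


large = (310.0, 300.0)  # predicted and true width of a large box, 10 px off
small = (40.0, 30.0)    # predicted and true width of a small box, 10 px off

# The raw squared error cannot tell the two cases apart.
print(squared_error(*large), squared_error(*small))

# With square roots, the same 10 px error costs more on the small box.
print(sqrt_squared_error(*large), sqrt_squared_error(*small))
```

The raw squared error is 100 in both cases, while the square-root version penalizes the small box several times more heavily than the large one.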

Also (I am not entirely clear on this), a grid cell predicts multiple boxes, and each box is expected to be responsible for predicting specific objects. For each ground-truth box, the predicted box with the highest IOU is made responsible for it. I expect this assignment changes dynamically as the network is updated. The precondition is that the center of the object falls in that grid cell.

In the loss function, the indicator 1_ij^obj is 1 when grid cell i contains an object and its bounding box j is responsible for predicting it, and 1_i^obj indicates whether an object appears in grid cell i.
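As a reference sketch, the sum-of-squares loss from the YOLO paper, to which these indicators belong, can be written as follows. Hatted symbols are predictions, C_i is the confidence, p_i(c) is the probability of class c, and lambda_coord and lambda_noobj are the weights on the box-coordinate and no-object terms.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```

The first two lines are the localization terms, the next two are the confidence terms for boxes with and without a responsible object, and the last line is the classification term, which only counts grid cells that contain an object.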

Training method: stochastic gradient descent is used, together with dropout.

Disadvantages:

(1) Objects that are close together, and small objects that appear in groups, are predicted poorly. This is because all boxes predicted by one grid cell share a single class, and the grid is often too coarse.

(2) Generalization is weak for objects with unusual aspect ratios.

(3) The design of the loss function hurts localization accuracy.