Now let's look at YOLOv2, the second generation of YOLO, which makes many optimizations on top of the first generation. The original version fell short in accuracy, speed, and robustness, and the authors adopted a number of methods to improve on each of these. This article starts with accuracy.
One, More Accurate (Better)
1, Batch Normalization
To understand batch normalization, first look at normalization in neural networks in general. Input data is usually normalized before training starts. Why is normalization needed, and what are its benefits? The learning process of a neural network is essentially learning the distribution of the data. If the training data and the test data follow different distributions, the generalization ability of the network drops sharply. Likewise, if each batch of training data follows a different distribution (as can happen with mini-batch gradient descent), the network has to adapt to a different distribution in every iteration, which greatly slows down training. This is why the data needs to be normalized.
Training a deep network is a complex process: a small change in the early layers is amplified by the layers that follow. Whenever the distribution of a layer's input changes, that layer has to adapt to the new distribution, so if the distributions keep changing during training, training slows down. We usually "whiten" the input data so that it has a mean of 0 and a variance of 1. The same cannot be guaranteed for the inputs of later layers, however, because they change whenever the parameters of the earlier layers are updated. Take the last layer as a bad case: after one mini-batch its parameters have been adjusted to be better than before, but the parameters of all the layers before it have changed too, so in the next round of training the range of its input has shifted and it again struggles to classify correctly.
(What is a mini-batch?) Batch gradient descent requires all samples to take part in every iteration. Large-scale machine learning applications often have training sets with billions of samples, which makes this computationally very expensive. Some researchers therefore pointed out that, since the training set is itself just a sample from the data distribution, each iteration can use only a subset of the training set. This is the mini-batch algorithm. If the training set has m samples and each mini-batch (a subset of the training set) has b samples, the whole training set can be divided into m/b mini-batches.
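As a small illustration of the split described above, here is a minimal sketch in plain Python. The helper name make_mini_batches and the fixed seed are assumptions made for this example only. When m is not a multiple of b, the last mini-batch is simply smaller.

```python
import random


def make_mini_batches(samples, b, seed=0):
    """Shuffle the samples and cut them into mini-batches of size b.

    Hypothetical helper for illustration; real training loops usually
    reshuffle at the start of every epoch.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + b] for i in range(0, len(shuffled), b)]


m, b = 10, 4
batches = make_mini_batches(range(m), b)
print(len(batches))                # 3 mini-batches for m=10, b=4
print([len(x) for x in batches])   # [4, 4, 2]
```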
That is the background to batch normalization. Put plainly, it normalizes the output of each convolutional layer in the network, before the activation function and pooling rather than after them. This raises a problem, though: forcing the output of a layer into a distribution with mean 0 and variance 1 weakens the expressive power of the network. The authors therefore relax the constraint slightly by adding two learnable parameters, β and γ, that shift and scale the normalized data; both the shift parameter β and the scale parameter γ are learned. In the extreme case where γ equals the standard deviation and β equals the mean of the mini-batch, the output of batch normalization is exactly the same as its input; in general, of course, it is different.
The batch normalization paper uses a normalization similar to the z-score: each dimension has its own mean subtracted and is divided by its own standard deviation. Because training uses stochastic gradient descent, the mean and variance can only be computed over the mini-batch of the current iteration, which is why the authors named the algorithm batch normalization. The algorithm is as follows: compute the mean of the mini-batch, compute its variance, normalize each value with them, and finally scale by γ and shift by β.
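The normalization just described can be sketched for a single feature as follows. This is a minimal training-time sketch, not a full layer: the function name batch_norm, the default eps value, and the toy batch are assumptions for illustration, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    eps is a small constant added to the variance for numerical
    stability (its value here is an assumption).
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased variance
    normed = [(v - mean) / math.sqrt(var + eps) for v in values]
    return [gamma * x + beta for x in normed], mean, var


batch = [1.0, 2.0, 3.0, 4.0]
out, mean, var = batch_norm(batch)
print(mean, var)  # 2.5 1.25

# Extreme case from the text: with gamma set to the (eps-adjusted)
# standard deviation and beta set to the mean, the input is recovered.
restored, _, _ = batch_norm(batch, gamma=math.sqrt(var + 1e-5), beta=mean)
```

With the default γ = 1 and β = 0 the output has a mean of about 0 and a variance of about 1; the second call shows how the learned parameters can undo the normalization if that is what training prefers.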
One thing to note: a convolutional layer shares its weights across spatial positions, so the mean and variance of wx+b are computed over the whole feature map. For a layer output of shape batch_size * channel * height * width, each channel obtains one mean and one standard deviation from its batch_size * height * width values, so there are channel groups of parameters in total.
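In other words, each channel is treated like a single neuron whose "batch" is made up of its batch_size * height * width values, after which the batch normalization algorithm for fully connected layers can be applied unchanged.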
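To make the per-channel statistics concrete, here is a minimal sketch that uses nested Python lists in place of a real tensor. The function name channel_stats and the toy data are made up for this example; a real framework would compute the same thing on arrays.

```python
def channel_stats(x):
    """Return one (mean, std) pair per channel.

    x is a nested list shaped [batch][channel][height][width]
    (a stand-in for a real tensor, assumed for illustration).
    """
    num_channels = len(x[0])
    stats = []
    for c in range(num_channels):
        # Pool every value of channel c across the batch and both
        # spatial dimensions: batch_size * height * width values.
        vals = [v for sample in x for row in sample[c] for v in row]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, var ** 0.5))
    return stats


# 2 samples, 2 channels, 2 x 2 feature maps
x = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 10.0], [10.0, 10.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[20.0, 20.0], [20.0, 20.0]]],
]
stats = channel_stats(x)
print(len(stats))  # 2, one (mean, std) pair per channel
```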
2, High Resolution Classifier
Almost all state-of-the-art detection methods use a model (classifier) pre-trained on ImageNet to extract features. Most of these classifiers, AlexNet for example, resize their input images to less than 256 * 256, which gives low resolution and makes detection difficult. YOLO (v1) first trained the classification network at a resolution of 224*224 and then increased the resolution to 448*448 for detection, so the network had to switch to the detection task and adapt to the new resolution at the same time. The authors therefore wanted to raise the resolution already during pre-training, so that the later training only has to switch from classification to detection.
YOLOv2 first fine-tunes the pre-trained classification network at a resolution of 448*448 for 10 epochs on the ImageNet dataset. This gives the network enough time to adjust its filters to high-resolution input. The resulting network is then fine-tuned for detection. This boosts mAP by almost 4%.
3, Convolutional With Anchor Boxes (using preset boxes)
YOLO (v1) predicts bounding boxes from the output of its fully connected layers (the 1470*1 output of the fully connected layer is reshaped into the final 7*7*30 feature map), which loses a lot of spatial information. YOLOv2 borrows the anchor idea from Faster R-CNN. Put simply, a sliding window is run over the convolutional feature map, and at each window center boxes of 9 different sizes and aspect ratios are predicted. Since the network is fully convolutional and no reshape is needed, the spatial information is well preserved, and each feature point of the final feature map corresponds one-to-one to a cell of the original image. Moreover, predicting offsets relative to the anchor boxes instead of absolute coordinates simplifies the problem and makes it easier for the network to learn.
In summary, YOLOv2 removes the fully connected layers (to keep more spatial information) and uses anchor boxes to predict bounding boxes. The specific practices are as follows:
· Removing the final pooling layer ensures that the output convolution feature map has a higher resolution.
· Shrink the network input to a resolution of 416 * 416, so that the resulting convolutional feature map has an odd width and height and therefore a single center cell. The authors observed that large objects usually occupy the middle of the image, so it is better to predict them with one center cell than with the 4 cells around the center; this trick slightly improves efficiency.
· Downsample with the convolutional layers by a factor of 32, so that a 416 * 416 input image yields a 13 * 13 convolutional feature map (416/32=13).
· Decouple the class prediction mechanism from the spatial location (cell), so that each anchor box predicts both the class and the coordinates. In YOLO (v1), each cell is responsible for predicting the class, and the 2 bounding boxes of that cell are responsible for predicting the coordinates (recall the final 7*7*30 output of YOLO: each cell corresponds to 1*1*30, where the first 10 values hold the coordinates and confidences of the 2 bounding boxes, and the remaining 20 give the probabilities that the cell belongs to each of the 20 categories given that it contains an object). In YOLOv2, the class predictions are no longer tied to each cell (spatial location) but are made per anchor box.
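The arithmetic behind the choice of 416 * 416 in the list above can be checked with a short sketch. The function names grid_size and center_cells are made up for this illustration; the 448 case is included only to show what an even-sized grid looks like.

```python
def grid_size(input_size, stride=32):
    """Side length of the feature map after downsampling by `stride`."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


def center_cells(n):
    """(row, col) indices of the cell(s) at the center of an n x n grid."""
    if n % 2 == 1:
        return [(n // 2, n // 2)]          # odd grid: one center cell
    h = n // 2                             # even grid: four cells share it
    return [(h - 1, h - 1), (h - 1, h), (h, h - 1), (h, h)]


for size in (416, 448):
    n = grid_size(size)
    print(size, n, center_cells(n))
# 416 13 [(6, 6)]
# 448 14 [(6, 6), (6, 7), (7, 6), (7, 7)]
```

An input of 416 gives a 13 * 13 grid with exactly one center cell, while 448 would give a 14 * 14 grid whose center falls between four cells.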
One additional note: previously, the features corresponding to each cell were produced by the fully connected layers that follow the convolutional network. Here the feature map comes straight out of the convolutional layers, and each feature point is given k preset boxes (in Faster R-CNN the sizes of the preset boxes were chosen by hand; the method used here is described in the next section). The sizes and positions of these candidate boxes are then refined further.
With anchor boxes added, recall is expected to rise while accuracy may fall. Let's do the calculation: assuming each cell predicts 9 anchor boxes, there are 13 * 13 * 9 = 1521 boxes in total, whereas the previous network predicts only 7 * 7 * 2 = 98 boxes. The reported figures are: without anchor boxes, the model reaches a recall of 81% and an mAP of 69.5%; with anchor boxes, recall is 88% and mAP is 69.2%. Accuracy thus drops only slightly, while recall rises by 7 percentage points, which indicates that there is real room to improve accuracy through further work.
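The box counts quoted above follow directly from grid size times boxes per cell. A tiny sketch, with the hypothetical helper num_boxes and the 9 boxes per cell taken from the assumption in the text:

```python
def num_boxes(grid, boxes_per_cell):
    """Total number of predicted boxes for a grid x grid feature map."""
    return grid * grid * boxes_per_cell


yolo_v1 = num_boxes(7, 2)          # 7 * 7 cells, 2 boxes per cell
with_anchors = num_boxes(13, 9)    # 13 * 13 cells, 9 anchors per cell
print(yolo_v1, with_anchors)       # 98 1521
```

Having roughly fifteen times as many candidate boxes is what makes the higher recall plausible.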
4, Dimension Clusters (dimension clustering)
When using anchor boxes, the authors found that in Faster R-CNN the number of anchor boxes and their width and height dimensions are hand-picked priors. They reasoned that if better, more representative prior box dimensions were chosen from the start, it would be easier for the network to learn to predict accurate locations. Their solution is the k-means clustering method from statistical learning: the ground truth boxes in the dataset are clustered to discover their statistical regularities. The number of clusters k is the number of anchor boxes, and the widths and heights of the k cluster-center boxes are the anchor box dimensions.
If standard k-means with Euclidean distance were used, large boxes would produce more error than small boxes. What we really want, however, are boxes that lead to good IOU scores regardless of box size. The following distance metric is therefore used:
d(box, centroid) = 1 - IOU(box, centroid)
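Here is a minimal sketch of k-means with this distance. Several things in it are assumptions rather than details from the paper: boxes are (width, height) pairs compared as if they shared the same center, centroids are updated with the plain mean of their members, the starting centroids are passed in by hand, and the toy boxes are made up.

```python
def iou_wh(box, centroid):
    """IOU of two (width, height) boxes placed at the same center."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def distance(box, centroid):
    """d(box, centroid) = 1 - IOU(box, centroid)."""
    return 1.0 - iou_wh(box, centroid)


def kmeans_iou(boxes, centroids, iterations=20):
    """Plain k-means on (w, h) pairs using the 1 - IOU distance."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each box goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for box in boxes:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(box, centroids[i]))
            clusters[nearest].append(box)
        # Update step: mean width and height of each cluster
        # (an empty cluster keeps its old centroid).
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append((sum(b[0] for b in members) / len(members),
                                sum(b[1] for b in members) / len(members)))
            else:
                updated.append(old)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Toy ground truth sizes: three tall boxes and three wide boxes.
boxes = [(10, 20), (12, 22), (11, 21), (50, 30), (52, 28), (48, 32)]
anchors = kmeans_iou(boxes, centroids=[boxes[0], boxes[3]])
print(anchors)  # [(11.0, 21.0), (50.0, 30.0)]

# The distance depends on shape overlap, not on absolute size:
print(distance((10, 20), (20, 40)), distance((100, 200), (200, 400)))  # 0.75 0.75
```

The last line shows why this metric suits the task: a small pair and a large pair of boxes with the same relative overlap get the same distance, whereas their Euclidean distances would differ by a factor of ten.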
Above left: as k increases, the IOU also increases (higher recall), but so does the complexity. After balancing complexity against IOU, the authors settle on k = 5. Above right: the centers of the 5 clusters are quite different from the hand-picked boxes, with fewer short, wide boxes and more tall, thin boxes (this is the power of statistical regularities).
The remaining parts are covered in the next article.