Paper Reading Notes: YOLO9000: Better, Faster, Stronger


This article mainly covers the following content:

Paper address
Reference blog

Contents: Review of YOLOv1; Main Ideas; Network Improvement: Better; Experimental Results; Network Improvement: Faster (Darknet-19); Network Improvement: Stronger

Review of YOLOv1

Given an input image, YOLOv1 first divides it into a 7*7 grid. Each grid cell predicts 2 bounding boxes (each box carries 5 predictions: x, y, w, h and a confidence) and 20 class probabilities, so the total output is a 7*7*(2*5+20) = 1470-value tensor. This yields 7*7*2 = 98 candidate windows; windows with low confidence are removed by thresholding, and redundant windows are removed by NMS.
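As a concrete reference for that last step, here is a minimal NumPy sketch of greedy non-maximum suppression; the box format and the 0.5 IoU threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) confidence scores
    Returns the indices of the boxes that survive.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the kept box too much
        order = order[1:][iou < iou_thresh]
    return keep
```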

YOLOv1 uses an end-to-end regression approach: there is no region proposal step, and box locations and class labels are regressed directly. For various reasons YOLOv1 is not very accurate at localizing objects, which directly limits its detection precision.

Main Ideas

YOLOv2: represents the most advanced object detection of its time; it is faster than other detection systems (Faster R-CNN, ResNet-based detectors, SSD), and users can trade off between its speed and accuracy.
YOLO9000: this network can detect more than 9000 kinds of objects in real time, thanks to its use of WordTree, which mixes the detection dataset with the classification dataset (classification information is learned from the ImageNet classification dataset, object localization is learned from the COCO detection dataset).
A new joint training algorithm is proposed to train on ImageNet and COCO at the same time. The basic idea: train the object detector simultaneously on the detection dataset and the classification dataset, using the detection data to learn the exact location of objects and the classification data to increase the number of categories and improve robustness.

Network Improvement: Better

YOLOv1 has several drawbacks, and the author's goals for improvement are: raise recall and improve localization accuracy while maintaining classification accuracy.
Batch Normalization

The network is optimized with batch normalization, which improves convergence and removes the need for other forms of regularization. Adding batch normalization to every convolutional layer in YOLO improves mAP by 2% while also regularizing the model, so dropout can be removed without overfitting.
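For illustration, a minimal PyTorch-style sketch of the conv + batch norm pattern described here; the channel sizes are placeholders and the 0.1 LeakyReLU slope follows the usual Darknet convention, so treat the exact layers as an example rather than the paper's network.

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size):
    """Conv -> BatchNorm -> LeakyReLU block: the pattern used in place of
    conv + dropout. Channel counts here are placeholders."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),  # BN supplies the bias
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# e.g. a Darknet-style stem, sizes assumed for illustration
stem = nn.Sequential(
    conv_bn_leaky(3, 32, 3),
    nn.MaxPool2d(2, 2),
    conv_bn_leaky(32, 64, 3),
)
```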

High Resolution Classifier

The classification network of YOLOv2 is first fine-tuned on ImageNet at the detection resolution of 448*448 before the network is switched over to detection training. Using this high-resolution classifier raises mAP by about 4%.

Convolution with Anchor Boxes

YOLOv1 uses fully connected layers to predict the coordinates of bounding boxes directly. The Faster R-CNN approach instead uses only convolutional layers plus a Region Proposal Network to predict offsets and confidences relative to anchor boxes, rather than predicting coordinates directly.
YOLOv2 therefore removes the fully connected layers and one pooling layer, and shrinks the input to 416*416 so that the resulting feature map is 13*13.
Using anchor boxes costs a small amount of accuracy, but it lets YOLO predict more than a thousand boxes per image: recall rises to 88% while mAP is 69.2%.

Dimension Clusters

In Faster R-CNN the sizes of the anchor boxes are picked by hand, so there is room for optimization. To improve on this, k-means clustering is run on the bounding boxes of the training set to find better priors.
With standard k-means and Euclidean distance, larger boxes produce larger errors than smaller boxes, so IoU is used as the distance measure instead: d(box, centroid) = 1 - IoU(box, centroid). The final number of clusters is set to k = 5.
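As an illustration, here is a rough NumPy sketch of this clustering step, assuming the training boxes are given as (width, height) pairs; the function names and the random initialization are my own, and only the 1 - IoU distance and k = 5 come from the paper.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, treating all boxes as if they shared a center."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """Cluster training-set box sizes with distance d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([
            boxes[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return centroids  # k prior (width, height) pairs
```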
Direct Location Prediction

YOLOv2 obtains the final box coordinates by predicting offsets relative to the anchor; the concrete method is as follows. The network predicts 5 bounding boxes in each grid cell, and each bounding box carries five values tx, ty, tw, th, to. If the grid cell is offset from the top-left corner of the image by (cx, cy) and the bounding box prior has width and height pw, ph, the box is decoded as bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw * exp(tw), bh = ph * exp(th), with sigmoid(to) giving the confidence.
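A minimal NumPy sketch of this decoding; the function and variable names are mine, the formulas are the ones above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    """Decode one predicted box using the YOLOv2 parameterisation:
        bx = sigmoid(tx) + cx        (cx, cy: offset of the grid cell)
        by = sigmoid(ty) + cy
        bw = pw * exp(tw)            (pw, ph: prior/anchor width and height)
        bh = ph * exp(th)
        confidence = sigmoid(to)
    Coordinates are in grid-cell units (multiply by 32 for pixels at stride 32)."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    conf = sigmoid(to)
    return bx, by, bw, bh, conf
```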
Fine-grained Features

The modified YOLO's feature map is 13*13. A passthrough layer is added to bring in features from an earlier layer at 26*26 resolution. The passthrough layer combines high-resolution features with the low-resolution ones by stacking adjacent spatial positions into separate channels, turning a 26*26*512 feature map into 13*13*2048. The detector sits on top of this expanded feature map, so it has access to fine-grained features, which improves YOLO's performance by about 1%.
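A minimal NumPy sketch of this space-to-depth reshaping; the real passthrough layer also concatenates the result with the 13*13 features, which is omitted here.

```python
import numpy as np

def passthrough(x, stride=2):
    """Space-to-depth: stack each stride x stride spatial block into channels.
    x has shape (H, W, C); the result has shape (H//stride, W//stride, C*stride*stride)."""
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/s, W/s, s, s, C)
    return x.reshape(h // stride, w // stride, c * stride * stride)

features = np.zeros((26, 26, 512), dtype=np.float32)
print(passthrough(features).shape)          # (13, 13, 2048)
```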

Multi-Scale Training

YOLOv2 changes the input size periodically during training. Every 10 batches the network randomly selects a new image size; because the downsampling factor is 32, the sizes are drawn from multiples of 32, {320, 352, ..., 608}, i.e. from 320*320 up to 608*608. The network resizes itself and training simply continues.
This policy lets the network predict well over a range of input dimensions, so the same network can run detection at different resolutions: smaller inputs run faster, larger inputs are more accurate, which gives a trade-off between YOLOv2's speed and accuracy.
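A tiny sketch of this schedule; how the resize is applied to the network and data loader is left out, and only the size set and the 10-batch interval come from the paper.

```python
import random

# Multiples of 32 from 320 to 608, as in the multi-scale training schedule.
SIZES = list(range(320, 608 + 1, 32))   # [320, 352, ..., 608]

def pick_input_size(batch_idx, current_size):
    """Every 10 batches, randomly pick a new input resolution."""
    if batch_idx % 10 == 0:
        return random.choice(SIZES)
    return current_size
```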
Experimental Results

  
Network Improvement: Faster (Darknet-19)

YOLOv2 is built on a new classification model, somewhat similar to VGG. It mostly uses 3*3 filters and doubles the number of channels after each pooling step. It uses global average pooling for predictions, and batch normalization to stabilize training, speed up convergence, and regularize the model.
The final model, Darknet-19, has 19 convolutional layers and 5 max-pooling layers. Processing an image requires only 5.58 billion operations, yet it reaches 72.9% top-1 and 91.2% top-5 accuracy on ImageNet.

For detection, the network removes the last convolutional layer and adds three 3*3 convolutional layers with 1024 filters each, followed by a final 1*1 convolutional layer with the number of outputs needed for detection.
For VOC data, the network predicts 5 bounding boxes per grid cell, each with 5 coordinates and 20 class probabilities, so the final layer has 5 * (5 + 20) = 125 filters; a passthrough layer is also added to bring in fine-grained information from an earlier layer, as in the sketch below.
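A minimal PyTorch-style sketch of this detection head; batch normalization and the passthrough concatenation are omitted, and the exact layer arrangement here is an illustrative assumption.

```python
import torch.nn as nn

NUM_ANCHORS = 5
NUM_CLASSES = 20                                 # VOC
OUT_FILTERS = NUM_ANCHORS * (5 + NUM_CLASSES)    # 5 coords/conf + 20 classes = 125

# Three 3x3 conv layers with 1024 filters, then a 1x1 conv producing the
# detection outputs, as described in the text above.
detection_head = nn.Sequential(
    nn.Conv2d(1024, 1024, 3, padding=1),
    nn.LeakyReLU(0.1),
    nn.Conv2d(1024, 1024, 3, padding=1),
    nn.LeakyReLU(0.1),
    nn.Conv2d(1024, 1024, 3, padding=1),
    nn.LeakyReLU(0.1),
    nn.Conv2d(1024, OUT_FILTERS, 1),   # 13x13x125 output for a 416x416 input
)
```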

Network Improvement: Stronger

Using conditional probabilities, a hierarchical tree called WordTree is built, and the category set is expanded accordingly. When constructing WordTree, a softmax is applied only over the synsets that share the same parent concept.
The advantage of this approach is that performance degrades gracefully on unknown or new kinds of objects. For example, given a picture of a dog of an unknown breed, the model still predicts "dog" with high confidence, while breed hyponyms such as "husky", "bull terrier", or "golden retriever" receive lower confidence.
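A toy sketch of how conditional probabilities multiply along a WordTree path; the tree fragment and the probability values here are made up for illustration.

```python
# Each node stores P(node | parent), produced by a softmax over its siblings.
# Hypothetical fragment of a WordTree-like hierarchy.
cond_prob = {
    "physical object": 1.0,
    "animal": 0.9,        # P(animal | physical object)
    "dog": 0.8,           # P(dog | animal)
    "husky": 0.05,        # P(husky | dog)
}
parent = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "husky": "dog",
}

def absolute_prob(node):
    """Multiply conditionals up the tree: P(node) = P(node | parent) * P(parent)."""
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p

print(absolute_prob("dog"))    # 0.72  -> confident it is a dog
print(absolute_prob("husky"))  # 0.036 -> much less confident about the breed
```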
  
  

With this joint training method, YOLO9000 uses the COCO detection dataset to learn how to locate objects in an image, and the ImageNet classification dataset to learn how to classify among a large number of categories.
During training, WordTree is used to mix the COCO detection dataset with the top 9000 classes from ImageNet; the combined dataset corresponds to a WordTree with 9,418 classes. Because ImageNet is far larger, the author oversamples the COCO data to balance the two sources, so that the ratio of COCO to ImageNet data is about 1:4.
When the network sees an image from the detection dataset, it backpropagates the full loss as usual; when it sees an image from the classification dataset, only the classification part of the loss is backpropagated. For classification images, the author also assumes that the predicted box responsible for the label overlaps the (unlabeled) ground truth by at least 0.3 IoU, and backpropagates the objectness loss under that assumption.
