YOLO9000 (YOLO v2), a state-of-the-art, real-time object detection system

Source: Internet
Author: User

Object detection has developed rapidly over the last two years: from R-CNN and Fast R-CNN toward the near-real-time Faster R-CNN, then on to the truly real-time YOLO and SSD. Each generation has been faster (FPS) and stronger (mAP) than the last. Today's topic is the system that is currently better, faster, and stronger than the rest: YOLO9000 (and v2).


YOLO v1 is a real-time object detection system that reaches 45 fps. Its high processing speed comes from a simplified framework: compared with Faster R-CNN, YOLO v1 first removes the RPN and regresses bounding boxes directly from the network output; second, it uses a simplified convolutional backbone (faster than VGG16) to cut detection time. Combining the two, it achieves real-time detection directly on 448x448 images. The network structure is shown in the following figure.


YOLO v1 improves speed through the methods above, but its 63.4 mAP is lower than Faster R-CNN's, because removing the RPN abandons the anchor-box mechanism, and YOLO's customized convolutional architecture does not perform as well as VGG. SSD makes up for YOLO's weaknesses: it combines YOLO v1's end-to-end idea with Faster R-CNN's anchor-box mechanism while still using the VGG16 backbone, balancing speed and accuracy, and reaches 73.1 mAP at 58 fps with only a 300x300 input. On the speed side, SSD's anchor boxes are not generated by a separate network but are placed directly on the feature maps; SSD also cuts out time-consuming computations such as YOLO's 4096-node FC layer, replacing them with 1x1 conv layers, and the 300x300 input further reduces detection time. Because SSD combines predictions from multiple feature maps, it achieves high detection accuracy even at this small input scale. The network structure of SSD is shown in the following figure.


OK, now for today's protagonist: YOLO9000 (and v2). YOLO v2 achieves 76.8 mAP at 67 fps and 78.6 mAP at 40 fps, clearly outperforming SSD. More than that, by combining different datasets (ImageNet and COCO) during training, YOLO9000 can detect more than 9,000 object classes, and, more importantly, it still runs in real time.

YOLO v2 is full of useful material: common tricks such as batch normalization, and fresh ones such as dimension clustering. YOLO9000's WordTree mechanism for combining different datasets is also a major innovation (though I have no immediate use for it myself). Current detection systems tend toward complexity (larger and deeper networks), but YOLO v2 simplifies the model while pooling a variety of ideas, improving both speed and accuracy. The novel ideas are listed in the table below; almost every item contributes one to two points of mAP.


1. Better:

Batch Normalization: the importance of batch normalization is self-evident. It accelerates convergence, helps prevent overfitting, and can partially replace dropout, simplifying the model. It gives YOLO a 2% accuracy increase.
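As a reminder of what batch normalization actually computes, here is a minimal NumPy sketch of the forward pass (inference-time running statistics and the backward pass are omitted; `gamma` and `beta` are the learned scale and shift):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift.
    x: (batch, features); gamma, beta: per-feature learned parameters."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 16))   # shifted, scaled activations
y = batch_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=0).max(), y.var(axis=0).max())    # both close to 0 and 1
```

In a convolutional network the statistics are computed per channel over batch and spatial axes, but the normalize-scale-shift structure is the same.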

High Resolution Classifier: although YOLO v2 adopts the same 448x448 input as v1, the training procedure differs. V2 first fine-tunes the classification network on ImageNet at full resolution for a few epochs and then fine-tunes the resulting network for detection, giving the network time to adjust its filters so that it runs well on high-resolution input. The high resolution classifier gives YOLO v2 a 4% improvement, a very important gain.

Convolutional With Anchor Boxes: YOLO v1 and Faster R-CNN predict bounding boxes differently. YOLO v1 uses fully connected layers to predict coordinates directly, while Faster R-CNN predicts offsets relative to anchor boxes. YOLO v2 also adopts the anchor-box mechanism and removes the fully connected layers, making training simpler. Using anchor boxes requires changing the input from the original 448x448 to 416x416, so that the feature map has odd dimensions and therefore a single center cell; an even-sized map would place the center between cells, which is undesirable. Note that with anchor boxes the accuracy drops slightly, but recall rises greatly, from 81% to 88%, which means the model has much more room for improvement via the tricks that follow.
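The 416x416 choice follows directly from the network's total downsampling factor. A quick check of the arithmetic:

```python
# Darknet-19 downsamples by a total factor of 32 (five 2x max-pools: 2**5).
stride = 32
for size in (448, 416):
    cells = size // stride
    center = "single center cell" if cells % 2 == 1 else "no single center cell"
    print(f"{size} -> {cells}x{cells} grid, {center}")
```

So a 448 input gives an even 14x14 grid, while 416 gives the odd 13x13 grid with one unambiguous center cell.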

Dimension Clusters: once anchor boxes are adopted, the box dimensions must be decided, and hand-picking them is weak (no offense to Faster R-CNN, which does exactly that); in the age of machine learning there is no point in guessing by hand. YOLO v2 runs the k-means algorithm on the VOC and COCO datasets to cluster box dimensions automatically, as shown in the figure below. More clusters give more accurate priors, but considering model complexity, five box dimensions is the most suitable trade-off. The right-hand plot also shows that VOC and COCO favor different box dimensions; a very intuitive and accurate approach.
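A minimal sketch of this clustering, assuming the paper's distance metric d(box, centroid) = 1 - IoU, where IoU between two (w, h) pairs is computed as if the boxes shared a common center (the toy box data here is made up for illustration):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, computed as if every box shared one center."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    areas = boxes[:, 0] * boxes[:, 1]
    c_areas = centroids[:, 0] * centroids[:, 1]
    return inter / (areas[:, None] + c_areas[None, :] - inter)

def kmeans_anchors(boxes, k=5, iters=50, seed=0):
    """Cluster (w, h) pairs under d = 1 - IoU; returns k anchor priors."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # nearest = highest IoU
        for j in range(k):
            members = boxes[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)           # recenter on members
    return centroids

rng = np.random.default_rng(1)
boxes = rng.uniform(0.5, 12.0, size=(300, 2))   # toy (w, h) pairs
anchors = kmeans_anchors(boxes, k=5)
print(anchors.shape)   # (5, 2): five anchor priors
```

Using 1 - IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which is the paper's stated motivation.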


Direct Location Prediction: adopting Faster R-CNN's anchor-box mechanism brings another problem when predicting box offsets: training instability that takes a long time to converge. The cause is that computing coordinates from unconstrained offsets lets a predicted box land anywhere in the image. So YOLO v2 instead predicts coordinates relative to the grid cell, constraining the predicted values to [0, 1] with a logistic (sigmoid) activation, as shown in the following figure. This constrained parameterization is much easier to learn and, combined with the dimension-clustered anchor boxes, improves mAP by about 5%.


Fine-Grained Features: YOLO v2's final feature map is 13x13, which is sufficient for detecting large objects, but small objects need finer-grained features. Faster R-CNN and SSD simply make predictions from feature maps of different sizes; YOLO v2 takes a different approach, stacking the earlier 26x26 feature map onto the 13x13 one. The 26x26x512 feature map is reshaped into 13x13x2048 and concatenated, and this change raises accuracy by about 1%.
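The 26x26x512 to 13x13x2048 reshape is a space-to-depth rearrangement: each 2x2 spatial block of the fine feature map is moved into the channel dimension. A NumPy sketch:

```python
import numpy as np

def passthrough(x, stride=2):
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride*stride),
    moving each stride x stride spatial block into channels."""
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)     # group the 2x2 block next to channels
    return x.reshape(h // stride, w // stride, c * stride * stride)

fine = np.random.randn(26, 26, 512)
print(passthrough(fine).shape)   # (13, 13, 2048)
```

The result can then be concatenated with the coarse 13x13 features along the channel axis, so the detector sees both resolutions.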

Multi-Scale Training: YOLO initially used a 448x448 input, which becomes 416x416 after adding anchor boxes. But since YOLO v2 contains only convolutional and pooling layers, it can be trained at varying scales, and to make the model more robust YOLO v2 does multi-scale training: every ten batches, a new input dimension is randomly chosen from {320, 352, ..., 608}. This method lets YOLO v2 accommodate many input sizes while satisfying the need for speed. Multi-scale training also lets YOLO v2 make a balanced selection of speed and accuracy: with a 288x288 input, YOLO can reach 90 fps, suitable for smaller GPUs (e.g. embedded systems), real-time monitoring, and high-frame-rate video detection. The table below compares the accuracy and speed of YOLO v2, showing YOLO's speed advantage.
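The scale schedule is simple to state in code. The candidate sizes are the multiples of 32 (the network stride) between 320 and 608, and a new one is drawn every ten batches (the training step itself is elided here):

```python
import random

# Input dimensions must be multiples of the network stride (32).
SIZES = list(range(320, 609, 32))     # [320, 352, ..., 608]

random.seed(0)
input_size = random.choice(SIZES)
for batch in range(1, 101):
    # ... resize the network and train one batch at input_size x input_size ...
    if batch % 10 == 0:               # every 10 batches, draw a new scale
        input_size = random.choice(SIZES)
print(len(SIZES), min(SIZES), max(SIZES))
```

Because the network is fully convolutional, resizing the input only changes the output grid size, so no weights need to change between scales.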


2. Faster

YOLO v2 pursues not only accuracy but, more importantly, speed: robot control and autonomous driving must rely on low-latency systems. Although VGG16 is a stable and accurate model, it is somewhat too complex: on a 224x224 input, VGG16 requires 30.69 billion floating-point operations, while the YOLO model based on the GoogLeNet architecture needs only 8.52 billion, many times faster (though top-5 accuracy drops a little, 88.0% vs 90.0%).

Darknet-19: YOLO v2's new model inherits a consistent set of design ideas: use 3x3 filters like VGG16 and double the number of channels after each pooling step; make final predictions with global average pooling; use 1x1 filters to compress the feature maps between 3x3 convolutions; and use batch normalization to accelerate convergence and regularize the model. The final model is called Darknet-19 (an imposing name, mainly because it is built on the darknet library). Darknet-19 has 19 convolutional layers and 5 max-pooling layers and needs only 5.58 billion operations to process one image.

Training for Classification: YOLO v2 is trained on the 1000-class ImageNet with the parameters described in the paper; the training methods are standard, so I won't go into them here. As above, training is initialized at 224x224 and then fine-tuned at 448x448.

Training for Detection: for detection, because anchor boxes are added, the network's output is modified: the last convolutional layer is removed and replaced with 3x3x1024 convolutional layers, plus the passthrough layer described earlier, finally outputting the predicted boxes. The training strategy follows Faster R-CNN and SSD.

3. Stronger

Time for stronger YOLO9000's debut. Why stronger? Not only because it can recognize 9,000 categories; the key point is that it builds a training mechanism for mixing and exploiting multiple datasets. Under this mechanism, when YOLO encounters classification data during training it backpropagates only the classification error, and when it encounters detection data it backpropagates both the classification and regression errors. More importantly, it builds WordTree, organizing all categories into a tree: a hierarchical classification mechanism, and a very clever one.

Hierarchical Classification: ImageNet's category labels are drawn from WordNet, which has a graph structure: a dog, for example, is both a mammal and a domestic animal. That structure is not directly usable for category labels, so when building WordTree the graph is simplified. Words with only a single path to the root are added to WordTree first; then, bit by bit, when a word has two paths to the root, the path that adds fewer nodes is chosen.

The final WordTree is in fact a conditional probability model: to compute the absolute probability of a node, simply multiply the conditional probabilities of all nodes along its path to the root:
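The paper's example computes P(Norfolk terrier) as a product of conditionals down the tree. A sketch with a hypothetical, made-up tiny tree (node names and probabilities are illustrative only):

```python
# Each node maps to (parent, P(node | parent)); the root's conditional is 1.
tree = {
    "physical object": (None, 1.0),
    "animal":          ("physical object", 0.6),
    "dog":             ("animal", 0.5),
    "terrier":         ("dog", 0.3),
    "Norfolk terrier": ("terrier", 0.4),
}

def absolute_prob(node):
    """Multiply conditional probabilities along the path back to the root."""
    p = 1.0
    while node is not None:
        parent, cond = tree[node]
        p *= cond
        node = parent
    return p

print(absolute_prob("Norfolk terrier"))   # 0.6 * 0.5 * 0.3 * 0.4 ≈ 0.036
```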


Since ImageNet has 1,000 categories, building its WordTree requires adding intermediate path nodes, which expands the label set to 1,369 categories. In the training phase, each picture therefore carries several labels up the tree: a husky, for example, is labeled both "dog" and "animal". Unlike most ImageNet models, YOLO9000's classifier applies a separate softmax over each group of sibling sub-categories in WordTree, as shown in the following figure:
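A minimal sketch of that per-group softmax, assuming we have the logit indices for each sibling group (the two groups here are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

# Hypothetical sibling groups: indices into the flat logit vector.
groups = {"root_children": [0, 1], "dog_breeds": [2, 3, 4]}
logits = np.array([1.0, 2.0, 0.5, 0.1, 3.0])

probs = np.empty_like(logits)
for idx in groups.values():
    probs[idx] = softmax(logits[idx])  # one softmax per sibling group
print(probs)
```

Each group of siblings sums to 1 on its own, so the outputs are exactly the conditional probabilities P(child | parent) that the WordTree product formula consumes.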


Training the network with WordTree brings benefits on uncertain objects: if the network is not sure whether it sees a Golden Retriever or a Husky, it will still tell you that the object is definitely a dog. The method also applies to detection: a detector cannot pretend every image region contains an object, so it must explicitly compute the objectness score along with the bounding box and conditional probabilities. At prediction time, the tree is traversed from the top down, following the highest-confidence branch at each split, and the deepest node above the threshold is output.

Dataset Combination With WordTree: WordTree is built precisely so that multiple datasets can be trained on at the same time, here mainly ImageNet and COCO; the illustration below shows how these two datasets are combined. WordTree is very flexible and can be used to merge in many other datasets.


Joint Classification and Detection: YOLO9000's classes are the top 9,000 categories selected from ImageNet, expanded to 9,418 categories according to WordTree. Because ImageNet is much larger than COCO, COCO is oversampled during training to maintain roughly a 4:1 ratio. YOLO9000 also uses the YOLO v2 architecture, but with only 3 anchor priors instead of 5, reducing the output size.

During training, a detection image is backpropagated normally. For a classification image, only the classification error is propagated, and only at the label's level or above: for the label "dog", errors are not propagated down to "Husky" or "Golden Retriever", because that information is not available. When propagating classification error, only the bounding box with the highest predicted probability for that class is used, and it is assumed to overlap the (unknown) ground truth by at least 0.3 IoU.

After training, YOLO9000 achieves 19.7 mAP overall, and 16.0 mAP on the 156 categories for which it never saw detection data. That is higher than DPM, and YOLO9000 has far more advanced features: it is trained in a partly supervised manner on different datasets while detecting 9,000 object categories, and it is guaranteed to run in real time.

Although YOLO9000 recognizes animals well, its recognition rate for categories like sunglasses or swimming trunks is very low. This comes down to the datasets: COCO has almost no bounding boxes for clothing classes.

4. Summary

YOLO v2 and YOLO9000 are both real-time detection systems. YOLO v2 achieves state-of-the-art results and can trade off between speed and accuracy. YOLO9000 does real-time detection over 9,000 categories and, through WordTree, can train across most datasets, a major step forward for multi-dataset training.

Thinking:

1. Simplifying the model can greatly improve detection speed. With better-optimized algorithms and dedicated chips in the future, speed will increase further, and real-time performance will not be a problem at all.

2. The common tricks of deep learning, especially those verified by the masters, are genuinely useful; be good at combining the various tricks to improve performance bit by bit.

3. Many existing models may be too complex, VGG16 for example. Techniques such as batch normalization have gradually improved model expressiveness, so an appropriately simplified and optimized model can already meet the needs of some applications.

4. There is still much to be done on training methods, and parameter tuning has a great influence on training results.
