Deep Learning Object Detection Series (i): R-CNN
Deep Learning Object Detection Series (ii): SPP-Net
Deep Learning Object Detection Series (iii): Fast R-CNN
Deep Learning Object Detection Series (iv): Faster R-CNN
Deep Learning Object Detection Series (v): R-FCN
In the previous seven articles we introduced the five algorithms of the R-CNN family and two others, YOLO and SSD. Broadly, deep learning object detection methods divide into these two classes: the R-CNN family, which classifies region proposals, and the YOLO/SSD family, which casts detection as regression. Next we introduce the regression-based YOLO2. As the name suggests, YOLO2 is an improvement on YOLO, the second version of the YOLO series, which further improves both detection accuracy and speed.

YOLO2 Structure
The YOLO series is implemented in its own framework called darknet, a pure C framework. Whether for YOLO or YOLO2, the code is darknet; what changes is the configuration file that defines the network structure. First let us look at what the network actually is:
From the figure above we can see that YOLO2 has 32 layers. The structure is fairly conventional, consisting mainly of 3*3 convolutions, 2*2 pooling, and 1*1 convolutions. Besides these three common operations there are also route and reorg layers: route appears at layers 25 and 28, and reorg at layer 27.
Route:
A route layer merges layer outputs. For example, the route at layer 28 joins layers 27 and 24 and passes the result on: layer 27 outputs 13*13*256 and layer 24 outputs 13*13*1024, so concatenating along the third (channel) dimension gives layer 28's output, which is also layer 29's input: 13*13*1280. Similarly, the route at layer 25 points only to layer 16, so nothing is merged; it simply forwards layer 16's output, 26*26*512.
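To make the two route cases concrete, here is a minimal numpy sketch; the HWC layout and the variable names are purely illustrative (darknet itself is C and stores tensors differently):

```python
import numpy as np

# Route with two sources: concatenate along the channel dimension.
layer27_out = np.zeros((13, 13, 256), dtype=np.float32)   # output of layer 27
layer24_out = np.zeros((13, 13, 1024), dtype=np.float32)  # output of layer 24
layer28_out = np.concatenate([layer27_out, layer24_out], axis=-1)
print(layer28_out.shape)  # (13, 13, 1280) -> input of layer 29

# Route with a single source: simply forward that layer's output.
layer16_out = np.zeros((26, 26, 512), dtype=np.float32)
layer25_out = layer16_out  # (26, 26, 512)
```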
Reorg:
Reorg is much like a reshape, but the way it reshapes is novel: it transforms a 26*26*64 output into 13*13*256, since each 26*26*1 slice can become 13*13*4.
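Here is a minimal numpy sketch of this transformation, written as a standard space-to-depth; darknet's actual reorg orders the elements slightly differently, but the shape bookkeeping is the same:

```python
import numpy as np

def reorg(x: np.ndarray, stride: int = 2) -> np.ndarray:
    # Space-to-depth: move each stride*stride spatial block into channels.
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)  # bring the two stride axes next to c
    return x.reshape(h // stride, w // stride, c * stride * stride)

x = np.random.rand(26, 26, 64).astype(np.float32)
print(reorg(x).shape)  # (13, 13, 256)
```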
With that, the input and output of each of YOLO2's 32 layers are clear. Finally, let us focus on the output of layer 30: 13*13*125. Layer 30 is a 1*1 convolution using 125 kernels of size 1*1*1024, so its final output is 13*13*125. Why is the output this shape?
The 13*13 part needs little explanation: it is 169 cells on the feature map, each carrying a 125-channel vector. Predictions are made directly from this layer, so these 169*125 numbers contain all the information YOLO2 needs: bounding boxes, classes, and so on.
The 125 is 25*5, where 5 is the number of anchor (region proposal) boxes per cell. This borrows the anchor idea from Faster R-CNN; if you have followed this series, you will recall that in Faster R-CNN this number is 9, while in YOLO2 it becomes 5.
That leaves the last number, 25, which depends on the number of classes to be detected. YOLO2 predicts 20 classes (the VOC dataset), so 25 decomposes as 20+4+1, where (see the sketch after this list):
20 class scores (probabilities): for the current cell and anchor box, the probability that the object belongs to each class;
4 box numbers, $\delta(t_x)$, $\delta(t_y)$, $t_w$, $t_h$, which will be used to compute the position and size of the bounding box;
1 confidence score, the probability that the predicted bounding box contains a real object.
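As a sketch of how the 13*13*125 tensor decomposes, assuming the class-first ordering of the 20+4+1 description above (darknet's real memory layout may differ, and the names are illustrative):

```python
import numpy as np

num_anchors, num_classes = 5, 20
pred = np.random.rand(13, 13, num_anchors * (num_classes + 5)).astype(np.float32)

# Split the 125 channels into 5 anchors x 25 numbers each.
pred = pred.reshape(13, 13, num_anchors, num_classes + 5)
class_scores = pred[..., :num_classes]               # (13, 13, 5, 20)
box_params = pred[..., num_classes:num_classes + 4]  # 4 box numbers
confidence = pred[..., num_classes + 4]              # (13, 13, 5)

# Single-class detection: 5 * (1 + 4 + 1) = 30 channels.
print(num_anchors * (1 + 4 + 1))  # 30
```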
Therefore, the structure of YOLO2 changes with the number of classes to be detected. If we did single-class detection and kept everything else the same, the final layer's output would be 13*13*(5*(1+4+1)), that is, 13*13*30. By the way, a small aside on CNN structures: when analyzing or designing a network we often get the feeling that an output number can be made to represent whatever we want it to represent, which sometimes even seems unreasonable, and yet in the end it works. The most important reason is that the convolution operation itself has no intrinsic meaning: it only has strong extraction ability without knowing what it extracts, so if we design an appropriate loss function we can assign the output any interpretation, even one that seems to make no sense. A CNN is essentially a very complex, expressive, and potentially powerful function connecting input to output, but whether that function's ability is fully realized depends on many things: the loss function, training techniques, the dataset, and so on.

YOLO2 Bounding Box Calculation
The bounding boxes of YOLO2 are tied to the cells of the final feature map: each cell has five reference boxes, each determined by its anchor together with $\delta(t_x)$, $\delta(t_y)$, $t_w$, $t_h$. Look directly at the following formulas:

$b_x = \delta(t_x) + c_x$
$b_y = \delta(t_y) + c_y$
$b_w = p_w e^{t_w}$
$b_h = p_h e^{t_h}$
We have been talking about anchors all along in this series; in fact an anchor is not abstract at all, but very concrete: it is simply a pair of numbers, a width and a height. In the formulas above, the width is $p_w$ and the height is $p_h$. YOLO2 determines its anchor values from the dataset to be predicted: it gathers statistics on the width/height distribution of the VOC bounding boxes in advance and selects 5 well-fitting pairs as anchors. The paper calls this statistical procedure Dimension Clusters (dimension clustering); it is essentially k-means, with the number of clusters k set to the number of anchor boxes and the widths and heights of the k cluster-center boxes taken as the anchor dimensions. The problem with standard k-means, however, is that large boxes generate larger errors than small boxes even when they lie closer to their cluster centers, so to solve this, dimension clustering redefines the distance metric:
$d(\text{box}, \text{centroid}) = 1 - \text{IoU}(\text{box}, \text{centroid})$
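A minimal sketch of dimension clustering under this distance; the function names and the synthetic (w, h) data are illustrative, not darknet's:

```python
import numpy as np

def iou_wh(box, centroids):
    # IoU between one (w, h) box and k centroid (w, h) pairs, with all
    # boxes anchored at the same corner, as dimension clustering assumes.
    inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def dimension_clusters(boxes, k=5, iters=100, seed=0):
    # k-means on (w, h) pairs with d = 1 - IoU as the distance.
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmin(1 - iou_wh(b, centroids)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):  # guard against an empty cluster
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids

# boxes: ground-truth (w, h) pairs, e.g. in units of 13*13 grid cells.
boxes = (np.abs(np.random.randn(1000, 2)) * 4 + 0.5).astype(np.float32)
print(dimension_clusters(boxes))
```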
The five pairs obtained after running k-means are:
(1.3221, 1.73145)
(3.19275, 4.00944)
(5.05587, 8.09892)
(9.47112, 4.84053)
(11.2364, 10.0071)
In addition, $c_x$ and $c_y$ are the coordinates of the cell on the feature map, and $\delta$ is the logistic sigmoid, so $\delta(t_x)$ and $\delta(t_y)$ are offsets within the cell. The center of the predicted box is therefore $(c_x + \delta(t_x), c_y + \delta(t_y))$, and its width and height are $p_w e^{t_w}$ and $p_h e^{t_h}$, all measured in units of grid cells on the 13*13 feature map.
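Putting the formulas together, decoding one anchor's box at one cell might look like this sketch (names are illustrative; coordinates are in grid-cell units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = sigmoid(tx) + cx  # box center x, in grid-cell units
    by = sigmoid(ty) + cy  # box center y, in grid-cell units
    bw = pw * np.exp(tw)   # box width,  in grid-cell units
    bh = ph * np.exp(th)   # box height, in grid-cell units
    return bx, by, bw, bh

# Example: cell (6, 6) with the first VOC anchor (1.3221, 1.73145).
print(decode_box(0.2, -0.1, 0.3, 0.1, 6, 6, 1.3221, 1.73145))
```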