YOLO principle (reference: https://zhuanlan.zhihu.com/p/24916786?refer=xiaoleimlnote): the network, inspired by GoogLeNet, has 24 convolutional layers + 2 fully connected layers. The convolutional layers are responsible for feature extraction; the fully connected layers perform the classification and bounding-box regression.
Steps to detect: 1. The image is scaled to 448*448 and divided into a 7*7 (S*S) grid of cells. 2. The convolutional layers extract features; each cell predicts B=2 bounding boxes, each with position information (center coordinates, width, height) and a confidence that the box contains an object, plus C=20 class probabilities for the object in that cell. 3. The fully connected layers map the convolutional features to a 7*7*(5*2+20)-dimensional tensor (in general S*S*(5*B+C)), which carries both the class predictions and the bbox regression.
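The output layout above can be sketched in a few lines. This is an illustrative decoding helper, not the paper's code; the field names are assumptions.

```python
# YOLOv1 output tensor layout: S*S grid cells, each holding B boxes
# (x, y, w, h, confidence) followed by C class probabilities.
S, B, C = 7, 2, 20
output_dim = S * S * (5 * B + C)  # 7*7*30 = 1470 values

def decode_cell(cell_vector):
    """Split one grid cell's (5*B + C) values into boxes and class scores."""
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell_vector[5 * b: 5 * b + 5]
        boxes.append({"x": x, "y": y, "w": w, "h": h, "conf": conf})
    class_probs = cell_vector[5 * B:]  # shared by both boxes of the cell
    return boxes, class_probs
```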
Training: first pre-train a classification network on classification data (ImageNet): 20 convolutional layers + average pooling + FC, with 224*224 input. Then transfer the pre-trained weights to the detection network (448*448 input: the 20 pre-trained conv layers + 4 randomly initialized conv layers + 2 FC layers) and train it for detection.
The construction of the loss function is also interesting. To balance the coordinate parameters (8 per cell: 2 boxes * 4 values) against the category parameters (20), weighting coefficients are introduced: the weight on coordinate errors is made clearly larger than the weight on class prediction. Because the majority of bboxes contain no object, pushing their confidences straight to 0 with full weight would cause training to diverge, so the confidence weight is kept at the normal 1 for boxes with an object and reduced to less than 1 (0.5) for boxes without one; the classification weight is set to 1. For each object, only the box with the largest IoU is made responsible for detecting it, so each predictor specializes. For different bbox sizes, the width and height errors are computed on their square roots rather than linearly: the same absolute position offset is not serious for a large bbox but is far too large for a small one, and the square root reflects that.
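The weighting scheme just described can be sketched for a single box. lambda_coord = 5 and lambda_noobj = 0.5 are the paper's values; the per-box dictionary fields and the helper itself are simplifying assumptions for illustration.

```python
import math

LAMBDA_COORD = 5.0   # coordinate errors weighted well above class errors
LAMBDA_NOOBJ = 0.5   # empty boxes penalized lightly to avoid divergence

def box_loss(pred, target, has_object):
    """Sketch of the YOLOv1 per-box loss terms described above."""
    if not has_object:
        # only the confidence is pushed toward 0, with a small weight,
        # so the many object-free boxes do not dominate the loss
        return LAMBDA_NOOBJ * (pred["conf"] - 0.0) ** 2
    coord = ((pred["x"] - target["x"]) ** 2
             + (pred["y"] - target["y"]) ** 2
             # square roots: the same absolute w/h error costs a small
             # box more than a large one
             + (math.sqrt(pred["w"]) - math.sqrt(target["w"])) ** 2
             + (math.sqrt(pred["h"]) - math.sqrt(target["h"])) ** 2)
    conf = (pred["conf"] - target["iou"]) ** 2  # confidence targets the IoU
    return LAMBDA_COORD * coord + conf
```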
At prediction time: each bbox carries a confidence for containing an object (Pr(object)*IOU) and classification information (Pr(class|object)); their product, the class-specific confidence score, encodes both the class probability and how accurately the box fits the object. For each class, the boxes are sorted by this score, the low-scoring ones are removed with a threshold, and then NMS (non-maximum suppression) is performed to obtain the final detection boxes.
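The threshold-then-NMS step can be sketched as follows. The corner-coordinate box format and the threshold values are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Drop low-scoring boxes, then keep highest scores and suppress
    any remaining box that overlaps a kept one too much."""
    order = sorted((i for i, s in enumerate(scores) if s > score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```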
Defects: detection of objects that appear close together, and of small objects, is poor, mainly because each grid cell predicts only two boxes (and only one class). Generalization is also poor when an object appears with a rare aspect ratio or in other unusual conditions. The main source of poor detection is localization error, because the position terms weigh heavily in the construction of the loss function, especially in how boxes of different sizes are handled.
Training details: https://zhuanlan.zhihu.com/p/25045711 The author trained and tested on the Pascal VOC2007 and Pascal VOC2012 datasets: 135 epochs, batch size 64, momentum 0.9, weight decay 0.0005. Learning-rate schedule: in the first epoch the rate is raised slowly from 0.001 to 0.01 (starting directly at a high learning rate makes the model diverge), then held at 0.01 for 75 epochs, at 0.001 for the next 30, and at 0.0001 for the last 30. The author also uses dropout and data augmentation to prevent overfitting: dropout of 0.5; data augmentation includes random scaling, translation, and adjustments to exposure and saturation.
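The schedule above can be written as a simple piecewise function. The linear shape of the warm-up within the first epoch, and the exact epoch boundaries, are assumptions; the paper only says the rate is raised slowly.

```python
def learning_rate(epoch):
    """YOLOv1 learning-rate schedule as described in the notes above.

    epoch is a float; values in [0, 1) are the warm-up epoch."""
    if epoch < 1:
        # assumed linear warm-up from 1e-3 to 1e-2 over the first epoch
        return 1e-3 + (1e-2 - 1e-3) * epoch
    if epoch < 76:    # ~75 epochs at 0.01
        return 1e-2
    if epoch < 106:   # ~30 epochs at 0.001
        return 1e-3
    return 1e-4       # remaining epochs at 0.0001
```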
YOLOv2/YOLO9000: improvements on the basis of YOLO, mainly in localization accuracy. There are roughly ten improvements, plus joint training and hierarchical classification.
Batch Normalization: batch normalization accelerates convergence.
High Resolution Classifier: the classifier is fine-tuned on higher-resolution image data (224 -> 448) before detection training.
Convolutional with Anchor Boxes: the author removes YOLO's fully connected layers and predicts bounding boxes from a fixed set of anchor boxes. First, a pooling layer is removed to raise the output resolution of the convolutional layers. Then the network input is changed from 448x448 to 416x416, so that the feature map has a single center cell; objects (especially large ones) tend to appear near the image center. With anchor boxes, each image predicts more than a thousand boxes instead of YOLO's 98, which raises recall considerably.
Dimension Clusters: anchor boxes would otherwise need to be chosen by hand, and well-chosen priors accelerate convergence, so k-means clustering over the training-set boxes is used to select the best initial boxes automatically.
Direct location prediction: the predicted box center is constrained relative to its grid cell, which guarantees and accelerates the convergence of the anchor-box positions.
Fine-Grained Features: finer-grained features improve detection of small targets. The authors add a passthrough layer to the network that, similar in spirit to ResNet shortcuts, combines high-resolution features with the low-resolution ones.
Multi-Scale Training: during training, the model's input size is changed every few rounds to make it robust to images of different sizes. Every 10 batches, a new input size is drawn at random from {320, 352, ..., 608} (multiples of 32, because the network's downsampling factor is 32). This training rule forces the model to adapt to different input resolutions.
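The Dimension Clusters idea can be sketched as k-means over ground-truth (width, height) pairs with d(box, centroid) = 1 - IoU as the distance, so large boxes are not favored the way Euclidean distance would favor them. This simplified loop is an assumption about the procedure, not the authors' exact code.

```python
import random

def wh_iou(a, b):
    """IoU of two boxes that share the same center, given as (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchor priors using 1 - IoU distance."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # maximizing IoU == minimizing the 1 - IoU distance
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]  # keep an empty cluster's old centroid
            for i, c in enumerate(clusters)
        ]
    return centroids
```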
The model runs faster on small inputs, so YOLOv2 can trade off speed and accuracy as required. At low resolution (288x288), YOLOv2 runs at up to 90 FPS with accuracy comparable to Fast R-CNN. At high resolution, YOLOv2 reaches state-of-the-art accuracy on the VOC2007 dataset (78.6 mAP).
darknet: the three most important struct definitions are network_state, network, and layer (in the new version, network_state has been merged into network). When reading the code, the GPU parts can be ignored at first. The different kinds of network layers are defined through the function pointers forward, backward, and update inside layer, which set the execution rules for that layer type. For example, the connected layer has the three methods forward_connected_layer, backward_connected_layer, and update_connected_layer, and the GRU layer and the others follow the same pattern. Atomic operations live only in blas.c and gemm.c; network operations are in network.c, the most important being train_network_datum, train_networks, train_network_batch, and network_predict. train_network_datum takes its input data as a float_pair, i.e. a (float *x, float *y) pair.
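The function-pointer dispatch described above can be illustrated with a Python analogue: each layer carries its own forward/backward/update callables, the way the C struct layer holds pointers such as forward_connected_layer, and the network loop just invokes whatever rule each layer brought. Names here are illustrative, not darknet's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class Layer:
    """Analogue of darknet's layer struct: behavior lives in callables."""
    forward: Callable[["Layer", Any], Any]
    backward: Callable[["Layer", Any], Any]
    update: Callable[["Layer", float], None]
    state: Dict[str, Any] = field(default_factory=dict)

def forward_network(layers, x):
    """network.c-style loop: each layer's own forward rule decides what runs."""
    for layer in layers:
        x = layer.forward(layer, x)
    return x
```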
YOLO Algorithm Learning