1. R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation
Technical route: Selective search + CNN + SVMs
STEP1: Candidate Box extraction (selective search)
Training: Given an image, the selective search method extracts about 2000 candidate boxes (proposals). Because the candidate boxes have different sizes and the subsequent CNN requires a fixed input size, all 2000 candidate boxes are resized (warped) to 227*227 resolution (to avoid severe image distortion, some tricks, such as padding with surrounding context before warping, can be applied).
Test: Same as training: selective search extracts about 2000 candidate boxes, which are then warped to 227*227 before being fed to the CNN.
STEP2: Feature extraction (CNN)
Training: The CNN used for feature extraction must be trained in advance. When training this CNN, the labeling requirements for the training data are relatively lenient: even when a proposal produced by selective search covers only part of a target, the proposal is still labeled with that object category. The main reason is that CNN training requires a large amount of data; if the labeling were extremely strict (i.e. only regions that contain the target almost exactly, with at most a tiny fraction of non-target area, counted as positives), there would be far too few samples to train the CNN. Consequently, a CNN trained under such loose labeling is only used for feature extraction, not as the final classifier.
Test: After the proposals have been warped to the uniform 227*227 resolution, they are passed through the trained CNN, and the output of the last fully connected layer, a 4096-dimensional vector, is taken as the feature for the subsequent steps.
STEP3: Classifier (SVMs)
Training: All proposals are labeled strictly (it can be understood as: a candidate box is labeled as the target if and only if it completely contains the ground-truth region and the part not belonging to the ground truth does not exceed, e.g., 5% of the candidate box area; otherwise it is labeled as background). Then the features of all proposals, together with these new SVM labels, are fed into the SVM classifiers to train the prediction model.
Test: For a test image, CNN features are extracted for the 2000 proposals and fed into the trained SVM classifiers, which output a score for each specific category.
Result generation: After obtaining the SVM scores for all proposals, proposals with low scores are removed; the remaining proposals may still overlap each other. Non-maximum suppression (NMS) is then applied: among overlapping boxes, keep the candidate box that best represents the final detection result (for the non-maximum suppression method, see: http://blog.csdn.net/pb09013037/article/details/45477591).
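For clarity, here is a minimal NumPy sketch of the greedy non-maximum suppression step described above; the (x1, y1, x2, y2) box format and the IoU threshold value are illustrative assumptions, not the exact settings used in the R-CNN paper.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) array of SVM scores."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        iou = (w * h) / (areas[i] + areas[order[1:]] - w * h)
        # discard boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep
```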
R-CNN needs one forward CNN pass for every proposal obtained by selective search, so the computational cost is high and the method cannot run in real time. In addition, because of the fully connected layers, every proposal must be resized to the same fixed scale, which distorts the image to some extent and affects the final result.
2. SPP-net: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
The pipelines of a traditional CNN and of SPP-net are shown in the figure (from http://www.image-net.org/challenges/LSVRC/2014/slides/sppnet_ilsvrc2014.pdf).
Spp-net has the following characteristics:
1. In a traditional CNN, the convolutional layers place no requirement on the input image size, but the fully connected layers require a fixed input size. Therefore, in R-CNN, the differently sized proposals produced by selective search must be cropped or warped to a uniform size before CNN features can be extracted. In contrast, SPP-net adds an SPP (spatial pyramid pooling) layer between the last convolutional layer and the following fully connected layer, thus avoiding crop or warp operations on the proposals. In short, the SPP layer accepts inputs of different sizes: it pools the last convolutional feature map into a fixed-size representation that matches the subsequent fully connected layers.
2. Because SPP-net supports input images of different sizes, the features it extracts have better scale invariance, which reduces the risk of overfitting during training.
3. R-CNN requires one forward CNN pass for every proposal of every image during both training and testing; with 2000 proposals this means 2000 forward passes. SPP-net, however, only needs a single forward CNN pass over the whole image: it computes the last convolutional feature map once, and then uses the SPP layer, based on the spatial correspondence between the image and the feature map, to extract the features of each proposal. SPP-net is therefore 24 to 102 times faster than R-CNN, with higher accuracy. (From the original SPP-net paper: there are 5 convolutional layers before the SPP layer, and positions in the conv5 output feature map can be mapped back to the original image. For example, the bottom-left wheel in the first figure corresponds to the activated region marked "^" in its conv5 map. Based on this property, SPP-net only needs one forward convolution over the whole image; after obtaining the conv5 features, it extracts the corresponding features for each proposal.)
Spp-layer principle:
In R-CNN, conv5 is followed by pool5; in SPP-net, the SPP layer replaces the original pool5. The goal is to make input images of different sizes produce feature vectors of the same length after the SPP layer. The principle is as follows.
SPP is similar to pyramid pooling: we first fix the sizes of the pooled feature maps we want, for example 4*4 bins, 3*3 bins, 2*2 bins and 1*1 bins, and we know the size of the conv5 output (for example, 256 feature maps of size 13*13). Then, for a 13*13 feature map, spatial pyramid pooling (SPP) produces the outputs as follows: with window = ceil(13/4) = 4 and stride = floor(13/4) = 3 we get the 4*4 bins; with window = ceil(13/3) = 5 and stride = floor(13/3) = 4 we get the 3*3 bins; with window = ceil(13/2) = 7 and stride = floor(13/2) = 6 we get the 2*2 bins; with window = ceil(13/1) = 13 and stride = floor(13/1) = 13 we get the 1*1 bin. The output of the SPP layer is therefore a vector of length 256*(4*4+3*3+2*2+1*1) = 256*30. It is easy to see that the key to implementing SPP is computing, from the width and height of the conv5 output feature map and the target number of bins at each pyramid level, the corresponding pooling window size and stride.
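To make the window/stride arithmetic concrete, here is a small NumPy sketch of an SPP layer under the assumptions above (max pooling, ceil for the window, floor for the stride); it is a simplified illustration, not the original Caffe implementation.

```python
import math
import numpy as np

def spp_layer(feature_map, levels=(4, 3, 2, 1)):
    """Spatial pyramid pooling over one conv feature map.
    feature_map: (C, H, W) array; levels: target number of bins per side at each level."""
    C, H, W = feature_map.shape
    outputs = []
    for n in levels:
        win_h, win_w = math.ceil(H / n), math.ceil(W / n)
        str_h, str_w = math.floor(H / n), math.floor(W / n)
        for i in range(n):
            for j in range(n):
                patch = feature_map[:,
                                    i * str_h: i * str_h + win_h,
                                    j * str_w: j * str_w + win_w]
                outputs.append(patch.reshape(C, -1).max(axis=1))  # max pool per channel
    return np.concatenate(outputs)  # 30 bins * C values for the default levels

# e.g. a 256-channel 13*13 conv5 map gives a 256*30 = 7680-dimensional vector
vec = spp_layer(np.random.rand(256, 13, 13))
assert vec.shape == (7680,)
```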
The original authors trained in two different ways: 1. training SPP-net with images of a single fixed size; 2. training SPP-net with images of different sizes. The experiments show that training with input images of different sizes gives better results.
Spp-net +SVM Training:
Selective search extracts a series of proposals. Since SPP-net has already been trained, we first feed the whole image into SPP-net and obtain the conv5 output. Next, unlike R-CNN, the new method does not crop or warp proposals of different sizes; instead, it directly computes, from the relative position of each proposal in the image, the corresponding region in the conv5 output of the whole image. So for 2000 proposals we run conv1 → conv5 only once, map onto the conv5 feature map 2000 times, and through the SPP layer obtain 2000 output vectors of the same length; the final convolutional features of the 2000 proposals are then produced by the fully connected layers. Next, similar to R-CNN, the SVM training uses strict labels for all proposals (a candidate box is labeled as the target if and only if it completely contains the ground-truth region and the part not belonging to the ground truth does not exceed, e.g., 5% of the candidate box area; otherwise it is labeled as background), and then all proposal features together with the new SVM labels are fed into the SVM classifiers to train the prediction model.
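A minimal sketch of the spatial mapping from an image-space proposal to its conv5 region, assuming an effective stride of 16 pixels per conv5 cell (as for VGG/ZF-style backbones); the simple rounding used here is an approximation of the exact offset formula in the SPP-net paper.

```python
def proposal_to_conv5(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to conv5 feature-map coordinates,
    assuming an effective stride of `stride` pixels per feature cell."""
    x1, y1, x2, y2 = box
    fx1 = int(round(x1 / stride))
    fy1 = int(round(y1 / stride))
    fx2 = max(fx1, int(round(x2 / stride)))   # keep at least a 1-cell region
    fy2 = max(fy1, int(round(y2 / stride)))
    return fx1, fy1, fx2, fy2

# a 320*240 proposal starting at (64, 48) maps to roughly a 20*15 cell window
print(proposal_to_conv5((64, 48, 64 + 320, 48 + 240)))  # -> (4, 3, 24, 18)
```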
Of course, if training SVMs feels troublesome, you can add a softmax layer directly after SPP-net and use the strict labels to train the parameters of this final softmax layer.
3. Fast R-CNN
Building on the ideas of R-CNN and SPP-net, RBG proposed the Fast R-CNN algorithm. With VGG16 as the feature extractor, Fast R-CNN is 9 times and 3 times faster than R-CNN and SPP-net respectively in the training stage, and 213 times and 10 times faster respectively in the test stage.
R-CNN and Spp-net Disadvantages:
1. The training processes of R-CNN and SPP-net are similar: they are carried out in several stages and the whole pipeline is complicated. Both methods first use selective search to extract proposals, then use a CNN for feature extraction, and finally train SVM classifiers; on top of this one can further learn bounding-box regression.
2. The time and space costs of R-CNN and SPP-net are high. In the feature extraction stage, SPP-net only needs one forward CNN pass over the whole image and then obtains each proposal's CNN features through spatial mapping; R-CNN, in contrast, needs one forward CNN pass per proposal. Given the large number of proposals (~2000), the time cost of R-CNN feature extraction is high. Moreover, the features used by R-CNN and SPP-net to train the SVM classifiers must be written to disk in advance; since the total size of the CNN features of 2000 proposals is considerable, the space cost is also high.
3. R-CNN detection is very slow. In the feature extraction stage, R-CNN needs one forward CNN pass per proposal; with VGG for feature extraction, processing all proposals of one image takes 47 s.
4. The training of the feature-extraction CNN and the training of the SVM classifiers are sequential in time and independent of each other, so the SVM training loss cannot update the convolutional parameters before the SPP layer. Therefore, even with a deeper CNN for feature extraction, there is no guarantee that the accuracy of the SVM classifiers will improve.
Fast R-CNN highlights:
1. Fast R-CNN achieves better detection results than R-CNN and SPP-net.
2. The training procedure is simple and based on a multi-task loss; no separate SVM classifier needs to be trained.
3. Fast R-CNN can update the network parameters of all layers (with the RoI layer there is no longer any need for an SVM classifier, which makes end-to-end training of the whole network possible).
4. No features need to be cached to disk.
FAST-R-CNN Architecture:
The architecture of Fast R-CNN is shown in the figure (https://github.com/rbgirshick/fast-rcnn/blob/master/models/VGG16/train.prototxt; you can refer to this link to understand the network model). The input is an image together with a series of proposals generated by selective search. A feature map is produced by a series of convolutional and pooling layers; the RoI (Region of Interest) pooling layer then processes the last convolutional feature map to generate a fixed-length feature vector roi_pool5 for each proposal. The RoI layer output roi_pool5 is fed into fully connected layers to produce the features used for multi-task learning and for computing the multi-task loss. The fully connected output feeds two branches: 1. softmax loss: the classification loss over K+1 classes, where K is the number of target categories and 1 is the background; 2. regression loss: the four bounding-box coordinate values of the proposal for each of the K+1 classes. Finally, all results are post-processed with non-maximum suppression to produce the final detection and recognition output.
3.1 RoI Pooling Layer
In fact, the RoI pooling layer is a simplified form of the SPP layer. The SPP layer pools at multiple pyramid scales, while the RoI layer uses only one scale, e.g. the 7*7 grid described in the paper. For an RoI layer input of shape (r, c, h, w), the RoI layer divides each h*w region into a 7*7 grid of blocks of size roughly (h/7)*(w/7) and applies max pooling within each block, so the output of the RoI layer is r*c*7*7.
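The single-scale RoI max pooling can be sketched as follows; the bin partitioning with np.linspace is a simplification of the exact integer arithmetic used in the official Fast R-CNN code.

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Single-scale RoI max pooling (a simplified sketch).
    feature_map: (C, H, W) conv feature map; roi: (x1, y1, x2, y2) in feature-map cells."""
    C = feature_map.shape[0]
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    h, w = region.shape[1], region.shape[2]
    out = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    # split the region into a 7*7 grid of (possibly unequal) bins and max-pool each bin
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            bin_ = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = bin_.reshape(C, -1).max(axis=1)
    return out  # shape (C, 7, 7): the same length for every RoI, whatever its size

pooled = roi_pool(np.random.rand(512, 38, 50), roi=(10, 5, 30, 20))
assert pooled.shape == (512, 7, 7)
```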
3.2 Pre-Training network initialization
RBG initializes all layers before the RoI layer in the Fast R-CNN model with a network pre-trained on ImageNet (e.g. the VGG16 model). The network structure can be summarized as: 13 convolutional layers + 4 pooling layers + RoI layer + 2 FC layers + two sibling output layers (i.e. the SoftmaxLoss and SmoothL1Loss layers). In particular, the 5th pooling layer of VGG16 is replaced by the RoI layer.
3.3 Finetuning for detection
3.3.1 Fast R-CNN uses a trick in the training phase: each minibatch consists of R = 128 proposals taken from N = 2 images (64 per image). This is roughly 64 times faster than sampling 1 proposal from each of 128 different images, although it may sometimes slow down convergence. In addition, Fast R-CNN does not need an SVM classifier; instead it jointly trains a softmax classifier and the bounding-box regressors, updating all parameters. Note: when selecting 128 proposals from 2 images, at least 25% of the proposals must have IoU greater than 0.5 with a ground-truth box; all the rest are treated as background. No additional data augmentation is required.
3.3.2 Multi-task loss: the Fast R-CNN network has two sibling output layers, for classification and regression respectively. Classification uses a softmax loss and regression uses a smooth L1 loss; the two are weighted 1:1.
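A small sketch of the two loss terms, assuming a single RoI with a softmax probability vector and 4 regression outputs for its ground-truth class; the real implementation normalizes the regression targets and averages over the minibatch, which is omitted here.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss used for bounding-box regression:
    0.5 * x^2 if |x| < 1, |x| - 0.5 otherwise (summed over the 4 coordinates)."""
    x = np.abs(pred - target)
    per_coord = np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)
    return per_coord.sum()

def multitask_loss(cls_probs, cls_label, bbox_pred, bbox_target, lam=1.0):
    """Softmax classification loss + lambda * smooth L1 regression loss.
    The regression term only counts for non-background RoIs (label 0 = background)."""
    cls_loss = -np.log(cls_probs[cls_label] + 1e-12)
    reg_loss = smooth_l1(bbox_pred, bbox_target) if cls_label > 0 else 0.0
    return cls_loss + lam * reg_loss
```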
3.3.3 SGD hyper-parameters: the FC-layer parameters used for the softmax classification task and for the bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations of 0.01 and 0.001 respectively.
3.4 Truncated SVD for fast detection
In the detection stage, RBG uses truncated SVD to compress the larger FC layers, which speeds up the detection end when the number of RoIs is large.
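A minimal sketch of how truncated SVD splits one large FC layer into two smaller ones; the layer sizes and the value of t below are illustrative, not the ones used in the paper.

```python
import numpy as np

def truncated_svd_fc(W, t):
    """Approximate one FC layer weight matrix W (u*v) by two smaller layers
    using its top-t singular values: W ~ U_t @ diag(S_t) @ Vt_t."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(S[:t]) @ Vt[:t, :]   # first new layer: t*v weights, no bias
    W2 = U[:, :t]                     # second new layer: u*t weights, keeps the original bias
    return W1, W2

# parameter count drops from u*v to t*(u + v)
u = v = 1024
W, b, x = np.random.randn(u, v), np.zeros(u), np.random.randn(v)
W1, W2 = truncated_svd_fc(W, t=128)
approx = W2 @ (W1 @ x) + b            # approximates the original W @ x + b
```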
FAST-R-CNN Experimental Conclusion:
1. The multi-task loss learning method improves the accuracy of the algorithm.
2. Training Fast R-CNN on multi-scale images only improves mAP slightly compared with single-scale training, while the time cost increases a lot. Therefore, weighing training time against mAP, the author proposes training Fast R-CNN directly on single-scale images.
3. Few would disagree that the more training images there are, the higher the model accuracy becomes.
4. RBG's results show that the softmax loss performs slightly better than the SVM classifiers. This does not prove softmax is superior in any absolute sense, but at least we no longer need the trouble of training a separate set of SVMs for detection and recognition.
5. It is not the case that more proposals always give better results; too many proposals can actually cause mAP to drop.
4. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
In Fast R-CNN described above, the first step is still to extract proposals from the image with selective search. CPU-based selective search takes about 2 s to extract all proposals of one image. Once proposals are given, Fast R-CNN can detect targets nearly in real time; but from an end-to-end perspective, proposal extraction is obviously the bottleneck of the whole pipeline. Although the newer EdgeBoxes algorithm improves the accuracy and efficiency of candidate-box extraction to some extent, it still needs 0.2 s per image. Therefore, Ren Shaoqing proposed the Faster R-CNN algorithm, which introduces a Region Proposal Network (RPN) to extract proposals. The RPN is a fully convolutional network that shares convolutional features with the detection network, so extracting the proposals of one image takes only about 10 ms.
The Faster R-CNN algorithm consists of two modules: 1. the RPN candidate-box extraction module; 2. the Fast R-CNN detection module. The RPN is a fully convolutional network that extracts candidate boxes; Fast R-CNN then detects and recognizes the targets in the proposals produced by the RPN.
4.1 Region proposal Network (RPN)
The input to the RPN can be an image of any size (though still above the minimum resolution the backbone requires, e.g. roughly 224*224 for VGG). If VGG16 is used for feature extraction, the RPN network can be denoted as VGG16 + RPN.
VGG16: see https://github.com/rbgirshick/py-faster-rcnn/blob/master/models/pascal_voc/VGG16/faster_rcnn_end2end/train.prototxt. The part of VGG16 used for feature extraction is the 13 convolutional layers (conv1_1 → conv5_3), excluding pool5 and everything after it.
RPN: the RPN is the part the authors focus on, as shown in the figure. Implementation: an n*n sliding window (the authors choose n = 3, i.e. a 3*3 sliding window) is applied to the conv5_3 feature map, producing at each position a feature of length 256 (for the ZF network) or 512 (for the VGG network). This feature then feeds two sibling fully connected branches: 1. a reg-layer, which predicts the center coordinates x, y and width/height w, h of the proposals relative to the anchor point; 2. a cls-layer, which decides whether each proposal is foreground or background. The sliding-window processing ensures that the reg-layer and cls-layer are associated with the entire conv5_3 feature space. It is natural to think of implementing the RPN with fully connected layers, but the authors actually implement these fully connected functions with convolutional layers. A personal understanding: a fully connected layer is a special convolutional layer. The mapping from conv5_3 to the first 256- or 512-dimensional feature can be implemented with a convolutional layer with num_output = 256 or 512, kernel_size = 3*3, stride = 1; then two convolutional layers with num_output = 2*9 = 18 and 4*9 = 36, kernel_size = 1*1, stride = 1 implement the mappings to the cls branch and the reg branch respectively. Note: the 2 in 2*9 corresponds to the two classes (foreground and background), and the 4 in 4*9 corresponds to the four parameters of a proposal: the center coordinates x, y and the width/height w, h. Implementing the fully connected computation with convolutions does not reduce the number of parameters, but it makes the input image size more flexible. In the RPN, the concepts to understand carefully are the anchors, the loss functions, and the details of how the RPN training data are generated.
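The convolutional implementation described above can be sketched in PyTorch as follows (a re-expression for illustration only; the original model is defined in Caffe prototxt, and the 38*63 feature-map size below is just an example for a roughly 600*1000 input).

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """RPN head on top of conv5_3: a 3*3 conv produces the 512-d sliding-window
    feature, then two 1*1 convs produce the cls and reg outputs for k anchors."""
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)   # fg/bg scores per anchor
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)   # (x, y, w, h) offsets per anchor
        self.relu = nn.ReLU(inplace=True)

    def forward(self, conv5_3):
        x = self.relu(self.conv(conv5_3))
        return self.cls(x), self.reg(x)

# a ~600*1000 input gives roughly a 38*63 conv5_3 map with VGG16
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 63))
print(scores.shape, deltas.shape)   # (1, 18, 38, 63) and (1, 36, 38, 63)
```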
Anchors: literally an anchor point, located at the center of the n*n sliding window mentioned above. For each sliding-window position we predict multiple proposals simultaneously, say k of them. The k proposals correspond to k reference boxes; each reference box is uniquely determined by a scale, an aspect ratio, and the anchor point at the sliding-window center. So when we speak of "an anchor" below, think of it as an anchor box or a reference box. The paper uses k = 9, i.e. 3 scales and 3 aspect ratios determine the 9 reference boxes at the current sliding-window position; the reg-layer then has 4*k outputs and the cls-layer 2*k outputs. For a w*h feature map there are w*h*k anchors in total. All anchors share this multi-scale, position-independent design.
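A simplified sketch of generating the k = 9 reference boxes for one sliding-window position, similar in spirit to the anchor generation in the released py-faster-rcnn code (whose exact rounding and offsets differ slightly); the scales and ratios below follow the three scales and three aspect ratios mentioned in the paper.

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = len(scales)*len(ratios) reference boxes centred on one
    sliding-window position, as (x1, y1, x2, y2) around the cell centre."""
    cx = cy = (base_size - 1) / 2.0
    anchors = []
    for ratio in ratios:
        for scale in scales:
            # box area = (base_size*scale)^2, with aspect ratio h/w = ratio
            w = base_size * scale / np.sqrt(ratio)
            h = base_size * scale * np.sqrt(ratio)
            anchors.append([cx - (w - 1) / 2, cy - (h - 1) / 2,
                            cx + (w - 1) / 2, cy + (h - 1) / 2])
    return np.array(anchors)          # shape (9, 4) for the defaults

print(generate_anchors().round(1))    # 9 reference boxes with areas ~128^2, 256^2, 512^2
```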
Loss functions: before computing the loss, the authors define how anchors are labeled. Positive-sample rules: 1. if an anchor's reference box has the highest IoU with some ground-truth box, it is labeled positive; 2. if an anchor's reference box has IoU greater than 0.7 with a ground-truth box, it is labeled positive. In practice the 2nd rule usually finds enough positives, but in extreme cases where no anchor's reference box has IoU greater than 0.7 with the ground truth, the 1st rule still generates positives. Negative-sample rule: if an anchor's reference box has IoU < 0.3 with the ground truth, it is labeled negative. The remaining anchors are neither positive nor negative and are not used for training. The RPN training loss is a weighted combination of a classification loss (softmax loss) and a regression loss (smooth L1 loss). The softmax loss needs the anchors' ground-truth labels and predicted scores; the regression loss needs three groups of quantities: 1. the predicted box, i.e. the proposal center coordinates x, y and width/height w, h predicted by the RPN; 2. the anchor reference box: each of the 9 reference boxes (with different scales and aspect ratios) has center coordinates x_a, y_a and width/height w_a, h_a; 3. the ground-truth box, with center coordinates x*, y* and width/height w*, h*. The regression loss and the total loss are computed as follows:
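The formulas referenced above are reconstructed here from the Faster R-CNN paper (the original post presumably showed them as images). The regression targets are parameterized relative to the anchor box:

$$
t_x = \frac{x - x_a}{w_a},\quad t_y = \frac{y - y_a}{h_a},\quad t_w = \log\frac{w}{w_a},\quad t_h = \log\frac{h}{h_a}
$$
$$
t^*_x = \frac{x^* - x_a}{w_a},\quad t^*_y = \frac{y^* - y_a}{h_a},\quad t^*_w = \log\frac{w^*}{w_a},\quad t^*_h = \log\frac{h^*}{h_a}
$$
$$
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p^*_i) \;+\; \lambda\,\frac{1}{N_{reg}}\sum_i p^*_i\, L_{reg}(t_i, t^*_i)
$$

where $p_i$ is the predicted probability that anchor $i$ is an object, $p^*_i$ is its label (1 for positive, 0 for negative), $L_{cls}$ is the softmax (log) loss, and $L_{reg}$ is the smooth L1 loss over the four coordinates, counted only for positive anchors.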
RPN training settings: when training the RPN, a minibatch consists of 256 anchors sampled from one image, with a 1:1 ratio of positives to negatives. If there are fewer than 128 positives, more negatives are used to fill the 256 training samples, and vice versa. The shared convolutional parameters can be copied directly from the VGG model, and the remaining layers are initialized from a Gaussian distribution with standard deviation 0.01.
4.2 RPN and Fast R-CNN feature sharing
After the RPN extracts the proposals, the author uses Fast R-CNN for the final detection and recognition. RPN and Fast R-CNN share 13 VGG convolutional layers, so training the two networks in complete isolation would not be a wise choice; the authors instead share the convolutional features through an alternating training scheme:
Alternating training: STEP 1: train the RPN; STEP 2: train Fast R-CNN using the proposals extracted by the RPN; STEP 3: use Fast R-CNN to initialize the convolutional layers used by the RPN. Steps 1-3 are iterated until training ends. This training scheme is the one used in the paper. Note: in the first iteration, the convolutional layers of both RPN and Fast R-CNN are initialized from the ImageNet pre-trained model; from the second iteration on, when training the RPN, the shared convolutional layers are initialized with Fast R-CNN's shared convolutional parameters, and only the non-shared convolutional layers and the other RPN layers are fine-tuned; when training Fast R-CNN, the convolutional layers shared with the RPN are kept fixed and only the non-shared layers are fine-tuned. In this way the two networks share the convolutional features during training. For the corresponding network models, see https://github.com/rbgirshick/py-faster-rcnn/tree/master/models/pascal_voc/VGG16/faster_rcnn_alt_opt
4.3 Digging deeper
1. Because the proposals from selective search have different scales, the RoIs produced in Fast R-CNN or SPP-net also have different scales, and the RoI pooling layer or SPP layer turns them into fixed-size pyramid features. In this process, the regression weights that produce the final proposal coordinates effectively share the whole feature map, so the trained network is quite accurate. In the RPN approach, however, the RoIs are generated from k anchors with k different scales and aspect ratios, so k separate regressors are learned during training. These regressors do not share the whole feature map across scales, yet the trained network is also accurate. I have no good explanation for this; if you have questions, go ask the anchors.
2. Using images of different resolutions can improve accuracy to some extent, but it also slows down training. With VGG16, the 13th convolutional layer's feature map is reduced to at most 1/16 of the original size in each dimension (in practice even smaller once kernel_size effects are considered), yet the final detection and recognition results are still surprisingly good.
3. Three scales (128*128, 256*256, 512*512) and three aspect ratios (1:2, 1:1, 2:1). Although the scale range is large and the combination feels somewhat arbitrary, the final results are still very good.
4. During training (e.g. with 600*1000 input images), if the boundary of a reference box (i.e. anchor box) exceeds the image boundary, that anchor does not contribute to the training loss, i.e. its loss is ignored. A 600*1000 image yields roughly a 40*60 feature map after VGG16, so the number of anchors is about 40*60*9, approximately 20,000 anchor boxes. Removing the anchor boxes that cross the image boundary leaves about 6,000. There is still a lot of overlap among them, so non-maximum suppression is used to merge regions with IoU > 0.7, leaving about 2,000 anchor boxes. (In the same spirit, at the final detection stage one can keep boxes whose probability exceeds a threshold p and merge boxes, not anchor boxes this time, whose IoU exceeds a threshold T with non-maximum suppression.) During each training epoch, 256 anchor boxes are randomly sampled from the remaining anchors of an image as a minibatch to train the RPN; a small sketch of the tiling and boundary filtering follows below.
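The anchor tiling and cross-boundary filtering in item 4 can be sketched as follows; the single 128*128 base box (taken from the earlier anchor sketch) and the 40*60 feature-map size are illustrative assumptions.

```python
import numpy as np

def shift_anchors(base_anchors, feat_h, feat_w, stride=16):
    """Tile the k base reference boxes over every feature-map cell."""
    sx, sy = np.meshgrid(np.arange(feat_w) * stride, np.arange(feat_h) * stride)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
    return (base_anchors[None, :, :] + shifts[:, None, :]).reshape(-1, 4)

def keep_inside(anchors, img_h, img_w):
    """Ignore anchors that cross the image boundary during RPN training."""
    inside = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
              (anchors[:, 2] <= img_w - 1) & (anchors[:, 3] <= img_h - 1))
    return anchors[inside]

base = np.array([[-56.0, -56.0, 71.0, 71.0]])              # one illustrative 128*128 box
all_anchors = shift_anchors(base, feat_h=40, feat_w=60)    # 40*60*1 anchors
print(len(all_anchors), len(keep_inside(all_anchors, 600, 1000)))
```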
4.4 Experiments
1. PASCAL VOC 2007: with ZF-Net used to train the RPN and Fast R-CNN, the accuracies of SelectiveSearch + Fast R-CNN, EdgeBoxes + Fast R-CNN, and RPN + Fast R-CNN are 58.7%, 58.6%, and 59.9% respectively. Selective search and EdgeBoxes extract 2,000 proposals while the RPN extracts at most 300, so the RPN, with its shared convolutional features, clearly has an advantage in efficiency.
2. Training RPN + Fast R-CNN with VGG, without and with feature sharing, gives accuracies of 68.5% and 69.9% respectively (VOC 2007). In addition, with VGG, processing 2,000 proposals takes 320 ms per image (223 ms with the truncated-SVD optimization), whereas the entire forward pass of Faster R-CNN (RPN + Fast R-CNN) takes only 198 ms.
3. The number of anchor scales and aspect ratios does not have a significant effect on the results, but for the stability of the algorithm it is recommended to set both parameters to appropriate values.
4. When the number of proposals from selective search or EdgeBoxes is reduced from 2,000 to 300, the recall-vs-IoU curve of Fast R-CNN drops significantly; but when the number of proposals extracted by the RPN is reduced from 2,000 to 300, the recall-vs-IoU curve stays relatively stable.
4.5 Summary
Training RPN + Fast R-CNN with shared features achieves excellent detection results. The feature sharing is essentially "buy one, get one free": extracting proposals with the RPN adds almost no time cost, and the quality of the proposals improves. Therefore, the alternately trained RPN + Fast R-CNN of Faster R-CNN outperforms the original SelectiveSearch + Fast R-CNN.
5. YOLO: You Only Look Once: Unified, Real-Time Object Detection
YOLO is a convolutional neural network that predicts multiple box positions and classes at once, enabling end-to-end detection and recognition; its biggest advantage is speed. In essence, YOLO treats object detection as regression, so a CNN that performs regression does not need a complicated design pipeline. YOLO does not use sliding windows or proposal extraction to train the network; instead it trains directly on full images. The benefit is that the model better distinguishes targets from background; in contrast, Fast R-CNN, trained on proposals, often mistakes background regions for specific targets. Of course, YOLO sacrifices some accuracy for its detection speed. The YOLO detection pipeline is: 1. resize the image to 448*448; 2. run the CNN; 3. apply non-maximum suppression to the detection results. Interested readers can follow the instructions at http://pjreddie.com/darknet/install/ to install and test YOLO's scoring process; it is very easy to get started. Next, we focus on the principle of YOLO.
5.1 Unified detection scheme
YOLO's design philosophy is end-to-end training and real-time detection. YOLO divides the input image into an S*S grid; if the center of an object falls inside a grid cell, that cell is responsible for detecting the object. During training and testing, each cell predicts B bounding boxes, and each bounding box has 5 predicted parameters: the center coordinates (x, y), the width and height (w, h), and a confidence score. The confidence score, Pr(object) * IoU(pred, truth), reflects both the probability that the box contains an object under the current model, Pr(object), and how accurately the box localizes it, IoU(pred, truth). If there is no object in the bounding box, Pr(object) = 0; if there is an object, the IoU is computed between the predicted and ground-truth boxes, and the posterior probability of the object belonging to each class, Pr(class_i | object), is also predicted. Assuming there are C classes in total, each grid cell predicts only one set of conditional class probabilities Pr(class_i | object), i = 1, 2, ..., C, while predicting the positions of B bounding boxes; i.e. the B bounding boxes share one set of conditional class probabilities. At test time, a class-specific confidence score for each bounding box can then be computed: Pr(class_i | object) * Pr(object) * IoU(pred, truth) = Pr(class_i) * IoU(pred, truth). If the input image is divided into a 7*7 grid (S = 7), each cell predicts 2 bounding boxes (B = 2), and there are 20 target classes (C = 20), the final prediction is a vector of length S*S*(B*5 + C) = 7*7*30, which completes the detection + recognition task. The whole process can be understood through the figure.
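A small NumPy sketch of decoding the S*S*(B*5+C) output into the class-specific confidences described above; the memory layout (the B boxes first, then the C class probabilities, per cell) is an assumption for illustration and varies between implementations.

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, number of classes

def class_confidences(pred):
    """Decode a YOLO output vector of length S*S*(B*5 + C) into per-box
    class-specific confidences: box_conf * Pr(class_i | object) = Pr(class_i) * IoU."""
    pred = pred.reshape(S, S, B * 5 + C)
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)     # (x, y, w, h, confidence) per box
    class_probs = pred[..., B * 5:]                   # Pr(class_i | object) per cell
    box_conf = boxes[..., 4]                          # Pr(object) * IoU per box
    # the B boxes in a cell share that cell's class probabilities
    return box_conf[..., None] * class_probs[:, :, None, :]   # shape (S, S, B, C)

scores = class_confidences(np.random.rand(S * S * (B * 5 + C)))
assert scores.shape == (7, 7, 2, 20)      # 98 boxes, each with 20 class scores
```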
5.1.1 Network Design
The YOLO network design follows GoogLeNet's ideas but differs from it. YOLO uses 24 cascaded convolutional (conv) layers and 2 fully connected (FC) layers; the conv layers use 3*3 and 1*1 kernels, and the last FC layer is the output of the network, with length S*S*(B*5 + C) = 7*7*30. In addition, the author designed a simplified YOLO-small network with 9 cascaded conv layers and 2 FC layers; with far fewer conv layers, YOLO-small is much faster than YOLO. The architecture of the YOLO network is shown in the figure.
5.1.2 Training
The author trains the YOLO network in stages: first, the first 20 conv layers of the network are taken, an average pooling layer and an FC layer are appended, and this network is pre-trained on the 1000-class ImageNet data; training with 224*224 images on ImageNet 2012 gives a top-5 accuracy of 88%. Then 4 new conv layers and 2 FC layers are added after the 20 pre-trained conv layers, the newly added layers are randomly initialized, and the network is fine-tuned on 448*448 images. The last FC layer predicts the class probabilities and the bounding-box center coordinates x, y and width/height w, h. The bounding-box width and height are normalized by the image width and height, and the box center coordinates are normalized relative to the grid-cell position, so x, y, w, h all lie between 0 and 1.
There are two main problems in designing the loss function: 1. for the 7*7*30 prediction output, the natural choice is a sum-of-squared-errors loss, but then the localization error and the classification error are weighted 1:1; 2. the image has 7*7 grid cells and most cells do not contain any object (an object is counted in a cell only if its center falls in that cell), so most cells have classification probability 0; the loss then behaves like a sparse matrix, converges poorly, and makes the model unstable. To solve these problems, the author adopts the following measures:
1. Increase the loss weight of the bounding-box coordinate predictions and decrease the loss weight of the confidence predictions for boxes that contain no object. The weights are λ_coord = 5 and λ_noobj = 0.5 respectively.
2. The sum-of-squared-errors loss weights large and small bounding boxes equally. To reduce the impact of the same absolute error on boxes of different sizes, the width and height losses are computed on square roots, i.e. on sqrt(w) and sqrt(h).
The full training loss combines these terms in a fairly involved way; interested readers can work through the author's original paper (a reconstruction is given below for reference).
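For reference, the loss can be reconstructed from the original YOLO paper as follows, where $\mathbb{1}^{obj}_{ij}$ indicates that the j-th box predictor of cell i is responsible for an object and $\mathbb{1}^{obj}_{i}$ that cell i contains an object (this is copied from the paper, not from this post's author):

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}^{obj}_{ij}\big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\big] \\
 &+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}^{obj}_{ij}\big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\big] \\
 &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}^{obj}_{ij}(C_i-\hat{C}_i)^2
  + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}^{noobj}_{ij}(C_i-\hat{C}_i)^2 \\
 &+ \sum_{i=0}^{S^2}\mathbb{1}^{obj}_{i}\sum_{c\in classes}(p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$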
5.1.3 Test
The author trains and tests YOLO on PASCAL VOC images; each image yields 98 (7*7*2) bounding boxes and the corresponding class probabilities. Usually one cell can directly predict an object's bounding box, but for objects that are large or close to the boundary of several cells, multiple cells may produce predictions for the same object, and these are merged by non-maximum suppression. Although YOLO depends on non-maximum suppression less than R-CNN and DPM do, NMS still increases mAP by 2 to 3 points.
5.2 Method Comparison
The author compares the Yolo target detection and recognition method with several other classical schemes:
DPM (Deformable Parts Models): DPM is a sliding-window detection method whose pipeline consists of several independent stages: feature extraction, region scoring, and bounding-box prediction on high-scoring regions. YOLO instead uses end-to-end training, connecting feature extraction, box prediction, non-maximum suppression and recognition, which yields a faster and more accurate detection model.
R-CNN: the R-CNN approach must first extract proposals with selective search, then extract features with a CNN, and finally train SVM classifiers; a cumbersome pipeline indeed! YOLO does something similar in essence, but proposal generation and recognition are both obtained from shared convolutional features. In addition, YOLO constrains the proposals spatially with its grid, avoiding repeated proposals in the same region; compared with the 2,000 selective-search proposals used to train R-CNN, YOLO needs only 98, so training and testing are naturally much faster.
Fast R-CNN, Faster R-CNN, Fast DPM: Fast R-CNN and Faster R-CNN replace SVM training and selective-search proposal extraction respectively, which speeds up training and testing to some extent, but their speed still cannot match YOLO's. Similarly, implementing DPM on the GPU is still no match for YOLO.
5.3 Experiments
5.3.1 Real-time detection and recognition system comparison
5.3.2 Comparison of accuracy on VOC 2007
5.3.3 FAST-R-CNN and YOLO Error analysis
As shown in the figure, different regions represent different indicators:
Correct: correctly detected and recognized, i.e. correct class and IoU > 0.5
Localization: correct class, but 0.1 < IoU < 0.5
Similar: the class is similar, IoU > 0.1
Other: wrong class, IoU > 0.1
Background: IoU < 0.1 with any target
As can be seen, YOLO is less accurate than Fast R-CNN at localizing targets: localization errors account for the largest share of YOLO's errors, about 10 points more than in Fast R-CNN. However, YOLO makes fewer mistakes on the background; Fast R-CNN's false positive rate is higher (background = 13.6%, i.e. a box is reported as a target but actually contains no object).
5.3.4 Comparison of accuracy on VOC 2012
Because YOLO makes noticeably different kinds of errors during detection and recognition, the author designs a Fast R-CNN + YOLO combined scheme: first extract a set of bounding boxes with Fast R-CNN, then run YOLO on the image to obtain another set of bounding boxes. If a pair of boxes from the two sets is basically consistent, the target is classified according to the probability computed by YOLO, and the final bounding-box region is the intersection of the two boxes. The best Fast R-CNN model alone reaches 71.8% accuracy, and the Fast R-CNN + YOLO combination raises this to 75%. This improvement is possible precisely because YOLO makes different errors at test time than Fast R-CNN. Although Fast R-CNN + YOLO improves accuracy, the detection speed drops considerably, so the combination cannot detect in real time.
On VOC 2012, the mean average precision (mAP) of YOLO is 57.9%, comparable to the VGG16-based R-CNN detection algorithm. Looking at results for objects of different sizes, the author finds that YOLO's accuracy on small objects is about 8-10% lower than R-CNN's, while on large objects it is higher than R-CNN's. Fast R-CNN + YOLO achieves the highest accuracy, 2.3% higher than Fast R-CNN alone.
5.4 Summary
Yolo is a convolutional neural network that supports end-to-end training and testing, and can detect and recognize multiple targets in images under the premise of guaranteeing certain accuracy.
6. SSD: Single Shot MultiBox Detector
A review of the R-CNN, SPP-net, Fast R-CNN, Faster R-CNN, YOLO, and SSD series of deep-learning detection methods.