Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation (notes)

Source: Internet
Author: User
Tags: svm
0-Background

This paper is a classic CVPR 2014 paper. Its model, R-CNN (Regions with Convolutional Neural Network features), was the state-of-the-art model in object detection at the time.

1-Related Knowledge

1.1-Selective Search

This algorithm generates a set of coarse candidate regions. For details, see my other blog post, Selective Search for Object Recognition (notes).

1.2-Unsupervised Pre-training & Supervised Pre-training

1.2.1-Unsupervised Pre-training

Stacked autoencoders and DBMs (deep Boltzmann machines) use unsupervised pre-training; in the pre-training phase, samples need no manual labels. (Details to be supplemented later.)

1.2.2-Supervised Pre-training

Supervised pre-training is also called transfer learning: after the network is trained on another dataset, its parameters are used to initialize the network for the current task. This greatly improves accuracy compared with random initialization.

In this paper, far more training data is available for image classification than for object detection. Pre-training the network on a classification dataset and using those parameters to initialize the detection network is therefore a highlight of this paper.

1.3-IoU

IoU measures the overlap between a bounding box produced by the algorithm and the ground-truth box. The formula is $IoU = \frac{A \cap B}{A \cup B}$, equivalent to $IoU = \frac{S_I}{S_A + S_B - S_I}$, where $S_I$ is the intersection area and $S_A$, $S_B$ are the areas of the two boxes.
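The second form of the formula can be sketched directly in Python (a minimal illustration with boxes given as (x1, y1, x2, y2) corners; the function name is mine, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    iw = max(0.0, ix2 - ix1)
    ih = max(0.0, iy2 - iy1)
    s_i = iw * ih                                        # S_I
    s_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])  # S_A
    s_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])  # S_B
    return s_i / (s_a + s_b - s_i) if s_i > 0 else 0.0
```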


1.4-Non-Maximum Suppression

For example, when detecting a target, the algorithm may produce a pile of bounding boxes, and we must decide which ones are redundant and discard them. Assume there are six bounding boxes, sorted by the classifier's class probability from small to large: A, B, C, D, E, F.

  • Start from F, the box with the highest probability, and compute the IoU overlap of each of A through E with F.
  • If the overlap of B and D with F exceeds the set threshold, discard B and D, and keep F as the first retained bounding box.
  • From the remaining boxes A, C, and E, select E (the highest probability), compute the overlap of A and C with E, discard any box whose overlap exceeds the threshold, and keep E as the second retained bounding box.
  • Repeat the process until all retained bounding boxes are found.
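The steps above amount to greedy NMS. A minimal self-contained sketch (with an inline IoU helper; the names are mine, not from the paper):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over boxes given as (x1, y1, x2, y2)."""
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0

    # Indices sorted by score, highest first (F comes before E, ..., before A).
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # keep the current highest-scoring box
        keep.append(best)
        # Drop every remaining box that overlaps it beyond the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```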


2-Overall Pipeline

First, input an image and use selective search to generate about 2000 candidate boxes. Then use a CNN to extract a 4096-dimensional feature vector from the image in each candidate box. Finally, classify each feature vector with linear SVMs. In summary, there are three steps:

  • Find candidate boxes
  • Extract a feature vector from each candidate box with the CNN
  • Classify each feature vector with linear SVMs
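The three steps can be laid out as a skeleton. Every function here is a hypothetical stub standing in for the real component, not the paper's code:

```python
def selective_search(image):
    # Stand-in for selective search; a real run yields ~2000 proposals.
    return [(0, 0, 50, 50), (10, 10, 90, 90)]

def cnn_features(image, box):
    # Stand-in for the CNN; e.g. AlexNet fc7 yields a 4096-d vector.
    return [0.0] * 4096

def svm_scores(feature):
    # Stand-in for the per-class linear SVMs; one score per class.
    return {"background": 0.0}

def detect(image):
    """The three-step pipeline: proposals -> CNN features -> SVM scores."""
    detections = []
    for box in selective_search(image):
        feature = cnn_features(image, box)
        detections.append((box, svm_scores(feature)))
    return detections
```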


2.1-Anisotropic and Isotropic Scaling

The candidate boxes generated by selective search vary in size, while a traditional CNN requires a fixed input size, so each cropped region must be rescaled, either anisotropically or isotropically.

2.1.1-Anisotropic Scaling

Regardless of the aspect ratio, scale the image directly to a fixed $227 \times 227$, as shown in (D). The advantage is simplicity; the disadvantage is that it easily distorts the target.

2.1.2-Isotropic Scaling
  • Method 1: expand the bounding box in the original image to the required scale, then crop. If the expanded box runs past the image boundary, fill the missing area with the mean pixel value of the bounding box, as shown in (B).
  • Method 2: crop the bounding box first, then pad it to the required fixed scale with a background color, again the mean pixel value of the bounding box, as shown in (C).


The paper also studies padding: rows 1 and 3 of the figure use padding = 0, while rows 2 and 4 use padding = 16. Experimentally, the author found that anisotropic scaling with padding = 16 gives the highest precision. (The author notes that the effect of image distortion is not as large as intuition suggests.)
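The two scaling modes can be sketched with a nearest-neighbour resize in NumPy. This is a minimal illustration for grayscale arrays only; the function names and the centring choice in the padded variant are mine, not from the paper:

```python
import numpy as np

def warp_anisotropic(img, size=227):
    """Ignore the aspect ratio: nearest-neighbour resize straight to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows][:, cols]

def warp_isotropic_pad(img, size=227):
    """Keep the aspect ratio, then pad the short side with the image mean."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    rows = np.arange(nh) * h // nh
    cols = np.arange(nw) * w // nw
    resized = img[rows][:, cols]
    out = np.full((size, size), img.mean(), dtype=img.dtype)  # mean-value fill
    top, left = (size - nh) // 2, (size - nw) // 2
    out[top:top + nh, left:left + nw] = resized
    return out
```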

2.2-Positive and Negative Sample Labeling

The bounding boxes produced above cannot exactly match the manually labeled ground-truth boxes, so we must label them before CNN training. The labeling rules are:

  • If the IoU between a bounding box and the ground-truth box is greater than 0.5, the bounding box is a positive sample and receives the corresponding object-category label.
  • Otherwise it is a negative sample, labeled as background.
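These rules can be sketched as follows (a minimal illustration with an inline IoU helper; the names are mine, not from the paper):

```python
def label_proposals(proposals, gt_boxes, iou_threshold=0.5):
    """Label each proposal with the class of its best-overlapping
    ground-truth box, or 'background' below the IoU threshold."""
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if inter > 0 else 0.0

    labels = []
    for box in proposals:
        best_class, best_iou = "background", 0.0
        for gt_box, gt_class in gt_boxes:
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_class, best_iou = gt_class, overlap
        labels.append(best_class if best_iou > iou_threshold else "background")
    return labels
```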
3-Training

3.1-CNN Network Architecture

The CNN used for feature extraction has two candidate architectures: AlexNet and VGG-16. In the paper's tests, AlexNet reaches 58.5% accuracy and VGG-16 reaches 66%, but VGG-16 costs about 7 times the computation of AlexNet.

3.2-CNN Supervised Pre-training

The object-detection dataset is small; with randomly initialized parameters the available training data is far from enough. Therefore the CNN is first trained on the ImageNet classification dataset, then the model structure is adjusted to fit the detection task, and fine-tuning starts directly from the classification model's parameters. (Stochastic gradient descent is used, with a learning rate of 0.001.)

3.3-Fine-tuning Stage

Take the candidate boxes generated by selective search, warp them to the required scale, and fine-tune the supervised pre-trained CNN on them. If there are N object classes to detect, replace the last layer of the pre-trained CNN with N + 1 output neurons (the extra one represents the background). This layer is randomly initialized; the other layers keep their pre-trained parameters. Training then proceeds with SGD. (Note: the SGD learning rate is 0.001 and the batch size is 128, composed of 32 positive samples and 96 negative samples.)
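The mini-batch composition used for fine-tuning (32 positives plus background windows, 128 in total) can be sketched as follows; the function name is mine, and sampling with replacement is an assumption:

```python
import random

def make_minibatch(positives, negatives, batch_size=128, num_pos=32):
    """Build one SGD mini-batch: 32 positive windows plus 96 background
    windows, sampled with replacement and shuffled."""
    batch = (random.choices(positives, k=num_pos)
             + random.choices(negatives, k=batch_size - num_pos))
    random.shuffle(batch)
    return batch
```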

3.4-Thoughts on CNN

Question 1: can AlexNet be used as a feature extractor without fine-tuning?

The paper tests this idea too. The experiments show that without fine-tuning, using the $pool_5$ output as the feature gives accuracy similar to $fc_6$ and $fc_7$; in fact, the $fc_6$ features score slightly higher than $fc_7$. The rule we can draw: when a CNN is used as a feature extractor without task-specific fine-tuning, the convolutional layers learn basic, generalizable features, while the fully connected layers learn representations that are more specific to the task.

Question 2: why classify with SVMs instead of the CNN's softmax layer?

From the labeling process above we know that the positive samples in the training set are far fewer than the negatives, and the positive-sample condition (IoU greater than 0.5) is relaxed. The performance of a CNN depends on the amount of training data, so this small amount of loosely labeled data is not enough for the softmax classifier, whereas SVMs suit small sample sets better. The author therefore trains additional SVM classifiers, which outperform softmax here. (Note: experiments show the SVM's IoU threshold works best at 0.3; setting it to 0 drops accuracy by about 4 percentage points, and setting it to 0.5 drops it by about 5 points.)
