Objective
I originally wanted to follow my usual convention and write an overview, but the article I found already gives a very good and detailed introduction, so I copy it over here, placing my own general summary of the paper in front. I do not repeat the details, since the cited article covers them very thoroughly.
Paper Overview (cited article)
The following is taken from: http://lowrank.science/SNIP/
This log records some notes on the following CVPR 2018 Oral paper.
Singh B, Davis L S. An Analysis of Scale Invariance in Object Detection – SNIP[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3578-3587.
Paper Link: https://arxiv.org/abs/1711.08189
Code Link: https://github.com/bharatsingh430/snip
Argument
At the beginning of the paper, the authors state a fact: image classification has already reached super-human performance, while object detection is still far from it. So they ask: why is object detection so much harder than image classification?
The explanation the authors give is large scale variation across object instances. This scale variation exists not only within the target dataset itself, but also between the pre-trained dataset and the target dataset.
- For the extreme scale variation within the target dataset itself, the authors give a diagram. The ordinate CDF is the cumulative distribution function; the relative scale is how much of the image's width or height the object occupies. As the figure shows, most of COCO's objects are concentrated below relative scale 0.1, i.e. with area less than 1% of the image. There are actually two problems here:
  - One is that the objects themselves are very small: how to build better features for small objects, i.e. how to make the CNN itself able to detect small objects.
  - The other is that, because COCO objects are mostly small, and because small scales make ratios explode (scales of 0.0001 and 0.1 differ in size by a factor of 1000), while half of the objects sit below 0.1, there are in fact many objects at extremely small scales that cannot be ignored. That is, a large number of non-negligible very small objects makes the scale variation on the detection dataset very large. So on COCO, good performance requires the CNN to classify objects well across very small and very large scales (whose ratio is huge, e.g. 0.0001 vs 0.9), i.e. to be robust to extreme scale variation: to be scale-invariant.
- For the scale variation between the pre-trained dataset and the target dataset, the authors use the term domain shift. ImageNet is built for image classification, where objects are generally large in scale, while object detection datasets have widely varying object scales. Features pre-trained on a large-object dataset like ImageNet, applied directly to detecting the small objects in object detection, can be expected not to work very well; that is the consequence of domain shift.
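The relative-scale statistic above can be computed directly from detection annotations. Here is a minimal sketch; the data and field names are illustrative placeholders, not the actual COCO annotation format:

```python
# Relative scale of an object: max(box_w / img_w, box_h / img_h),
# i.e. the fraction of the image's width or height it occupies.
# Illustrative data; real numbers would come from COCO annotations.
annotations = [
    {"box_w": 30, "box_h": 20, "img_w": 640, "img_h": 480},
    {"box_w": 400, "box_h": 300, "img_w": 640, "img_h": 480},
    {"box_w": 16, "box_h": 16, "img_w": 640, "img_h": 480},
]

def relative_scale(a):
    return max(a["box_w"] / a["img_w"], a["box_h"] / a["img_h"])

scales = sorted(relative_scale(a) for a in annotations)

def cdf(threshold):
    """Fraction of objects with relative scale <= threshold."""
    return sum(s <= threshold for s in scales) / len(scales)

print(cdf(0.1))  # fraction of "small" objects (relative scale <= 0.1)
```

On a real dataset, plotting `cdf` over a grid of thresholds reproduces the kind of curve the paper shows for COCO.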
In the final analysis, the reason object detection still lags is that a large number of very small objects exist in the data, and detecting small objects is hard because:
- Small objects, precisely because they are small, produce very large internal scale ratios (the denominator is tiny, so the ratio blows up), so the detector needs strong scale invariance, yet CNNs are not scale-invariant by design;
- Small objects are small, so features pre-trained on datasets of median-scale objects such as ImageNet do not transfer well to detecting small objects, because of domain shift;
- There is a contradiction between the coarse representation a CNN produces when extracting semantic features and the fine resolution needed to detect small objects: small objects, being small, are hard to express well in a coarse representation and may simply be ignored.
  - The deeper layers of modern CNNs have large strides (in pixels), which leads to a very coarse representation of the input image and makes small object detection very challenging.
  - So essentially, because the strides are too large, the representation of the original image is very coarse, and in such a coarse representation small objects are easily overlooked.
  - In fact, the same problem exists in semantic segmentation: we want both fine resolution and rich semantics, which is why some of the improvements here follow the same pattern as methods in semantic segmentation.
To alleviate the problems caused by scale variation and by small object instances themselves, there are roughly the following lines of work:
- Features from the layers near the input, referred to as shallow(er) layers, are combined with deeper layers for detecting small object instances [23, 34, 1, 13, 27].
  - Typical representatives are FPN and SSD.
  - This route deals with difficulty 3 above, coarse representation vs. fine resolution.
  - But the authors point out that the high-level semantic features (at conv5) generated even by feature pyramid networks would not be useful for classifying small objects: high-level features are useless when the object is small. This is reasonable, because the approach does not touch difficulties 1 and 2, so of course it cannot fix everything.
- Dilated/deformable convolutions are used to increase receptive fields for detecting large objects [32, 7, 37, 8].
  - This route also deals with difficulty 3, aiming to end up with a fine-resolution representation.
- Independent predictions at layers of different resolutions are used to capture object instances of different scales [36, 3, 22].
  - This route also deals with difficulty 3: prediction can be done at the appropriate level of feature abstraction (resolution).
- Context is employed for disambiguation [38, 39, 10].
  - I do not know this line of work; I need to read the papers.
- Training is performed over a range of scales [7, 8], or inference is performed on multiple scales of an image pyramid.
  - For small objects this route essentially means up-sampling: very brute-force but also very effective. It deals with both difficulty 1 (scale variation) and difficulty 3 (objects too small to survive a coarse representation). Of course this approach has its own problems, which are discussed later.
- Predictions are combined using non-maximum suppression [7, 8, 2, 33].
In short, to detect small objects we either solve the problem, i.e. build really good features for small objects, or eliminate the problem itself, i.e. remove small objects by up-sampling them all into large ones. Given that CNN features are not scale-invariant for small objects, up-sampling seems the more viable option. But there are many questions to figure out:
- Does up-sampling really work?
- How exactly should the up-sampling be done?
- What should be up-sampled? Only training images, only test images, or both?
- If everything is up-sampled, how should the scales be used? Traverse all scales? Or fix a scale, so as to be consistent with the scale of the pre-trained dataset?
Corresponding to the original text, the authors ask the following two questions:
- Is it critical to upsample images for obtaining good performance for object detection? Even though the typical size of images in detection datasets is 480x640, why is it a common practice to up-sample them to 800x1200? Can we pre-train CNNs with smaller strides on low resolution images from ImageNet and then fine-tune them on detection datasets for detecting small object instances?
- When fine-tuning an object detector from a pre-trained image classification model, should the resolution of the training object instances be restricted to a tight range (from 64x64 to 256x256) after appropriately re-scaling the input images, or should all object resolutions (from 16x16 to 800x1000, in the case of COCO) participate in training after up-sampling input images?
In the paper, the authors answer these questions in turn:
- First, up-sampling matters for small object detection, which is why it is common practice on detection datasets to up-sample 480x640 images to 800x1200.
- Pre-training CNNs with smaller strides on low resolution images from ImageNet and then fine-tuning them on detection datasets for detecting small object instances is feasible, and is what this paper advocates; however, both fine-tuning and testing are done on the image pyramid proposed in this paper.
- To eliminate domain shift, fine-tuning should restrict the training object instances to a tight size range (from 64x64 to 256x256) so that object scales stay consistent with the pre-trained dataset. This works best; it is not the case that all object resolutions (from 16x16 to 800x1000, in the case of COCO) should participate in training.
Therefore, in summary, the contribution, or argument, of this paper is to advocate using an image pyramid when training the detector, but letting only objects within a fixed scale range participate in training. The authors call this training method Scale Normalization for Image Pyramids (SNIP). This paper is essentially a discussion of how to use an image pyramid, so the subsequent experiments compare different usage patterns.
The most typical are the following two:
- Scale-specific detectors: variation in scale is handled by training separate detectors, one for each scale range.
  - Each detector is responsible for objects in one scale range.
  - Here, presumably, no image pyramid is built over the dataset, so for each scale the sample count shrinks: each detector is trained on fewer samples, and not all samples are put to use.
- Scale-invariant detector: training a single object detector with all training samples.
  - Although this is called a scale-invariant detector, that is really just wishful thinking: CNNs have no scale invariance by design. Even if such a detector eventually shows some ability to detect multi-scale objects, this is only an "illusion". The CNN is using its powerful fitting ability to force-memorize objects at different scales, "which actually wastes a lot of capacity" [1]; that is, capacity is not spent where it should be.
So there is a trade-off here: a scale-specific detector does not use all samples, which may lead to poor performance; a scale-invariant detector wastes a lot of capacity force-memorizing objects at different scales rather than learning semantic information, which can also lead to poor performance. The best outcome, of course, is not having to choose: use all samples, and waste no capacity on force-memorizing scales. In fact, this can be done.
The SNIP of this paper uses an image pyramid so that every object gets a version whose scale matches the pre-trained ImageNet dataset, and only the samples that, after the image pyramid, match the pre-trained ImageNet scale participate in training. This both ensures that all samples are used and ensures that capacity is spent on learning semantic information.
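The selection rule just described can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: assume each pyramid level has a valid object-size range (in pixels, after resizing), and a ground-truth box contributes to training only at levels where its resized size falls inside that range. The ranges below are made up for the example; the paper's actual ranges differ.

```python
import math

# Hypothetical valid object-size ranges (pixels, after resizing) for each
# pyramid level, keyed by the level's shorter image side. Illustrative only.
VALID_RANGES = {
    480: (0, 80),           # low-res level: only objects that resize small
    800: (40, 160),         # medium level
    1400: (120, math.inf),  # high-res level: only objects that resize large
}

def is_valid(box_size, img_short_side, pyramid_short_side):
    """Should this box participate in training at this pyramid level?

    box_size: sqrt(area) of the box in the original image (pixels).
    img_short_side: shorter side of the original image.
    pyramid_short_side: shorter side after resizing for this level.
    """
    resized = box_size * pyramid_short_side / img_short_side
    lo, hi = VALID_RANGES[pyramid_short_side]
    return lo <= resized < hi

# A 30 px object in a 480 px image resizes to 30, 50, and 87.5 px at the
# three levels, so it is valid at 480 and 800 but not at 1400.
valid_levels = [s for s in VALID_RANGES if is_valid(30, 480, s)]
```

With overlapping ranges like these, every object is valid at some level, so all samples are used, yet each level only ever sees objects near the pre-training scale.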
Demonstration
The authors arrange demonstration experiments in two places: "3. Image classification at multiple scales" and "5. Data Variation or Correct Scale?".
Fine-tuning: whether or not?
The section "3. Image classification at multiple scales" studies the impact of domain shift, but the authors also answer another question: given that domain shift has an impact, why not abandon fine-tuning, i.e. skip initializing from pre-trained weights, and train from scratch directly on the target object detection dataset?
The authors arrange three demonstration experiments and ultimately show that even in the presence of domain shift, pre-training + fine-tuning should still be used. This answers the first question the authors raised:
Can we pre-train CNNs with smaller strides on low resolution images from ImageNet and then fine-tune them on detection datasets for detecting small object instances?
The answer is yes, we can.
In addition, domain shift does not exist only between the pre-trained dataset and the target dataset. At test time, to detect small objects, we usually build an image pyramid, shrinking and enlarging the image; the object scales in the test pyramid will then also be inconsistent with the object scales seen during training. So this is a reminder: the effect of domain shift must be considered whenever an image pyramid is used.
Therefore, there is a domain shift between the pre-trained data and the training data, and another between the training data and the test data. Both are called domain shift, but they differ slightly: the shift between pre-trained data and training data is caused by the scale distribution of objects at their original resolutions, while the shift between training data and test data is caused by using an image pyramid at test time.
Naive Multi-scale Inference
- This experiment takes the pre-trained weights obtained on the full-resolution dataset and applies them directly to the target data without fine-tuning.
- For detection, the lesson of this experiment is the effect of the domain shift between training data and test data caused by the image pyramid.
- The experiment trains on original-size ImageNet images and then tests on images that were down-sampled and then up-sampled back;
- Down-sampling the original image yields a low-resolution image, and up-sampling that low-resolution image back to the training image size simulates the up-sampling behavior inside a pyramid. Since detection ultimately classifies region-proposal crops, this experiment examines how a resolution gap between training set and test set affects classification, which in turn also explains the effect of the same gap in detection.
- Resolution here refers to the sharpness of the image.
- The conclusion: the larger the resolution gap between training set and test set, the worse the result, so the resolutions of the training and test sets should be kept consistent.
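The down-then-up-sampling used to build these test images can be made concrete with a toy nearest-neighbor resize. A real experiment would use proper interpolation from an image library; this is only a sketch of the procedure:

```python
def resize_nn(img, new_h, new_w):
    """Nearest-neighbor resize of a 2D grayscale image (list of lists)."""
    h, w = len(img), len(img[0])
    return [[img[i * h // new_h][j * w // new_w] for j in range(new_w)]
            for i in range(new_h)]

def degrade(img, factor):
    """Down-sample by `factor`, then up-sample back to the original size.

    The result has the original pixel count but only the information
    content of the low-resolution version: the test condition studied here.
    """
    h, w = len(img), len(img[0])
    small = resize_nn(img, h // factor, w // factor)
    return resize_nn(small, h, w)

# A 4x4 "image": after degrade(..., 2), detail inside each 2x2 block is lost.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
blurred = degrade(img, 2)
```

Feeding such degraded images to a network trained on sharp ones reproduces exactly the train/test resolution mismatch whose effect the experiment measures.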
It does not work well to up-scale small objects only at test time; the enlarged small objects should genuinely participate in training.
Resolution specific classifiers
- This experiment trains from scratch directly on the low-resolution target dataset, without pre-training.
- The network in Naive Multi-scale Inference is the one applied to the full-resolution data, and it is relatively complex; it is called CNN-B, where B presumably stands for base, i.e. the baseline network. It simulates how a network trained at full resolution performs when tested on low-resolution images.
- The resolution-specific classifier is trained on low-resolution data and tested on low-resolution data, but to make the network applicable to low-resolution images a simplified network is used, hence CNN-S. Here, although the training and test resolutions are consistent, the network is simple and its capacity weak, which also causes poor predictions.
- Now compare the two: is the poor prediction caused by simplifying the network, or by the domain shift from inconsistent training and test resolutions? The experimental results show CNN-S is far better than CNN-B; note that the premise is sufficient data.
- From this we can conclude that, with sufficient data, domain shift causes significant performance damage. In other words, CNNs cannot learn scale invariance: even when an image pyramid is used at test time, performance is very poor at object scales the CNN never saw during training. This again shows how important it is to keep training data and test data on the same scale.
Fine-tuning high-resolution classifiers
- This experiment pre-trains on the full-resolution pre-trained dataset and then fine-tunes on the low-resolution target dataset. Of course, to enter the network, the low-resolution images are first up-sampled.
- Because this is fine-tuning on top of CNN-B, it is called CNN-B-FT.
- CNN-B-FT is significantly better than CNN-S, which shows that rather than fitting low-resolution data with a simple, low-capacity network, it is better to pre-train on the full-resolution dataset and fine-tune on the low-resolution dataset.
- This is actually quite reasonable: compared with learning from scratch with randomly initialized weights, pre-trained weights at least provide a suitable initialization. Note, however, that when fine-tuning, the target dataset is up-scaled to the same size as the pre-trained dataset. This should be to ensure that object sizes are consistent between the pre-trained dataset and the target dataset.
Fine-tuning: how?
Training on 800 x 1200, test on 1400 x 2000
- This simulates the strategy in which only inference is performed on multiple scales of an image pyramid: training on the 800 x 1200 images and then testing on the 1400 x 2000 images, a strategy often used to detect small objects.
- This is the benchmark that the later settings are compared against; it is called 800-all.
Training on 1400 x 2000, test on 1400 x 2000
- This up-samples the small objects, with training and test at the same scale, yet the final result is only marginally better than 800-all, negligibly so.
- The authors' explanation: up-sampling blows up the medium-to-large objects, which degrades performance; medium-size objects become too big to be correctly classified!
- My own understanding: up-sampling reduces the domain shift between the target dataset and the pre-trained dataset for small objects, but it adds a domain shift between them for medium-size objects. A large number of medium objects become oversized targets, inconsistent in scale with most of the objects in pre-trained datasets such as ImageNet.
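This reading can be checked with a bit of arithmetic. A toy calculation with illustrative numbers (not taken from the paper): as the training resolution grows, a small object approaches ImageNet-like sizes, but a medium object balloons far past them.

```python
def resized(object_px, img_short_side, train_short_side):
    """Object size in pixels after resizing the image's shorter side."""
    return object_px * train_short_side / img_short_side

# Illustrative numbers: a small (20 px) and a medium (300 px) object
# in an image whose shorter side is 480 px.
for train_side in (800, 1400):
    small = resized(20, 480, train_side)
    medium = resized(300, 480, train_side)
    print(train_side, round(small, 1), round(medium, 1))
# Going from 800 to 1400 helps the small object (33.3 -> 58.3 px) but
# blows the medium object up from 500 px to 875 px.
```

So the same up-sampling that fixes the small-object domain shift creates a new one for medium objects, which is why 1400-all barely beats 800-all.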
Scale specific Detectors
- To remove the scale variation, so that the CNN does not burn its capacity on memorizing but instead learns semantics, the authors train only on small objects within a certain range, i.e. they reduce the variation, but the amount of training data also decreases.
- Experimental results show this is worse than 800-all, because removing the medium-to-large objects is not conducive to the CNN learning semantics. That is, removing the samples at some scales hurts semantic learning, just as stuffing the CNN with samples at every scale for it to force-memorize also hurts semantic learning.
Training only on small objects works poorly because the data is insufficient; large objects are in fact very helpful for semantic information. Using only part of the data is worse than using all of it, even though using all of it is not particularly good either.
Multi-scale Training (MST)
- Use an image pyramid to generate multiple resolutions, then use one CNN to fit objects at all of these resolutions; the final result is similar to 800-all.
- The CNN has not learned scale invariance. Forcing it to memorize objects of different sizes damages its ability to learn semantics, so although the extra data brought by the image pyramid gives a small gain, that gain is offset by the loss of capacity for learning semantics.
- What we want, then, is an ideal detector that can make use of all the samples, but where every sample fed to it is at the right scale, allowing the CNN to put its power into learning semantic information.
So do the features learned by a DNN really possess rotation invariance or scale invariance? Or is that an illusion created by piling up data, achieved by different neurons rote-memorizing through sheer capacity?
Conclusion
- For scale variation there are two ways of thinking: one is to increase the model's ability to learn under scale variation so that it can handle it; the other is to reduce the scale variation the model faces in the data, which simplifies the task. The authors adopt the latter, which you could call simple and crude, or call treating the symptom rather than the cause.
- If you want to give a CNN scale invariance, you still need to consider what kind of structure encodes that invariance by design, and how to extract or learn that invariance from data.
- Besides scale invariance, CNNs also cannot learn rotation invariance. If rotation invariance matters for your target dataset, consider taking the same kind of action as this paper does.
Feelings
I like this article very much; it gives those of us doing applied work a clear paradigm for applied research: carefully analyze the reasons behind an existing problem, and then find the means to solve it, instead of stacking fancy, fashionable tricks. It is a good example for me to learn from.
Reference
Here are some references used while writing this note; they were a great help in understanding SNIP.
[1] CVPR18 detection paper selections (part 2)
[2] Object detection paper reading: An Analysis of Scale Invariance in Object Detection – SNIP
[3] Object detection: SNIPER, Efficient Multi-Scale Training, paper notes