This paper gives me a real love-hate feeling. It is probably one of the few papers I have read most carefully, and one I have gone back to several times. My first impression of the deformable conv idea was very positive, and doing it end-to-end makes a lot of sense. Still, I always felt something was off that I could not articulate; recently I worked out a few of those points, so I will sketch them here first, and once I get their code running and test a few ideas of my own I will come back for more. Also, since I work on semantic segmentation, I will discuss the paper mainly from that angle.
Reading this paper cold, you may feel that Jifeng Dai's work is brilliant, but to me it is a bit of a gimmick (entirely my subjective take): deformable conv is essentially a combination of two earlier works, STN and DFF. The former supplies the bilinear sampling idea and its concrete backprop; the latter supplies the warping idea and machinery. That is still not quite precise, so for now my understanding is this: deformable conv replaces the optical flow in Deep Feature Flow with a learned offset. Below I split my comments into highlights and criticisms.
I. Highlights
To be honest there are quite a few highlights. The first is that it fixes the practicality problem of STN (Spatial Transformer Network). STN applies one transform to the whole feature map, for example by learning a single global affine matrix. That is perfectly reasonable on MNIST, but in the real world such a global action is both unreasonable and inadequate, because complex scenes contain many objects and a lot of background. So how does deformable conv handle this instead?
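To make the contrast concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) of the single global transform STN applies: one matrix A and one shift b warp every position of the feature map in the same way, which is exactly what makes it too coarse for cluttered scenes.

```python
import numpy as np

def stn_global_warp(F, A, b):
    """Warp feature map F (H, W) with ONE global affine map: src = A @ dst + b.
    Nearest-neighbour rounding for brevity; STN itself uses differentiable
    bilinear sampling. Positions mapped outside F read as zero."""
    H, W = F.shape
    out = np.zeros_like(F)
    for i in range(H):
        for j in range(W):
            si, sj = A @ np.array([i, j], dtype=float) + b
            si, sj = int(round(si)), int(round(sj))
            if 0 <= si < H and 0 <= sj < W:
                out[i, j] = F[si, sj]
    return out

F = np.arange(9.0).reshape(3, 3)
# Identity A plus a shift of +1 column: the whole map moves one column left.
shifted = stn_global_warp(F, np.eye(2), np.array([0.0, 1.0]))
```

Note that A and b are shared by every output position; deformable conv instead gives every position its own displacements.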
First, I want to clear up an important misunderstanding. Many people think that what deformable conv learns is a deformable kernel: for example, a connected 3*3 kernel that ends up with an offset baked into each kernel position. That is not the case. The authors do not learn an offset for the kernel; they learn an offset for every position of the feature map. Step by step: start from a feature map F; run a 3*3 convolution with 18 output channels on it to get an offset map F_offset (channel=18, same spatial size as F); then the deformable conv takes both F and F_offset as input. Each value of the output is computed from a 3*3 kernel window on F, and each of the 9 sampled points of that window has two offsets, one in x and one in y, giving 3*3*2=18 values, which are exactly the values of F_offset at that position. This may sound convoluted, but the key point is: the learned offset map has channel=18 and the same spatial size as the original feature map, and at each position it holds the per-tap offsets of the 3*3 kernel used by the main-branch deformable conv at that position.
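To pin down the shapes, here is a hypothetical single-channel NumPy sketch of this sampling scheme (not the paper's implementation; the channel ordering of the offsets is my assumption): the offset map has 18 channels, and entries (2k, i, j) and (2k+1, i, j) displace the k-th of the 9 kernel taps used to compute output position (i, j).

```python
import numpy as np

def bilinear_sample(F, y, x):
    """Bilinearly sample single-channel map F (H, W) at fractional (y, x);
    positions outside the map read as zero."""
    H, W = F.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < H and 0 <= xx < W:
                val += (1 - abs(y - yy)) * (1 - abs(x - xx)) * F[yy, xx]
    return val

def deformable_conv3x3(F, offsets, weight):
    """F: (H, W) feature map.  offsets: (18, H, W) -- a (dy, dx) pair per
    kernel tap, per output position.  weight: (3, 3) kernel.  Output: (H, W)."""
    H, W = F.shape
    taps = [(a, b) for a in (-1, 0, 1) for b in (-1, 0, 1)]  # the 9 kernel taps
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for k, (ky, kx) in enumerate(taps):
                dy, dx = offsets[2 * k, i, j], offsets[2 * k + 1, i, j]
                # Each tap samples F at its regular grid location plus its offset.
                acc += weight[ky + 1, kx + 1] * bilinear_sample(F, i + ky + dy, j + kx + dx)
            out[i, j] = acc
    return out

F = np.arange(25.0).reshape(5, 5)
identity_w = np.zeros((3, 3))
identity_w[1, 1] = 1.0
# Zero offsets + centre-tap-only kernel: reduces to sampling F itself.
out = deformable_conv3x3(F, np.zeros((18, 5, 5)), identity_w)
```

With all offsets set to zero this degenerates to an ordinary (zero-padded) 3*3 convolution, which is the sanity check the offset branch is initialized to in practice.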
Someone made a remark I particularly agree with: this replaces the "learn a weight" approach with a bilinear approach, i.e. it replaces weighting with sampling. That line of thinking can be pushed much further, and it is, in my view, the best part of this paper.
II. Criticisms
This is actually the main point of today's post. I am quite skeptical that the offsets can really be learned, despite the experimental results and visualizations in the paper. Let me make two points.
1. From the perspective of what features are needed: semantic segmentation demands different features than detection, a problem Jifeng Dai and Kaiming He's R-FCN already discusses. Features for semantic segmentation should not be too translation/rotation invariant; that is, a shift or rotation of an object should affect the result, because segmentation cares about position. I want to revisit the discussion of this in R-FCN when I have time. For this reason, I am doubtful that running the features through one extra convolution layer is enough to learn good offsets.
2. The suspicion above is admittedly a bit hand-wavy, so here is a slightly more concrete one. Bilinear sampling is a piecewise linear function. So logically, during backprop, if you want the loss to go down, your step must not be so large that it leaves the current linear piece: the gradient is computed from the current cell of four grid points, and if an update jumps to a different four points, that gradient is strictly speaking wrong, and the loss is not guaranteed to decrease. Conversely, if you never jump to another four points, the offset stays confined to the current cell and is meaningless. That said, the whole feature map is still fairly smooth, which is a property of natural images, so one can still believe that as long as the learning rate is not too large, the loss will fall.
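The piecewise-linear point is easy to check numerically. A 1-D sketch (my own illustration) of linear interpolation shows that the gradient with respect to the sampling position is constant inside a cell and jumps at cell boundaries, which is exactly why a gradient step that crosses a boundary is unjustified:

```python
import numpy as np

def lerp_1d(f, x):
    """Linearly interpolate a 1-D signal f at fractional position x
    (the 1-D analogue of bilinear sampling)."""
    x0 = int(np.floor(x))
    x1 = min(x0 + 1, len(f) - 1)
    t = x - x0
    return (1 - t) * f[x0] + t * f[x1]

# Slopes per cell: 1 on [0,1], 4 on [1,2], 1 on [2,3].
f = np.array([0.0, 1.0, 5.0, 6.0])

def num_grad(x, eps=1e-5):
    """Numerical derivative of the interpolant w.r.t. the sampling position."""
    return (lerp_1d(f, x + eps) - lerp_1d(f, x - eps)) / (2 * eps)

g_a = num_grad(1.3)  # inside cell [1,2]: slope 4
g_b = num_grad(1.7)  # same cell: same gradient
g_c = num_grad(2.5)  # neighbouring cell [2,3]: slope 1
```

So the gradient seen at 1.3 says nothing about what happens once the sample crosses x=2; only the smoothness of real feature maps keeps small steps honest.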
III. Summary
Overall this is very meaningful work. In my opinion, any work that inspires people and makes them think is meaningful, regardless of how it works internally or how it scores on benchmarks.
There are a few more things I want to discuss once my experiments are done.
Paper discussion && thoughts: "Deformable Convolutional Networks"