Minimalist notes: DeepID-Net: Object Detection with Deformable Part Based Convolutional Neural Networks
Paper address: http://www.ee.cuhk.edu.hk/~xgwang/papers/ouyangZWpami16.pdf
This is a 2017 TPAMI paper from Xiaogang Wang's group at CUHK. It first appeared at CVPR 2015 and was extended with additional experiments for the journal version, so the comparison baselines are early models such as AlexNet and GoogLeNet; Faster R-CNN had not yet appeared. I chose this article because I wanted to see how the deformable part model (DPM) can be combined with a CNN.
Core contributions of the article: 1. a new object detection network architecture; 2. a modified pretraining setup that improves performance; 3. a def-pooling layer that replaces the max-pooling layer, combining DPM with CNN. See the figure for the pipeline.
The author argues that it is hard to classify an object from the detection box alone. A small volleyball, for example, may be confused with the texture of the swimming cap a swimmer wears. Global information from the whole image is needed here: once we see that the volleyball is on a volleyball court and the swimming cap appears in a pool, detection and classification become more accurate and are not misled by local texture.
Most detection networks today are pretrained on a classification task. The article argues that the two tasks differ substantially: classification should be insensitive to location and scale, while detection is sensitive to both, so the pretrained weights cannot be transplanted mechanically. The article therefore pretrains on the 1000-class ImageNet CLS-LOC data and then fine-tunes on the 200-class detection dataset, which yields better results.

The article holds that each channel of a CNN's intermediate layers is effectively a response map for one part of an object, which closely resembles the HOG+DPM pipeline. The author therefore brings the DPM idea into the CNN and proposes the def-pooling layer to carry out the DPM computation. Let $M_c$ denote the feature map of channel $c$; its pixel at position $(i,j)$ is $m_c^{(i,j)}$, and the response at $(x,y)$ is $m_c^{(x,y)}$. The anchor center coordinate is $(x,y)$, a pixel on the anchor is offset by $(\delta_x,\delta_y)$, and the absolute coordinate of the offset pixel is $z_{\delta_x,\delta_y} = (x,y)^T + (\delta_x,\delta_y)^T$. $\Phi(\delta_x,\delta_y) =$
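To make the def-pooling idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function name `def_pooling`, the single-channel restriction, and the square offset window are my own simplifications, and the deformation penalty is supplied as precomputed tables `d` (one per deformation basis, weighted by `a`), following the general form $b^{(x,y)} = \max_{(\delta_x,\delta_y)} \big( m_c^{z_{\delta_x,\delta_y}} - \sum_n a_n d_n^{(\delta_x,\delta_y)} \big)$.

```python
import numpy as np

def def_pooling(m_c, a, d, window=1):
    """Def-pooling over a single channel (simplified sketch).

    m_c    : (H, W) response map of channel c
    a      : list of N deformation weights a_n
    d      : list of N penalty tables, each (2*window+1, 2*window+1),
             indexed by the offset (dx, dy) relative to the anchor
    window : offsets (dx, dy) range over [-window, window] on each axis

    For each anchor (x, y):
        b[x, y] = max_{dx, dy} ( m_c[x+dx, y+dy] - sum_n a[n] * d[n][dx, dy] )
    """
    H, W = m_c.shape
    out = np.full((H, W), -np.inf)
    for x in range(H):
        for y in range(W):
            for dx in range(-window, window + 1):
                for dy in range(-window, window + 1):
                    zx, zy = x + dx, y + dy          # absolute coordinate z
                    if not (0 <= zx < H and 0 <= zy < W):
                        continue                      # offset falls outside the map
                    penalty = sum(an * dn[dx + window, dy + window]
                                  for an, dn in zip(a, d))
                    out[x, y] = max(out[x, y], m_c[zx, zy] - penalty)
    return out
```

With all deformation weights set to zero this degenerates to ordinary max-pooling over a $(2\cdot\text{window}+1)^2$ neighborhood; with a large weight on a quadratic distance penalty, the output approaches $m_c$ itself, since any displacement from the anchor is punished, which is exactly the DPM intuition of a part that prefers its anchor position.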