Reprinted from: http://blog.csdn.net/cv_family_z/article/details/52438372
https://www.arxiv.org/abs/1608.08021
This paper tackles the object detection problem by combining several current technical advances, and achieves strong results.
We obtained solid results on well-known object detection benchmarks: 81.8% mAP (mean average precision) on VOC2007 and 82.5% mAP on VOC2012 (2nd place), while taking only 750ms/image on an Intel i7-6700K CPU with a single core and 46ms/image on an NVIDIA Titan X GPU. Theoretically, our network requires only 12.3% of the computational cost of ResNet-101, the winner on VOC2012.
The overall detection framework is: CNN feature extraction + region proposal + RoI classification.
We mainly optimize feature extraction, since the region proposal part is lightweight and takes little time, and the classification part can be effectively compressed with truncated SVD (see the sketch below). Our design principle is: fewer feature types, fewer channels, more layers. The network is built from concatenated ReLU (C.ReLU), Inception, and HyperNet components, and is trained with batch normalization, residual connections, and learning rate scheduling based on plateau detection.
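As a rough illustration of the SVD compression mentioned above, here is a minimal PyTorch sketch (PyTorch and the helper name are our assumptions; the paper follows Fast R-CNN in compressing the fully connected classification layers with truncated SVD):

```python
import torch
import torch.nn as nn

def compress_fc_with_svd(fc: nn.Linear, rank: int) -> nn.Sequential:
    # Approximate W (out x in) with U[:, :rank] @ diag(S[:rank]) @ Vt[:rank, :],
    # replacing one large FC layer with two smaller ones.
    U, S, Vt = torch.linalg.svd(fc.weight.data, full_matrices=False)
    first = nn.Linear(fc.in_features, rank, bias=False)
    first.weight.data = torch.diag(S[:rank]) @ Vt[:rank, :]   # (rank, in)
    second = nn.Linear(rank, fc.out_features, bias=fc.bias is not None)
    second.weight.data = U[:, :rank]                          # (out, rank)
    if fc.bias is not None:
        second.bias.data = fc.bias.data
    return nn.Sequential(first, second)
```

With a small rank, this replaces an out x in multiplication with rank x (in + out), which is where the classification-time savings come from.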
2 Details on Network
2.1 C.ReLU: Earlier Building Blocks in Feature Generation
C.ReLU is mainly used in the first several convolution layers. It reduces the number of output channels by half, then doubles them again by simply concatenating the same outputs with their negation, which leads to a 2x speed-up of the early stage without losing accuracy.
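A minimal sketch of a C.ReLU block, assuming a PyTorch implementation (the per-channel scale/shift before ReLU follows the paper's description; the exact parameters are assumptions):

```python
import torch
import torch.nn as nn

class CReLU(nn.Module):
    """C.ReLU: convolve with half the target channels, then double them
    by concatenating the output with its negation before the ReLU."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        assert out_ch % 2 == 0
        self.conv = nn.Conv2d(in_ch, out_ch // 2, kernel_size,
                              stride=stride, padding=padding, bias=False)
        # BN's affine parameters play the paper's scale/shift role,
        # so the two halves need not remain exact mirrors after ReLU.
        self.scale_shift = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = torch.cat([x, -x], dim=1)  # double the channels by negation
        return self.relu(self.scale_shift(x))
```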
2.2 Inception: Remaining Building Blocks in Feature Generation
Inception handles both small and large objects well, mainly by mixing convolution kernels of different sizes within one block: 1x1 kernels preserve fine detail for small objects, while stacked 3x3 kernels enlarge the receptive field for large ones.
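A simplified Inception block in this spirit, sketched in PyTorch; the branch widths are illustrative assumptions, not the paper's exact channel counts:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel branches with different receptive fields, concatenated."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)            # 1x1: fine detail
        self.branch3 = nn.Sequential(                     # 1x1 -> 3x3
            nn.Conv2d(in_ch, 24, 1), nn.ReLU(inplace=True),
            nn.Conv2d(24, 48, 3, padding=1))
        self.branch5 = nn.Sequential(                     # two 3x3 ~ one 5x5
            nn.Conv2d(in_ch, 12, 1), nn.ReLU(inplace=True),
            nn.Conv2d(12, 24, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(24, 24, 3, padding=1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x)], dim=1)
```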
2.3 HyperNet: Concatenation of Multi-scale Intermediate Outputs
The idea is to concatenate convolutional feature maps from different scales, making multi-scale object detection possible (a concrete sketch appears in Section 3).
2.4 Deep Network Training
Here we add residual connections between the Inception layers, place a batch normalization layer before every ReLU activation, and control the learning rate dynamically based on plateau detection: whenever the loss stops improving, the learning rate is decreased.
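A minimal sketch of plateau-based learning rate control using PyTorch's ReduceLROnPlateau (the factor, patience, placeholder network, and training helper are our assumptions, not the paper's exact policy):

```python
import torch
import torch.nn as nn

def train_one_epoch(model, optimizer):
    # Stand-in for a real training loop; returns the epoch's mean loss.
    return float(torch.rand(1))

model = nn.Conv2d(3, 16, 3)  # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# Cut the learning rate whenever the monitored loss plateaus,
# approximating the paper's plateau-detection policy.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)

for epoch in range(20):
    epoch_loss = train_one_epoch(model, optimizer)
    scheduler.step(epoch_loss)
```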
3 Faster R-CNN with Our Feature Extraction Network
We combine three intermediate outputs, from conv3_4 (down-scaled), conv4_4, and conv5_4 (up-scaled), into a 512-channel multi-scale feature map, which serves as the input to the Faster R-CNN region proposal and RoI classification networks.
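A rough PyTorch sketch of this multi-scale combination (the input channel counts, bilinear interpolation, and the final 1x1 convolution are our assumptions; the paper down-scales by pooling and up-scales with channel-wise deconvolution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeature(nn.Module):
    """Merge conv3_4 (down-scaled), conv4_4, and conv5_4 (up-scaled)
    into one 512-channel feature map at conv4_4's resolution."""
    def __init__(self, c3=128, c4=256, c5=384, out_ch=512):
        super().__init__()
        self.reduce = nn.Conv2d(c3 + c4 + c5, out_ch, kernel_size=1)

    def forward(self, f3, f4, f5):
        # Down-scale conv3_4 to conv4_4's resolution by pooling.
        f3 = F.max_pool2d(f3, kernel_size=2, stride=2)
        # Up-scale conv5_4 to the same resolution by interpolation.
        f5 = F.interpolate(f5, size=f4.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.reduce(torch.cat([f3, f4, f5], dim=1))
```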
4 Experimental Results