Ren, Shaoqing, et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." Advances in Neural Information Processing Systems. 2015.
After R-CNN [1] and Fast R-CNN [2], this article is another masterpiece from the team of Ross Girshick, a leading figure in the object detection community in 2015. With the simple network, detection runs at 17 fps with 59.9% accuracy on PASCAL VOC; with the complex network, it reaches 5 fps with 78.8% accuracy.
The authors provide source code in MATLAB and Python on GitHub. Readers unfamiliar with the region-based CNN algorithms should first read the two articles "RCNN algorithm detailed" and "Fast RCNN algorithm detailed".
Core Idea
From R-CNN to Fast R-CNN, and now to Faster R-CNN in this paper, the four basic steps of object detection (candidate region generation, feature extraction, classification, and position refinement) are finally unified into a single deep network framework. No computation is duplicated, and everything runs on the GPU, greatly improving speed.
Faster R-CNN can be viewed simply as a "region proposal network + Fast R-CNN" system, in which the selective search method used by Fast R-CNN is replaced by a region proposal network. This article focuses on three problems that such a system must solve:
1. How to design the region proposal network
2. How to train the region proposal network
3. How to make the region proposal network and the Fast R-CNN network share the feature extraction network
Region Proposal Network: Structure
The basic assumption is that all possible candidate boxes can be judged on the extracted feature map. Because a position refinement step follows, the candidate boxes can be kept sparse.
Feature Extraction
The original feature extraction stage (gray box) contains several conv+relu layers and can directly reuse common classification networks trained on ImageNet. This paper tests two networks: the 5-layer ZF [3] and the 16-layer VGG-16 [4]; their concrete structures are not described further.
An additional conv+relu layer is added on top, producing a 51×39×256-dimensional feature map.
Candidate Regions (Anchors)
The feature map can be viewed as a 51×39 image with 256 channels. For each position of this image, 9 possible candidate windows are considered: three sizes {128², 256², 512²} × three aspect ratios {1:1, 1:2, 2:1}. These candidate windows are called anchors. The figure (not reproduced here) shows the 51×39 anchor centers together with 9 example anchors.
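To make the anchor layout concrete, here is a minimal NumPy sketch (not the authors' released code) that enumerates the 9 anchors at every position of the 51×39 feature map. The feature stride of 16 pixels and the cell-centered placement are assumptions for illustration:

```python
import numpy as np

def generate_anchors(feat_h=39, feat_w=51, stride=16,
                     areas=(128**2, 256**2, 512**2),
                     ratios=(1.0, 0.5, 2.0)):
    """Enumerate 9 anchors (3 areas x 3 aspect ratios) centered at
    every position of a feat_h x feat_w feature map."""
    # One (w, h) pair per (area, ratio) combination, with ratio = h / w.
    shapes = []
    for area in areas:
        for ratio in ratios:
            w = np.sqrt(area / ratio)
            shapes.append((w, w * ratio))

    # Anchor centers lie on the input image, one per feature-map cell.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)   # (H*W, 2)

    # Combine every center with every shape -> boxes in (x1, y1, x2, y2) form.
    boxes = [np.concatenate([centers - [w / 2, h / 2],
                             centers + [w / 2, h / 2]], axis=1)
             for w, h in shapes]
    return np.concatenate(boxes, axis=0)

anchors = generate_anchors()
print(anchors.shape)   # (17901, 4) = 51 * 39 * 9
```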
Three kinds of scales appear in the overall Faster R-CNN algorithm:
Original scale: the size of the raw input image. Unrestricted, and does not affect performance.
Normalized scale: the size of the input to the feature extraction network, fixed at test time; in the source code, opts.test_scale=600. Anchors are defined at this scale. The size of this parameter relative to the anchors determines the range of target sizes that can be detected.
Network input scale: the size of the input to the feature detection network, used during training; 224×224 in the source code.
Window classification and location refinement
The classification layer (cls_score) outputs, at each position, the probability that each of the 9 anchors belongs to the foreground or the background; the window regression layer (bbox_pred) outputs, at each position, the translation and scaling parameters that the window of each of the 9 anchors should apply.
For each position, the classification layer outputs foreground/background probabilities (2×9 = 18 values) from the 256-d feature, and the window regression layer outputs 4 translation/scaling parameters per anchor (4×9 = 36 values) from the same 256-d feature.
Locally, these two layers are fully connected networks; globally, because the parameters are identical at all 51×39 positions, they are actually implemented as 1×1 convolutional networks.
It is important to note that no candidate windows are ever explicitly extracted; the network itself completes the judgment and correction.
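As an illustration of this structure, here is a minimal PyTorch sketch (the original code is MATLAB/Caffe) of the sliding-window head: the extra conv+relu layer mentioned earlier, followed by the two sibling 1×1 convolution layers. The class and variable names are made up for this example:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """One shared 3x3 conv (the extra conv+relu layer), then two sibling
    1x1 convs for classification and window regression."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        # 2 scores (foreground / background) per anchor.
        self.cls_score = nn.Conv2d(256, num_anchors * 2, kernel_size=1)
        # 4 translation/scaling parameters per anchor.
        self.bbox_pred = nn.Conv2d(256, num_anchors * 4, kernel_size=1)

    def forward(self, feat):                # feat: (N, 256, 39, 51)
        x = torch.relu(self.conv(feat))
        return self.cls_score(x), self.bbox_pred(x)

head = RPNHead()
scores, deltas = head(torch.zeros(1, 256, 39, 51))
print(scores.shape, deltas.shape)   # (1, 18, 39, 51) (1, 36, 39, 51)
```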
Region Proposal Network: Training Samples
For each image in the training set, anchors are labeled as follows (a sketch follows the list):
A. For each ground-truth candidate region, the anchor with the largest overlap is marked as a foreground sample.
B. For the anchors remaining after (A), if an anchor's overlap with some ground truth is greater than 0.7, mark it as a foreground sample; if its overlap with every ground truth is less than 0.3, mark it as a background sample.
C. The anchors remaining after (A) and (B) are discarded.
D. Anchors that cross the image boundary are also discarded.
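The sketch below implements rules A-C with NumPy, assuming the IoU overlap matrix between anchors and ground-truth boxes has already been computed; boundary filtering (rule D) is omitted:

```python
import numpy as np

def label_anchors(iou):
    """iou: (A, G) matrix of overlaps between A anchors and G ground-truth
    boxes. Returns +1 (foreground), 0 (background), -1 (discarded)."""
    labels = -np.ones(iou.shape[0], dtype=np.int8)  # C: discard by default
    max_iou = iou.max(axis=1)
    labels[max_iou < 0.3] = 0        # B: background if below 0.3 for all GT
    labels[max_iou > 0.7] = 1        # B: foreground if above 0.7 for some GT
    labels[iou.argmax(axis=0)] = 1   # A: best-overlap anchor per GT box
    # D: anchors crossing the image boundary would also be discarded here.
    return labels
```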
Cost function
Minimize the two kinds of cost simultaneously:
A. Classification error
B. The window position deviation of the foreground sample
For details, see the section "Classification and position adjustment" in the Fast R-CNN article.
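For reference, the multi-task loss defined in the paper combines the two terms as

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $p_i$ is the predicted foreground probability of anchor $i$, $p_i^*$ is its 0/1 label, $t_i$ and $t_i^*$ are the predicted and ground-truth window parameters, $L_{cls}$ is log loss, and $L_{reg}$ is the smooth-L1 loss. The $p_i^*$ factor means the regression cost is only counted for foreground samples.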
Hyper-Parameters
The original feature extraction network is initialized from an ImageNet classification model; the remaining new layers are randomly initialized.
Each mini-batch contains 256 anchors sampled from a single image, with a 1:1 ratio of foreground to background samples.
The first 60k iterations use a learning rate of 0.001; the following 20k iterations use 0.0001.
Momentum is set to 0.9 and weight decay to 0.0005. [5]
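These settings translate directly into a standard SGD configuration; the following is a minimal sketch in PyTorch (the original code is MATLAB/Caffe), with a placeholder model:

```python
import torch

model = torch.nn.Conv2d(256, 18, 1)   # placeholder for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
# Multiply the learning rate by 0.1 (0.001 -> 0.0001) after 60k iterations;
# scheduler.step() would be called once per iteration here, not per epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60000], gamma=0.1)
```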
Shared Features
Both the region proposal network (RPN) and Fast R-CNN require an original feature extraction network (gray box). This network is initialized with parameters $w_0$ obtained from ImageNet classification, but how should the parameters be fine-tuned so that they satisfy the needs of both sides? The paper describes three approaches.
Alternating Training
A. Starting from $w_0$, train the RPN. Use the RPN to extract candidate regions on the training set.
B. Starting from $w_0$, train Fast R-CNN on these candidate regions; denote the resulting parameters $w_1$.
C. Starting from $w_1$, train the RPN again, and so on.
In practice, only two rounds of this alternation are performed, and some layers are frozen during training. The experiments in the paper use this method; a skeleton of the loop is sketched below.
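In the skeleton, train_rpn, extract_proposals, and train_fast_rcnn are hypothetical stand-ins for the real training routines:

```python
# Runnable skeleton of the alternating scheme; all three functions are
# hypothetical stand-ins for the actual optimization loops.
def train_rpn(init):               # trains an RPN from the given weights
    return {"shared": init}

def extract_proposals(rpn):        # runs the RPN over the training set
    return ["roi"] * 300

def train_fast_rcnn(init, rois):   # returns the updated shared weights
    return init + 1

w = 0                              # stands in for the ImageNet weights w0
for _ in range(2):                 # only two rounds are run in practice
    rpn = train_rpn(init=w)                  # A: train the RPN starting from w
    rois = extract_proposals(rpn)            # A: extract candidate regions
    w = train_fast_rcnn(init=w, rois=rois)   # B: yields w1; C: repeat from w1
```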
As Ross Girshick said in his ICCV 2015 tutorial "Training R-CNNs of Various Velocities", there is no fundamental reason for this approach; it was mainly a matter of "implementation issues and deadlines".
Approximate joint training
Train directly on the combined structure. When the backward pass computes gradients, the extracted ROI regions are treated as fixed values; when the backward pass updates parameters, the increments coming from the RPN and from Fast R-CNN are merged for the shared original feature extraction layers.
This method produces results similar to the previous one but reduces training time by 20% to 25%. It is included in the released Python code.
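The "approximate" part can be illustrated with a toy PyTorch example: detaching the proposal outputs blocks gradients through the box coordinates, while both losses still update the shared layers. All tensors here are stand-ins, not real RPN/Fast R-CNN computations:

```python
import torch
import torch.nn as nn

shared = nn.Conv2d(3, 8, 3, padding=1)   # stand-in for the shared layers
feat = shared(torch.randn(1, 3, 32, 32))
deltas = feat.mean(dim=(2, 3))           # stand-in for RPN box outputs
rois = deltas.detach()                   # treated as fixed values backward
det_loss = feat.sum() * rois.sum()       # stand-in for the Fast R-CNN loss
rpn_loss = deltas.pow(2).sum()           # stand-in for the RPN loss
(det_loss + rpn_loss).backward()         # both branches update `shared`
```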
Joint training
Train directly on the combined structure. However, when the backward pass computes gradients, the effect of changes in the ROI regions must be taken into account. The derivation is beyond the scope of this article; see the NIPS 2015 paper [6].
Experiment
In addition to the basic performance mentioned at the beginning, there are some notable conclusions:
- When the number of candidate regions generated per image is reduced from 2000 to 300, the recall of this paper's RPN method (red and blue curves) drops far less than that of the selective search method (black curve). This shows that the RPN proposals are more purposeful.
- Training on the larger Microsoft COCO dataset [7] and testing directly on PASCAL VOC raises accuracy by 6%. This shows that Faster R-CNN transfers well and does not overfit.
- Girshick, Ross, et al. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
- Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2015.
- M. D. Zeiler and R. Fergus, "Visualizing and Understanding Convolutional Networks," European Conference on Computer Vision (ECCV), 2014.
- K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," International Conference on Learning Representations (ICLR), 2015.
- The learning rate controls the relationship between the update increment and the gradient; momentum retains part of the previous iteration's increment; weight decay shrinks the parameters at each iteration, which is equivalent to regularization.
- Jaderberg et al., "Spatial Transformer Networks." NIPS 2015.
- A detection dataset with more than 300,000 images and 80 object classes. See http://mscoco.org/.
"Target detection" Faster RCNN algorithm detailed