Faster R-CNN Study Record


"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

-- Study notes (Simon John)

The problem the paper aims to solve (towards real-time object detection)

The proposal-extraction step used by SPP-net and Fast R-CNN is very time consuming. The authors therefore propose a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, making proposal generation nearly cost-free.

1. Overview of methods

Faster R-CNN = Fast R-CNN + RPN.

To build the RPN, the authors take the convolutional (conv) feature map used in Fast R-CNN (note: the last conv layer of ZF or VGG-16) and add two extra convolutional layers: (1) the first encodes each conv-map position (i.e., the image region that the sliding window maps back to on the original image) into a short feature vector (e.g., 256-d), much like an ordinary convolution; (2) the second, at each conv-map position, outputs the objectness score and regressed bounds for k proposals of various scales and aspect ratios at that position (k = 9 is typical).
The RPN is a fully convolutional network (FCN) and can be trained end-to-end specifically for the task of generating detection proposals. To unify the RPN with the Fast R-CNN [5] detection network, the authors propose a simple training scheme that alternates between fine-tuning for the proposal task and fine-tuning for detection while keeping the proposals fixed. This scheme converges quickly and produces a single standard network in which the two tasks share convolutional features.

2. RPN Details

The RPN takes an image (of any size) as input and outputs a set of rectangular proposals, each with an objectness score.

Because the RPN shares computation with Fast R-CNN, the two networks share a series of convolutional layers. A small network is slid over the conv feature map output by the last shared conv layer; it is fully connected to an n×n spatial window of the input conv feature map (n = 3 in the paper). Each sliding window is mapped to a low-dimensional vector (256-d for ZF, 512-d for VGG; each feature-map channel contributes one value per sliding-window position).

This vector is fed into two sibling fully connected layers: a bounding-box regression layer (reg) and a bounding-box classification layer (cls). Note that the effective receptive field on the image is large (171 pixels for ZF, 228 pixels for VGG). See Figure 1 (left). Because the small network operates as a sliding window, the fully connected layers share parameters across all positions: the structure is naturally implemented as an n×n conv layer (n = 3 in the paper) followed by two sibling 1×1 conv layers (playing the role of the fully connected layers for reg and cls, respectively), with a ReLU applied to the output of the n×n conv layer.
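To make this concrete, here is a minimal PyTorch sketch of the head just described (the class and argument names are mine, and it assumes a VGG-16-style 512-channel feature map with k = 9 anchors per position; an illustration, not the authors' code):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal sketch of the RPN head: 3x3 conv + two sibling 1x1 convs."""

    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        # the n x n (3x3) sliding window, implemented as a conv layer
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # two sibling 1x1 conv layers standing in for the fc layers
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)  # 2k objectness scores
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)  # 4k box coordinates

    def forward(self, feature_map):
        h = self.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)
```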

Figure 1: Left: the Region Proposal Network (RPN). Right: example detections on the PASCAL VOC test set using RPN proposals. Objects are detected across a wide range of scales and aspect ratios.

Remark on anchors: the three boxes shown are not three different sliding windows; the sliding window itself is fixed at 3×3.

At each sliding-window position, the network predicts k proposals simultaneously. The reg layer therefore has 4k outputs (the coordinate encodings of the k boxes), and the cls layer outputs 2k scores (the target/non-target probability estimate for each proposal). For simplicity, the paper implements the cls layer as a two-class softmax layer; the authors note that k logistic-regression outputs could be used to produce the k scores instead.

The k proposal boxes are parameterized relative to k corresponding anchor boxes. Each anchor is centered at the sliding window's center (mapped back to a region of the original image) and is associated with one scale and one aspect ratio. The paper uses 3 scales and 3 aspect ratios (1:1, 1:2, 2:1), giving k = 9 anchors at each sliding position.

Implementation Details

For the anchors, 3 simple scales are used, with bounding-box areas of 128×128, 256×256, and 512×512 pixels, together with 3 simple aspect ratios: 1:1, 1:2, and 2:1.
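A small NumPy sketch of generating the 9 base anchors under these settings (the centering and rounding conventions here are my assumptions, not the authors' exact code):

```python
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return the 9 base anchors as (x1, y1, x2, y2), centered at (0, 0).

    `scales` are the square roots of the box areas from the paper;
    `ratios` are height/width ratios.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the area close to s*s while setting h/w = r
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)  # shape (9, 4)
```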

(Note: when predicting large proposals, the algorithm allows anchor boxes larger than the underlying receptive field. Such predictions are not impossible: as long as the middle of an object is visible, its overall extent can still be roughly inferred, even if the whole object is not seen.)

With this design, the method needs neither multi-scale features nor multi-scale sliding windows to predict large regions, which saves considerable running time. Figure 1 (right) shows the method's ability to handle multiple scales and aspect ratios. The paper also tabulates the average proposal size learned for each anchor with the ZF network (input scale s = 600).

(Note: anchor boxes that cross the image boundary must be handled with care. During training, all anchors that cross the image boundary are ignored so that they do not contribute to the loss.)

(TODO: the correspondence between feature-map positions and the original image still needs to be added here!)

Example: for a typical 1000×600 image there are roughly 20k anchors in total (~60×40×9). After ignoring cross-boundary anchors, about 6k anchors per image remain for training. If these cross-boundary outliers were not ignored during training, they would introduce large, hard-to-correct error terms and training would not converge. At test time, the fully convolutional RPN is applied to the entire image; this may generate proposals that cross the boundary, which are then clipped to the image edge.
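The two boundary rules (ignore at training time, clip at test time) might look like the following NumPy sketch (helper names are mine):

```python
import numpy as np

def inside_anchors(anchors, img_w, img_h):
    """Training-time rule: keep only anchors fully inside the image."""
    keep = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
            (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h))
    return anchors[keep], np.where(keep)[0]

def clip_boxes(boxes, img_w, img_h):
    """Test-time rule: clip cross-boundary proposals to the image edge."""
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, img_w)  # x coordinates
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, img_h)  # y coordinates
    return boxes
```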

Proposals generated by the RPN overlap heavily with one another, so non-maximum suppression (NMS) is applied to reduce redundancy. With the NMS IoU threshold fixed at 0.7, about 2k proposals per image remain. NMS does not hurt the final detection accuracy but substantially reduces the number of proposals. After NMS, the top-N ranked proposals are used for detection.
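For reference, a plain NumPy sketch of greedy NMS with the 0.7 threshold (a generic implementation, not the authors' optimized one):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS on (x1, y1, x2, y2) boxes; returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavy overlaps
    return keep
```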

3. Training the Faster R-CNN Network

3.1 Training the RPN

(The RPN head consists of the cls layer and the reg layer. Note: the 256-d feature feeds directly into these two layers. The sliding window first yields the proposal at that position; the cls layer takes that proposal's feature and outputs its objectness score. If the proposal is a target, the reg layer's loss covers the regression of its coordinates, with four output nodes. The reg layer operates on the same proposal's 256-d feature; that is, the cls and reg layers use the same features.)

For the cls layer

Assign each anchor a binary label (target or background).

A positive label is assigned to two kinds of anchors: (i) the anchor(s) with the highest IoU (intersection-over-union) overlap with a ground-truth (GT) box (note: this IoU may be below 0.7);

(ii) any anchor whose IoU overlap with some GT box exceeds 0.7.

(Note: a single GT box may assign positive labels to multiple anchors. A negative label is assigned to any anchor whose IoU is below 0.3 for every GT box.) (Note: anchors that are neither positive nor negative have no effect on the training objective.)
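These rules can be summarized in a short NumPy sketch (the function name, the thresholds as arguments, and the -1 "ignored" convention are my assumptions):

```python
import numpy as np

def label_anchors(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors from a (num_anchors, num_gt) IoU matrix.

    Returns 1 (positive), 0 (negative), or -1 (ignored) per anchor.
    """
    labels = np.full(iou.shape[0], -1)
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thresh] = 0   # below 0.3 for every GT box
    labels[max_iou >= pos_thresh] = 1  # rule (ii): above 0.7 with some GT box
    labels[iou.argmax(axis=0)] = 1     # rule (i): best anchor for each GT box
    return labels
```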

Then the loss function for an image is defined as:

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)

(Here i is the index of an anchor in a mini-batch, and p_i is the predicted probability that anchor i is a target. The GT label p_i^* is 1 if the anchor is positive and 0 if it is negative. t_i is a vector of the 4 parameterized coordinates of the predicted bounding box, and t_i^* is the coordinate vector of the GT box corresponding to a positive anchor.)

The classification loss L_{cls} is the log loss over two classes (target vs. non-target).

The regression loss is L_{reg}(t_i, t_i^*) = R(t_i - t_i^*), where R is a robust loss function (smooth L1):

R(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}

The factor p_i^* in front of L_{reg} means that only positive anchors (p_i^* = 1) incur a regression loss, and no others (p_i^* = 0). The outputs of the cls and reg layers consist of \{p_i\} and \{t_i\}, respectively. The two terms are normalized by N_{cls} and N_{reg} and weighted by a balance parameter \lambda. In the earlier released code, \lambda = 10, the cls term is normalized by the mini-batch size (N_{cls} = 256), and the reg term by the number of anchor locations (N_{reg} ≈ 2,400), so the cls and reg terms carry roughly equal weight.
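A compact PyTorch sketch of this multi-task loss under the conventions above (tensor layouts and the -1 "ignored" label are my assumptions):

```python
import torch.nn.functional as F

def rpn_loss(cls_logits, box_deltas, labels, reg_targets,
             lam=10.0, n_cls=256, n_reg=2400):
    """Sketch of L = L_cls / N_cls + lambda * sum(p* L_reg) / N_reg.

    labels: LongTensor of shape (N,) with 1 = positive, 0 = negative,
    -1 = ignored; cls_logits: (N, 2); box_deltas, reg_targets: (N, 4).
    """
    used = labels >= 0  # ignored anchors contribute nothing
    cls_loss = F.cross_entropy(cls_logits[used], labels[used],
                               reduction='sum') / n_cls
    pos = labels == 1   # p* = 1: regression loss only for positive anchors
    reg_loss = F.smooth_l1_loss(box_deltas[pos], reg_targets[pos],
                                reduction='sum') / n_reg
    return cls_loss + lam * reg_loss
```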

For the reg layer (regression)

The bounding-box regression can be understood as regressing from an anchor box to a nearby GT box.
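For reference, the paper parameterizes the 4 coordinates relative to the anchor (x, y denote the box center, w and h its width and height; x, x_a, and x^* refer to the predicted box, the anchor box, and the GT box, and likewise for y, w, h):

t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)

t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)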

For details of the bounding-box regression, see http://caffecn.cn/?/question/160; the full derivation and explanation have been uploaded to Tower (for those interested).

Each anchor has its own regressor. (Note: each regressor corresponds to one anchor scale and aspect ratio, and weights are not shared among the k anchors' regressors.) Therefore, even though the features have a fixed size/scale, it is still possible to predict bounding boxes of various sizes.

4. Alternating Training: Sharing Convolutional Features

(Note: the proposal network shares convolutional features with the target-detection network, Fast R-CNN.)

If the RPN and Fast R-CNN are trained independently, each will modify the convolutional layers in its own way. The authors therefore propose a technique that allows the convolutional layers to be shared between the two networks, rather than learning two separate networks. (Note: this is not as simple as defining a single network containing both the RPN and Fast R-CNN and optimizing it with backpropagation. The reason is that Fast R-CNN training depends on fixed object proposals, and it is not clear a priori that learning Fast R-CNN while simultaneously changing the proposal mechanism would converge. The authors leave this joint optimization as an interesting question for future work.)

The authors developed a practical 4-step training algorithm (introduced near the end of the paper's introduction) to learn shared features by alternating optimization.
Step 1: train the RPN as described above. The network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task.

Step 2: train a separate detection network (Fast R-CNN) using the proposals generated by the step-1 RPN. This detection network is likewise initialized with an ImageNet-pre-trained model. At this point the two networks do not yet share convolutional layers.

Step 3: use the detection network to initialize RPN training again, but fix the shared convolutional layers and fine-tune only the layers unique to the RPN. Now the two networks share convolutional layers.

Step 4: keeping the shared convolutional layers fixed, fine-tune the remaining layers of Fast R-CNN. The two networks then share the same convolutional layers and form a unified network, as sketched below.
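A purely structural Python sketch of the four steps (all callables and keyword names here are hypothetical stand-ins for real training code):

```python
def alternating_training(imagenet_init, train_rpn, train_fast_rcnn):
    """Mirror of the 4-step alternating optimization described above."""
    # Step 1: ImageNet-initialized RPN, fine-tuned end-to-end for proposals.
    rpn = train_rpn(init=imagenet_init(), fix_shared_conv=False)
    # Step 2: separate Fast R-CNN trained on step-1 proposals (no sharing yet).
    detector = train_fast_rcnn(init=imagenet_init(), proposals=rpn.propose())
    # Step 3: re-train the RPN from the detector, fixing shared conv layers.
    rpn = train_rpn(init=detector, fix_shared_conv=True)
    # Step 4: fine-tune Fast R-CNN's unique layers; conv layers stay shared.
    detector = train_fast_rcnn(init=detector, proposals=rpn.propose(),
                               fix_shared_conv=True)
    return rpn, detector
```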

5. Optimization

The RPN is a fully convolutional network trained end-to-end with backpropagation (BP) and stochastic gradient descent (SGD). Training follows the "image-centric" sampling strategy: each mini-batch comes from a single image that contains many positive and negative example anchors.

Anchors are randomly sampled in an image to compute the mini-batch loss function, with the sampled positive and negative anchors at a ratio of up to 1:1. If an image contains fewer than 128 positive samples, negative samples pad out the mini-batch (a sketch follows).
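A NumPy sketch of this sampling rule, assuming the 256-anchor mini-batch implied by N_{cls} = 256 above (function and argument names are mine; it also assumes enough negatives are available for padding):

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, max_pos=128):
    """Pick up to 128 positives, then pad with negatives to 256 anchors."""
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    pos = np.random.choice(pos, min(len(pos), max_pos), replace=False)
    neg = np.random.choice(neg, batch_size - len(pos), replace=False)
    return np.concatenate([pos, neg])
```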
All new layers (the layers after the last shared convolutional layer) are randomly initialized with weights drawn from a zero-mean Gaussian with standard deviation 0.01, and all other layers (i.e., the shared convolutional layers) are initialized from a model pre-trained on ImageNet classification.

(On the PASCAL dataset, the authors use a learning rate of 0.001 for the first 60k mini-batches and 0.0001 for the next 20k mini-batches, with momentum 0.9 and weight decay 0.0005.)
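These hyper-parameters translate into, for example, the following PyTorch configuration (the placeholder module is mine; only the SGD numbers come from the paper):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(512, 512, 3)  # placeholder module standing in for the RPN
# momentum 0.9 and weight decay 0.0005, as in the paper
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
# lr 0.001 for the first 60k mini-batches, then 0.0001 for the next 20k;
# the scheduler is stepped once per mini-batch
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60000], gamma=0.1)
```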

Note: Faster R-CNN
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
https://github.com/rbgirshick/py-faster-rcnn

https://github.com/ShaoqingRen/faster_rcnn
