Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

Abstract

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances such as SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, making region proposals nearly cost-free. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which Fast R-CNN uses for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [19], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and PASCAL VOC 2012 (70.4% mAP) using 300 proposals per image. The code has been made publicly available.

1. Introduction

Recent advances in object detection are driven by the success of region proposal methods (e.g., [22]) and region-based convolutional neural networks (R-CNNs) [6]. Although region-based CNNs were computationally expensive as originally presented in [6], their cost has been drastically reduced by sharing convolutions across proposals [7, 5]. The latest incarnation, Fast R-CNN [5], achieves near real-time rates using very deep networks [19], provided the time spent generating region proposals is ignored. Now, **proposals are the computational bottleneck in state-of-the-art detection systems**.

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search (SS) [22], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet compared to efficient detection networks [5], Selective Search is an order of magnitude slower, at about 2 s per image in a CPU implementation. EdgeBoxes [24] currently provides the best tradeoff between proposal quality and speed, at about 0.2 s per image. Nevertheless, the region proposal step still consumes about as much running time as the detection network.

One may note that Fast R-CNN takes advantage of GPUs, while the region proposal methods are implemented on the CPU, which makes such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but it ignores the downstream detection network and therefore misses important opportunities for sharing computation.

In this paper, we show that an algorithmic change, **computing proposals with a deep network**, leads to a simple and effective solution in which proposal computation is nearly cost-free given the detection network's computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [7, 5]. By sharing convolutions at test time, the marginal cost of computing proposals is small (e.g., 10 ms per image).

Our observation is that the convolutional (conv) feature maps used by region-based detectors, such as Fast R-CNN, can also be used to generate region proposals. On top of these conv features, we construct RPNs by adding two additional conv layers: the first **encodes each conv map position into a short (e.g., 256-d) feature vector**, and the second, **at each conv map position, outputs an objectness score and regressed bounds** for k region proposals relative to various scales and aspect ratios at that location (k = 9 is a typical value).

Our RPNs are thus a kind of **fully convolutional network** (FCN) [14], and they can be trained end-to-end specifically for the task of generating detection proposals. To unify RPNs with Fast R-CNN [5] object detection networks, we propose a simple training scheme that **alternates between fine-tuning for the region proposal task and fine-tuning for object detection, while keeping the proposals fixed**. This scheme converges quickly and produces a unified network whose conv features are shared between both tasks.

We evaluate our method on the PASCAL VOC detection benchmarks [4], where RPNs with Fast R-CNN produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNN. Meanwhile, our method removes nearly all of the computational burden of SS at test time; the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [19], our detection method still has a frame rate of 5 fps (including all steps) on a GPU, and is thus a practical object detection system in terms of both speed and accuracy (73.2% mAP on PASCAL VOC 2007 and 70.4% mAP on PASCAL VOC 2012).

2. Related work

Several recent papers have proposed ways of using deep networks to locate class-specific or class-agnostic bounding boxes [21, 18, 3, 20]. In the OverFeat method [18], a fully connected (fc) layer is trained to predict the box coordinates for the localization task, which assumes a single object. The fc layer is then turned into a conv layer for detecting multiple class-specific objects. The MultiBox methods [3, 20] generate region proposals from a network whose last fc layer simultaneously predicts multiple (e.g., 800) boxes, which are then used for R-CNN [6] object detection. Their proposal network is applied to a single image or to multiple large image crops (e.g., 224×224) [20]. We discuss OverFeat and MultiBox in more depth later, in the context of our method.

Shared computation of convolutions [18, 7, 2, 5] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [18] computes conv features from an image pyramid for classification, localization, and detection. Adaptively sized pooling (SPP) [7] on shared conv feature maps was proposed for efficient region-based object detection [7, 16] and semantic segmentation [2]. Fast R-CNN [5] enables end-to-end detector training on shared conv features and shows compelling accuracy and speed.

3. Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [14], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [5], we assume **that both networks share a common set of conv layers**. In our experiments, we investigate the Zeiler and Fergus (ZF) model, which has 5 shareable conv layers, and the Simonyan and Zisserman (VGG) model, which has 13 shareable conv layers.

To generate region proposals, we slide a small network over the conv feature map output by the last shared conv layer. This network is fully connected to an n×n spatial window of the input conv feature map. **Each sliding window is mapped to a lower-dimensional vector** (256-d for ZF and 512-d for VGG; each feature map contributes one value per sliding window). This vector is fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 pixels for ZF and 228 pixels for VGG). Figure 1 (left) illustrates this mini-network at a single position. Note that because the mini-network operates in a sliding-window fashion, the fully connected layers are shared across all spatial locations (that is, every location uses the same layer parameters to compute the n×n inner product). This architecture is naturally implemented as an n×n conv layer followed by two sibling 1×1 conv layers (for reg and cls, respectively). ReLUs [15] are applied to the output of the n×n conv layer.

*Figure 1: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on the PASCAL VOC 2007 test set. Our method detects objects in a wide range of scales and aspect ratios.*

Translation-Invariant Anchors

At each sliding-window location, we simultaneously predict k region proposals, so the **reg layer has 4k outputs** encoding the coordinates of k boxes. The **cls layer outputs 2k scores** that estimate the probability of object / not-object for each proposal (for simplicity, we implement the cls layer as a two-class softmax layer; alternatively, one may use logistic regression to produce k scores). The k proposals are parameterized relative to k reference boxes, called **anchors**. Each anchor is centered at the sliding window in question and is associated with a scale and an aspect ratio. We use 3 scales and 3 aspect ratios, yielding **k = 9** anchors at each sliding position. For a conv feature map of size W×H (typically about 2,400 positions), there are W·H·k anchors in total. An important property of our approach is that it is **translation invariant**, both in terms of the anchors and in terms of the functions that compute proposals relative to the anchors.
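As an illustration of the anchor layout described above, here is a minimal Python sketch. It assumes the scales (128, 256, 512) and aspect ratios (1:1, 1:2, 2:1) given under Implementation Details and a feature stride of 16 pixels. The function names `make_anchors` and `all_anchors`, and the convention that a window centre sits at the middle of its stride cell, are assumptions made for this sketch and are not taken from the authors' code.

```python
import math

def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors as (x1, y1, x2, y2) tuples."""
    anchors = []
    for scale in scales:
        area = float(scale * scale)          # each scale fixes the box area
        for ratio in ratios:                 # ratio = height / width
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2))
    return anchors

def all_anchors(feat_w, feat_h, stride=16):
    """Enumerate anchors at every sliding position of a feat_w x feat_h feature map."""
    boxes = []
    for j in range(feat_h):
        for i in range(feat_w):
            # window centre in image pixels (assumed convention)
            cx = i * stride + stride / 2
            cy = j * stride + stride / 2
            boxes.extend(make_anchors(cx, cy))
    return boxes

anchors = all_anchors(60, 40)
print(len(anchors))  # W * H * k = 60 * 40 * 9 = 21600
```

Because the same nine shapes are replicated at every position, shifting an object by one stride simply selects the same anchor shape at the neighbouring position, which is the translation invariance the text refers to.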

As a comparison, the MultiBox method [20] uses k-means to generate 800 anchors, which are not translation invariant. If one translates an object in an image, the proposal should translate as well, and the same function should be able to predict the proposal in either location. Moreover, because the MultiBox anchors are not translation invariant, MultiBox requires a (4+1)×800-dimensional output layer, whereas our method requires only a (4+2)×9-dimensional output layer. Our proposal layers have an order of magnitude fewer parameters (27 million for MultiBox using GoogLeNet [20] vs. 2.4 million for RPN using VGG-16), and thus have less risk of overfitting on small datasets such as PASCAL VOC.
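The 2.4 million figure can be checked with a short back-of-the-envelope calculation. The sketch below assumes the RPN head is a 3×3 conv mapping 512 input channels (the channel count of the last VGG-16 conv layer, an assumption not stated in the text) to the 512-d intermediate vector, followed by the two sibling 1×1 convs. Biases are ignored, and the helper names are hypothetical.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch

def rpn_head_weights(in_ch=512, mid_ch=512, k=9, n=3):
    shared = conv_weights(n, in_ch, mid_ch)   # n x n sliding-window layer
    cls = conv_weights(1, mid_ch, 2 * k)      # 2k objectness scores
    reg = conv_weights(1, mid_ch, 4 * k)      # 4k box coordinates
    return shared + cls + reg

print((4 + 2) * 9)         # RPN output dimension per position: 54
print((4 + 1) * 800)       # MultiBox output dimension: 4000
print(rpn_head_weights())  # 2386944, i.e. about 2.4 million
```

Almost all of the weights sit in the shared 3×3 layer; the two output layers together add fewer than 30 thousand.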

A Loss Function for Learning Region Proposals

To train RPNs, we assign a binary class label (object or not object) to each anchor. We assign a **positive label** to two kinds of anchors: (i) the anchor or anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth (GT) box (this IoU may be lower than 0.7), and (ii) any anchor that has an IoU overlap higher than 0.7 with any GT box. Note that a single GT box may assign positive labels to multiple anchors. We assign a **negative label** to a non-positive anchor if its IoU is lower than 0.3 for all GT boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
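The labelling rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the label values 1 / 0 / -1 (positive / negative / ignored) are a convention chosen here, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row) if row else 0.0
        if best > pos_thresh:        # rule (ii): IoU above 0.7 with some GT box
            labels.append(1)
        elif best < neg_thresh:      # IoU below 0.3 with every GT box
            labels.append(0)
        else:
            labels.append(-1)        # neither positive nor negative
    # rule (i): the best-overlapping anchor(s) of each GT box are positive,
    # even if their IoU is below pos_thresh
    for g in range(len(gt_boxes)):
        column = [row[g] for row in overlaps]
        top = max(column) if column else 0.0
        if top > 0:
            for idx, value in enumerate(column):
                if value == top:
                    labels[idx] = 1
    return labels

gt = [(0, 0, 100, 100)]
anchors = [(0, 0, 100, 60), (0, 0, 100, 50), (200, 200, 300, 300)]
print(label_anchors(anchors, gt))  # [1, -1, 0]
```

In the example, the first anchor has an IoU of only 0.6 but is still positive under rule (i), which guarantees that every ground-truth box has at least one positive anchor.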

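With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [5]. Our loss function for an image is defined as

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*).$$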

Here, $i$ is the index of an anchor in a mini-batch and $p_i$ is the predicted probability that anchor $i$ is an object. The ground-truth label $p_i^*$ is 1 if the anchor is positive and 0 if the anchor is negative. $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is the coordinate vector of the ground-truth box associated with a positive anchor.

The **classification loss** $L_{cls}$ is the log loss over two classes (object vs. not object). For the **regression loss**, we use $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where $R$ is the robust loss function (smooth L1) defined in [5].

The term $p_i^* L_{reg}$ means that the regression loss is activated only for positive anchors ($p_i^* = 1$) and is disabled otherwise ($p_i^* = 0$). The outputs of the cls and reg layers consist of $\{p_i\}$ and $\{t_i\}$, respectively. The two terms are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$. In the earlier released code, $\lambda = 10$, the cls term is normalized by the mini-batch size ($N_{cls} = 256$), and the reg term is normalized by the number of anchor locations ($N_{reg} \approx 2{,}400$), so that the cls and reg terms are roughly equally weighted.
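To make the weighting concrete, the following Python sketch evaluates a loss of this form for a small batch. It assumes the smooth L1 definition from Fast R-CNN [5] (0.5x² for |x| < 1, |x| − 0.5 otherwise) and the normalization constants quoted above. The function names and the small `eps` guard inside the logarithm are choices made for this sketch, not part of the paper.

```python
import math

def smooth_l1(x):
    """Robust loss R from Fast R-CNN: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p: predicted object probabilities; p_star: 0/1 labels;
    t, t_star: predicted and target 4-vectors, one per anchor."""
    eps = 1e-12  # numerical guard for log(0); an implementation detail of this sketch
    cls_sum, reg_sum = 0.0, 0.0
    for p_i, ps_i, t_i, ts_i in zip(p, p_star, t, t_star):
        # two-class log loss
        cls_sum += -(ps_i * math.log(p_i + eps) + (1 - ps_i) * math.log(1 - p_i + eps))
        # regression counts only for positive anchors (p_star = 1)
        if ps_i == 1:
            reg_sum += sum(smooth_l1(a - b) for a, b in zip(t_i, ts_i))
    return cls_sum / n_cls + lam * reg_sum / n_reg

zeros = (0.0, 0.0, 0.0, 0.0)
print(rpn_loss([0.9, 0.2], [1, 0], [(0.1, 0.0, 0.2, 0.0), zeros], [zeros, zeros]))
```

With λ = 10, the factor λ / N_reg = 10 / 2400 is close to 1 / N_cls = 1 / 256, which is why the two terms end up roughly equally weighted.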

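For regression, we adopt the parameterization of the 4 coordinates following [6]:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a),$$

$$t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a).$$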

Here, $x$, $y$, $w$, and $h$ denote the center coordinates of a box, its width, and its height. The variables $x$, $x_a$, and $x^*$ refer to the predicted box, the anchor box, and the ground-truth box, respectively (likewise for $y$, $w$, $h$). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
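A small Python sketch of this anchor-relative encoding follows. It assumes the R-CNN style parameterization of [6], in which centre offsets are normalized by the anchor's width and height and sizes are expressed as log ratios. Boxes are `(cx, cy, w, h)` tuples here, and `encode` / `decode` are hypothetical helper names.

```python
import math

def encode(box, anchor):
    """Map a (cx, cy, w, h) box to regression targets relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert encode: turn predicted offsets back into a (cx, cy, w, h) box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 128.0, 128.0)
gt_box = (110.0, 90.0, 150.0, 100.0)
targets = encode(gt_box, anchor)
print(targets)
print(decode(targets, anchor))  # recovers gt_box up to floating-point error
```

An anchor that already coincides with the ground-truth box encodes to all zeros, so the regressor only has to learn the correction from the anchor to the nearby box.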

Nevertheless, our method achieves bounding-box regression in a different manner from previous feature-map-based methods [7, 5]. In [7, 5], bounding-box regression is performed on features pooled from arbitrarily sized regions, and the regression weights are shared by all region sizes. In our formulation, the features used for regression have the same spatial size (n×n) on the feature maps. To account for varying sizes, a set of k bounding-box regressors is learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As a result, it is still possible to predict boxes of various sizes even though the features have a fixed size and scale.

Optimization

The RPN, which is naturally implemented as a fully convolutional network [14], can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [12]. We follow the "image-centric" sampling strategy from [5] to train this network. Each mini-batch arises from a single image that contains many positive and negative anchors. It is possible to optimize the loss over all anchors, but this would bias training towards negative samples because they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled **positive and negative anchors have a ratio of up to 1:1**. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.
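The sampling rule can be sketched as follows. This is an illustrative Python sketch that assumes anchors have already been labelled 1 (positive), 0 (negative), or -1 (ignored); that labelling convention, the function name `sample_minibatch`, and its parameters are assumptions for this example.

```python
import random

def sample_minibatch(labels, batch_size=256, max_pos_fraction=0.5, rng=None):
    """Pick anchor indices for one image: at most half positives, rest negatives."""
    rng = rng or random.Random()
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    n_pos = min(len(pos), int(batch_size * max_pos_fraction))
    # pad with negatives when there are fewer than batch_size / 2 positives
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)

labels = [1] * 30 + [0] * 1000 + [-1] * 50
pos_idx, neg_idx = sample_minibatch(labels, rng=random.Random(0))
print(len(pos_idx), len(neg_idx))  # 30 226
```

Ignored anchors are never sampled, and an image with few positives still yields a full mini-batch of 256 anchors as long as enough negatives exist.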

We randomly initialize all new layers (those after the last shared conv layer) by drawing weights from a zero-mean Gaussian distribution with a standard deviation of 0.01. All other layers (that is, the shared conv layers) are initialized from a model pre-trained for ImageNet classification [17], as is standard practice [6]. We tune all layers of the ZF network, and conv3_1 and above for the VGG network to conserve memory [5]. We use a **learning rate** of 0.001 for 60k mini-batches and 0.0001 for the next 20k mini-batches on the PASCAL dataset. We use a **momentum** of 0.9 and a **weight decay** of 0.0005 [11]. Our implementation uses Caffe [10].

Sharing Convolutional Features for Region Proposal and Object Detection

So far, we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will use these proposals. For the detection network, we adopt Fast R-CNN [5] and now describe an algorithm that learns conv layers shared between the RPN and Fast R-CNN.

RPN and Fast R-CNN, when trained independently, modify their conv layers in different ways. We therefore need a technique that allows **sharing conv layers between the two networks**, rather than learning two separate networks. Note that this is not as easy as simply defining a single network that includes both RPN and Fast R-CNN and then optimizing it jointly with backpropagation. The reason is that Fast R-CNN training depends on fixed object proposals, and it is not clear a priori whether learning Fast R-CNN while simultaneously changing the proposal mechanism will converge. While this joint optimization is an interesting question for future work, we develop a pragmatic **4-step training algorithm** to learn shared features via **alternating optimization**.

In the **first step**, we train the RPN as described above. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the **second step**, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized with the ImageNet-pre-trained model. At this point the two networks do not share conv layers. In the **third step**, we use the detection network to initialize RPN training, but we fix the shared conv layers and fine-tune only the layers unique to the RPN. Now the two networks share conv layers. Finally, in the **fourth step**, keeping the shared conv layers fixed, we fine-tune the fc layers of Fast R-CNN. In this way, both networks share the same conv layers and form a unified network.

Implementation Details

We train and test both the region proposal and object detection networks on single-scale images [7, 5]. We re-scale the images so that their shorter side is s = 600 pixels [5]. Multi-scale feature extraction may improve accuracy but does not offer a good speed-accuracy tradeoff [5]. We also note that for the ZF and VGG networks, the total stride at the last conv layer is 16 pixels on the re-scaled image, which corresponds to about 10 pixels on a typical PASCAL image (~500×375) before resizing (600/16 = 375/10). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.

For anchors, we use 3 scales with box areas of 128×128, 256×256, and 512×512 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. We note that our algorithm allows the use of anchor boxes that are larger than the underlying receptive field when predicting large proposals. Such predictions are not impossible: one may still roughly infer the extent of an object if only the middle of the object is visible. With this design, our solution does not need multi-scale features or multi-scale sliding windows to predict large regions, saving considerable running time. Figure 1 (right) shows the capability of our method for a wide range of scales and aspect ratios. The table below shows the learned average proposal size for each anchor using the ZF network (numbers for s = 600).

Anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so that they do not contribute to the loss. For a typical 1000×600 image, there are roughly 20k (≈ 60×40×9) anchors in total. With the cross-boundary anchors ignored, about 6k anchors per image remain for training. If the boundary-crossing outliers are not ignored during training, they introduce large, difficult-to-correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
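The two boundary-handling rules (drop at training time, clip at test time) might look like the following Python sketch. Boxes are `(x1, y1, x2, y2)` tuples in pixels, and the helper names are hypothetical rather than taken from the released code.

```python
def inside_image(box, width, height):
    """True if the box lies entirely within a width x height image."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= width and y2 <= height

def clip_to_image(box, width, height):
    """Clamp every coordinate of the box to the image boundary."""
    x1, y1, x2, y2 = box
    return (min(max(x1, 0), width), min(max(y1, 0), height),
            min(max(x2, 0), width), min(max(y2, 0), height))

boxes = [(10, 10, 200, 150), (-40, 20, 90, 300), (900, 500, 1100, 700)]

# training: keep only anchors fully inside the image
train_boxes = [b for b in boxes if inside_image(b, 1000, 600)]
print(train_boxes)  # [(10, 10, 200, 150)]

# testing: keep every proposal but clip it to the image
test_boxes = [clip_to_image(b, 1000, 600) for b in boxes]
print(test_boxes)   # [(10, 10, 200, 150), (0, 20, 90, 300), (900, 500, 1000, 600)]
```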

Some RPN proposals overlap heavily with each other. To reduce redundancy, we apply **non-maximum suppression** (NMS) to the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves about 2k proposal regions per image. As we will show, NMS does not harm the final detection accuracy but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2k RPN proposals, but evaluate different numbers of proposals at test time.
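For readers unfamiliar with NMS, here is a minimal greedy version in Python, using the 0.7 IoU threshold from the text. It is a plain O(n²) sketch for illustration and not the implementation used in the paper; the function names and the `top_n` parameter are choices made here.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7, top_n=None):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if top_n is not None and len(keep) == top_n:
                break
    return keep

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))           # [0, 2]
print(nms(boxes, scores, top_n=1))  # [0]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed; the returned indices are already ranked by score, so taking the top-N is a simple truncation.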

4. Experiments

We evaluate our method on the PASCAL VOC 2007 detection benchmark [4]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the "fast" version of the ZF network [23], which has 5 conv layers and 3 fc layers, and the public VGG-16 model [19], which has 13 conv layers and 3 fc layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than a proxy metric that focuses on object proposals).

Table 1 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF network. For Selective Search (SS) [22], we generate about 2k SS proposals in "fast" mode. For EdgeBoxes (EB) [24], we generate the proposals with the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6%. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals. (For RPN, the number of proposals, e.g., 300, is the maximum number generated for an image; RPN may produce fewer proposals, so the average number of proposals is smaller.) Using RPN yields a much faster detection system than using either SS or EB because of the shared conv computation; the smaller number of proposals also reduces the region-wise fc cost. Next, we consider several ablations of the RPN and then show that proposal quality improves when using the very deep network.

*Table 1: Detection results on the PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detector is Fast R-CNN with ZF, but using various proposal methods for training and testing.*

**Ablation experiments.** To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing conv layers between the RPN and the Fast R-CNN detection network. To do this, we stop after the second step of the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 1). We observe that this is because in the third step, when the detector-tuned features are used to fine-tune the RPN, the proposal quality improves.

Next, we disentangle the RPN's influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model using the 2k SS proposals and the ZF network. We fix this detector and evaluate the detection mAP while changing the proposal regions used at test time. In these ablation experiments, the RPN does not share features with the detector.

Replacing SS with 300 RPN proposals at test time leads to an mAP of 56.8%. The loss in mAP is caused by the inconsistency between the training and testing proposals. This result serves as the baseline for the following comparisons.

Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked 100 proposals at test time, indicating that the top-ranked RPN proposals are accurate. At the other extreme, using the top-ranked 6k RPN proposals (without NMS) gives a comparable mAP (55.2%), suggesting that NMS does not harm the detection mAP and may reduce false alarms.

Next, we separately investigate the roles of the RPN's cls and reg outputs by turning off either of them at test time. When the cls layer is removed at test time (so no NMS or ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1k (55.8%), but degrades considerably to 44.6% when N = 100. This shows that the **cls scores account for the accuracy of the highest-ranked proposals**.

On the other hand, when the reg layer is removed at test time (so the proposals are simply the anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to the regressed positions. The anchor boxes alone are not sufficient for accurate detection.

We also evaluate the effect of more powerful networks on the proposal quality of the RPN alone. We use VGG-16 to train the RPN and still use the SS+ZF detector described above. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because the proposals of RPN+ZF are competitive with SS (both reach 58.7% when used consistently for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.

**Detection accuracy and running time of VGG-16.** Table 2 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the Fast R-CNN result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than those of SS. Unlike SS, which is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%, better than the strong SS baseline, yet with nearly cost-free proposals. Following [5], we further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and VOC 2012 trainval; the mAP is 73.2%. On the PASCAL VOC 2012 test set (Table 3), our method has an mAP of 70.4% when trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval, following [5].

*Table 2: Detection results on the PASCAL VOC 2007 test set. The detector is Fast R-CNN with VGG-16. Training data: "07": VOC 2007 trainval; "07+12": union set of VOC 2007 trainval and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2k. This was reported in [5]; using the repository provided with this paper, this number is higher (68.0±0.3 in six runs).*

*Table 3: Detection results on the PASCAL VOC 2012 test set. The detector is Fast R-CNN with VGG-16. Training data: "12": VOC 2012 trainval; "07++12": union set of VOC 2007 trainval+test and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2k.*

In Table 4 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on image content (1.51 s on average), and Fast R-CNN with VGG-16 takes 320 ms on 2k SS proposals (or 223 ms if SVD is used on the fc layers [5]). Our system with VGG-16 takes **198 ms** in total for both proposal and detection. With the conv layers shared, the RPN alone takes only 10 ms to compute the additional layers. Our region-wise computation is also low, thanks to the smaller number of proposals (300). Our system has a frame rate of 17 fps with the ZF network.
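The frame rates follow directly from the per-image timings quoted above. The short Python sketch below does the arithmetic; the variable names are illustrative, and the SS pipeline total assumes the average SS time (1.51 s) is simply added to the 320 ms detection time.

```python
def fps(total_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / total_ms

ss_pipeline_ms = 1510 + 320   # average SS proposals + Fast R-CNN (VGG-16)
rpn_pipeline_ms = 198         # RPN + Fast R-CNN (VGG-16), all steps

print(round(fps(ss_pipeline_ms), 1))              # 0.5
print(round(fps(rpn_pipeline_ms), 1))             # 5.1
print(round(ss_pipeline_ms / rpn_pipeline_ms, 1)) # 9.2, the overall speed-up
print(round(1000.0 / 17, 1))                      # 58.8 ms per image at 17 fps
```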

*Table 4: Timing (ms) on a K40 GPU, except for SS proposals, which are evaluated on a CPU. "Region-wise" includes NMS, pooling, fc, and softmax. See our released code for the profiling of running time.*

**Analysis of Recall-to-IoU.** Next, we compute the recall of proposals at different IoU ratios with ground-truth boxes. It is noteworthy that the Recall-to-IoU metric is only loosely related to the final detection accuracy [9, 8, 1]. It is more appropriate to use this metric to diagnose a proposal method than to evaluate it.
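As a concrete reading of this metric, the sketch below computes recall at a given IoU threshold for one image: the fraction of ground-truth boxes that are covered by at least one proposal with IoU at or above the threshold. This is an illustrative Python sketch with hypothetical function names, not the evaluation code behind the reported curves.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(proposals, gt_boxes, thresh):
    """Fraction of ground-truth boxes matched by some proposal at IoU >= thresh."""
    if not gt_boxes:
        return 0.0
    hits = sum(1 for g in gt_boxes if any(iou(p, g) >= thresh for p in proposals))
    return hits / len(gt_boxes)

gt = [(0, 0, 100, 100), (200, 200, 300, 300)]
proposals = [(0, 0, 100, 80), (400, 400, 500, 500)]
for thresh in (0.5, 0.7, 0.9):
    print(thresh, recall_at_iou(proposals, gt, thresh))
```

Sweeping the threshold from 0.5 to 1.0 and averaging over a test set would trace out a recall vs. IoU curve for a given number of top-ranked proposals.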

In Figure 2, we show the results of using 300, 1k, and 2k proposals. We compare with SS and EB, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2k to 300. This explains why the RPN has a good final detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS and EB drops more quickly than that of RPN when the proposals are fewer.

*Figure 2: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set.*

**One-stage detection vs. two-stage proposal + detection.** The OverFeat paper [18] proposes a detection method that uses regressors and classifiers on sliding windows over conv feature maps. OverFeat is a one-stage, class-specific detection pipeline, whereas ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features come from square (3×3) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of RPN + Fast R-CNN; the detector then attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [7, 5] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.

To compare the one-stage and two-stage systems, we emulate the OverFeat system with a one-stage Fast R-CNN (and thus also avoid other differences in implementation details). In this system, the "proposals" are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system uses multi-scale features, we also evaluate using conv features extracted from 5 scales. We use the same 5 scales as in [7, 5].

Table 5 compares the two-stage system and two variants of the one-stage system. Using the ZF model, the one-stage system has an mAP of 53.9%. This is lower than the two-stage system (58.7%) by 4.8%. This experiment justifies the effectiveness of cascading region proposals and object detection. Similar observations are reported in [5, 13], where replacing SS region proposals with sliding windows leads to about 6% degradation in both papers. We also note that the one-stage system is slower, as it has considerably more proposals to process.

*Table 5: One-stage detection vs. two-stage proposal + detection. Detection results are on the PASCAL VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features.*

5. Conclusion

We have presented Region Proposal Networks (RPNs) for efficient and accurate region proposal generation. By sharing convolutional features with the downstream detection network, the region proposal step is nearly cost-free. Our method enables a unified, **deep-learning**-based object detection system to run at 5-17 fps. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.

*Table 6: Results on the PASCAL VOC 2007 test set with the Fast R-CNN detector and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2k. RPN\* denotes the unshared-feature version.*

*Table 7: Results on the PASCAL VOC 2012 test set with the Fast R-CNN detector and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2k.*
