This article explains the network architecture and workflow of Faster R-CNN in detail, helping the reader understand the principles behind object detection; the author also provides the Luminoth implementation for reference.
Last year, we decided to dig deeper into Faster R-CNN, reading the original paper and the papers it cites, and we now have a clear understanding of how it works and how it is implemented.
We eventually implemented Faster R-CNN in Luminoth, a TensorFlow-based computer vision toolkit that is easy to train and monitor and supports several different models. So far, Luminoth has attracted considerable attention, and we have presented the project at ODSC Europe and ODSC West (ODSC: Open Data Science Conference, a conference series focused on open data science).
Based on our work developing Luminoth and the talks we have given, we thought it would be a good idea to gather all the details of Faster R-CNN and the related references into one blog post, as a resource for anyone interested in the field.
Background
Faster R-CNN was first published at NIPS 2015. It has undergone several revisions since its release, which we will discuss later in this post. Faster R-CNN is the third iteration of the R-CNN series of papers, which share Ross Girshick as an author.
It all started in 2014 with the paper "Rich feature hierarchies for accurate object detection and semantic segmentation" (R-CNN), which used an algorithm called Selective Search to extract regions of interest and a standard convolutional neural network (CNN) to classify and adjust those regions. Fast R-CNN, released in early 2015, evolved from and optimized R-CNN; one of its key techniques, Region of Interest Pooling, allows the network to share computation and speeds the model up. The series eventually culminated in Faster R-CNN, the first fully differentiable model.
Architecture
The architecture of Faster R-CNN is composed of several components, so it is somewhat complex. We will start with a high-level overview and then cover the details of each component.
Starting with an input image, we want to obtain a list of bounding boxes, a label assigned to each box, and a probability for each label and box.
Complete Faster R-CNN architecture
The input image is represented as a tensor of shape (height × width × depth), which is fed through a pre-trained convolutional neural network, stopping at an intermediate layer to obtain a feature map. We use this feature map as a feature extractor for the next stage.
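As an illustration, here is a minimal sketch of this feature extraction step using TensorFlow/Keras and its pre-trained VGG16 (an assumption for the example; this is not Luminoth's actual code, and the paper's official implementation used Caffe):

```python
import tensorflow as tf

# Load a VGG16 pre-trained on ImageNet, without the fully connected "top".
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
# Stop at an intermediate convolutional layer to get the feature map.
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=base.get_layer("block5_conv3").output,
)

image = tf.random.uniform((1, 600, 800, 3))  # (batch, height, width, depth)
feature_map = extractor(image)
print(feature_map.shape)  # (1, 37, 50, 512): roughly H/16 x W/16 x 512
```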
The approach above is commonly used in transfer learning, especially when training classifiers on small datasets, which usually borrow the weights of a network trained on another, larger dataset. We will take a closer look at this in the next section. Next comes the Region Proposal Network (RPN), which uses the features computed by the CNN to find a predefined number of regions (bounding boxes) that may contain objects.
Perhaps the biggest difficulty in using deep learning for object detection is generating a variable-length list of bounding boxes. When modeling with deep neural networks, the final part of the model is usually a fixed-size tensor output (recurrent neural networks aside). For example, in image classification the output is a tensor of shape (N,), where N is the number of classes and the scalar at position i is the probability that the image belongs to class i.
The variable-length problem is solved in the RPN by using anchors: fixed-size reference boxes placed uniformly over the original image. Instead of directly detecting where objects are, we model the problem relative to the anchors; for each anchor, we ask whether it contains a relevant object and how we would adjust it to better fit that object.
This may be a bit confusing at this point, but don't worry; we will dig into it below.
After obtaining a list of possible relevant objects and their locations in the original image, object detection becomes a relatively straightforward problem. Using the features extracted by the CNN and the bounding boxes of the relevant objects, we apply Region of Interest (RoI) Pooling on the feature map and store the features corresponding to each object in a new tensor. The final stage follows the R-CNN model and uses this information to classify the content of each bounding box (or discard it as background) and to adjust the box coordinates to better fit the object.
Obviously some information is lost along the way, but this is the basic idea of how Faster R-CNN performs object detection. Next, we discuss the architecture, the loss functions, and the training details of each component.
Base Network
As mentioned earlier, the first step of Faster R-CNN is to use a convolutional neural network pre-trained on an image classification task (for example, ImageNet), taking the output of an intermediate layer as features. This is simple for people with a deep learning background, but understanding how it is used and why is the key, and visualizing the intermediate feature outputs also helps. There is no consensus on which network architecture is best. The original Faster R-CNN used ZF and VGG pre-trained on ImageNet, but many different networks have appeared since, and the number of parameters varies greatly between them. For example, MobileNet, a small and efficient architecture optimized for speed, has about 3.3 million parameters, while ResNet-152 (152 layers), once the winner of the ImageNet classification contest, has about 60 million. Newer architectures such as DenseNet manage to reduce the number of parameters while improving accuracy.
VGG
Before discussing the pros and cons of the different architectures, let us take VGG-16 as an example to try to understand how Faster R-CNN works.
VGG Network Structure
VGG, whose name comes from the group that used it in the ImageNet ILSVRC 2014 competition, was first published in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" by Karen Simonyan and Andrew Zisserman. By today's standards the network is far from deep, but at the time of its release VGG-16 had more layers than the usual networks, helping drive the wave of "deeper → more expressive → better results" (as long as training remained feasible).
When VGG is used for classification, its input is a 224×224×3 tensor (an RGB image of 224×224 pixels). The input size is fixed in the classification task because the final part of the network uses fully connected layers, which require fixed-length inputs. The output of the last convolutional layer is flattened into a one-dimensional tensor before the fully connected layers.
Because we use the output of an intermediate layer of the convolutional network, the size of the input image is no longer constrained; at least, it is no longer a problem for this module, since only convolutional layers participate in the computation. Let us dig into the low-level details and decide which convolutional layer output to use. The Faster R-CNN paper does not specify a layer, but in the official implementation it can be observed that the author uses the conv5/conv5_1 layer (Caffe code).
Each convolutional layer extracts more abstract features based on the information from the previous layer. The first layers usually learn simple edges, the next layers find patterns of edges in order to activate on more complex shapes, and so on through the network. Eventually we end up with a convolutional feature map that is much smaller than the original image in its spatial dimensions, but much deeper. The width and height of the feature map shrink because of the pooling between convolutional layers, while the depth grows with the number of convolutional filters.
From image to convolutional feature map
The convolutional feature map encodes all the information of the image along the depth dimension while preserving the relative positions of objects in the original image. For example, if a red rectangle in the top-left corner of the image activates a convolutional layer, then the activation for that rectangle stays in the top-left corner of the feature map.
VGG vs. ResNet
Today, ResNet has largely replaced VGG as the base network for extracting features. Three of Faster R-CNN's co-authors (Kaiming He, Shaoqing Ren and Jian Sun) were also authors of "Deep Residual Learning for Image Recognition", the paper that originally introduced ResNets.
ResNet's advantage over VGG is that it is a deeper, larger network, so it has greater capacity to learn the information needed. These conclusions hold for the image classification task and should be equally valid for object detection.
ResNet is also easier to train thanks to residual connections and batch normalization, methods that did not exist when VGG was released.
Anchors
Now we take the processed feature map and propose object regions, that is, the regions of interest for the classification stage. Anchors were mentioned earlier as a way of solving the variable-length problem; we now describe them in detail.
Our goal is to find bounding boxes in the image: rectangles of different sizes and aspect ratios. Imagine we knew beforehand that there were two objects in the image. The first idea that comes to mind is to train a network that returns eight values: two (xmin, ymin, xmax, ymax) tuples, each defining the bounding box of one object. This approach has fundamental problems; for example, images may have different sizes and aspect ratios, so training a model that directly and accurately predicts raw coordinates turns out to be very complicated. Another problem is invalid predictions: when predicting (xmin, xmax) and (ymin, ymax), we would somehow have to enforce xmin < xmax and ymin < ymax.
A simpler approach is to predict offsets relative to reference boxes. Given a reference box (x_center, y_center, width, height), we learn to predict the offsets (Δx_center, Δy_center, Δwidth, Δheight). Since the predictions are small numbers that shift and scale the reference box, the results are much easier to fit.
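The Faster R-CNN paper parameterizes these offsets relative to the reference box's size, using a logarithm for width and height. A minimal NumPy sketch of this encoding and its inverse (hypothetical helper names, not Luminoth's API):

```python
import numpy as np

def encode(anchor, box):
    """Offsets (dx, dy, dw, dh) that turn `anchor` into `box`.
    Both boxes are given as (x_center, y_center, width, height)."""
    xa, ya, wa, ha = anchor
    x, y, w, h = box
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(anchor, deltas):
    """Apply predicted offsets to the reference box."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    return np.array([xa + dx * wa, ya + dy * ha,
                     wa * np.exp(dw), ha * np.exp(dh)])

anchor = np.array([100.0, 100.0, 64.0, 64.0])
target = np.array([110.0, 95.0, 80.0, 50.0])
assert np.allclose(decode(anchor, encode(anchor, target)), target)
```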
Anchors are fixed-size reference boxes of different sizes and aspect ratios placed uniformly over the image, used as reference boxes when predicting object locations.
The convolutional feature map we are working with has dimensions conv_width × conv_height × conv_depth, and we generate a set of anchors for each of the conv_width × conv_height spatial positions. It is important to understand that even though the anchors are defined on the feature map, they ultimately reference the dimensions of the original image.
Because we only use convolutional and pooling layers, the final dimensions of the feature map are proportional to those of the original image. Mathematically, if the image is w × h, the feature map ends up being w/r × h/r, where r is the subsampling ratio. If we define one anchor per spatial position of the feature map, the anchors on the final image end up r pixels apart. In VGG, r = 16.
Anchor centers on the original image
To pick a suitable set of anchors, we typically define a set of fixed sizes (for example, 64px, 128px, 256px; these are the box sizes) and ratios (for example, 0.5, 1, 1.5; these are the box aspect ratios) and use all possible combinations of them as candidate boxes (in this example, each anchor position yields 9 boxes).
Left: anchors. Center: a single anchor position's boxes shown on the original image. Right: all anchor boxes shown on the original image.
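A minimal NumPy sketch of this anchor generation (a hypothetical helper under the assumptions above; Luminoth's actual implementation differs):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     sizes=(64, 128, 256), ratios=(0.5, 1, 1.5)):
    """Return (feat_h * feat_w * 9, 4) anchors as (xmin, ymin, xmax, ymax)
    in original-image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = x * stride, y * stride  # anchor center on the image
            for size in sizes:
                for ratio in ratios:  # keep the area, change the shape
                    w = size * np.sqrt(ratio)
                    h = size / np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(generate_anchors(2, 3).shape)  # (54, 4): 2*3 positions, 9 boxes each
```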
Region Proposal Network
The RPN takes the convolutional feature map and generates proposals over the image
As we mentioned earlier, the RPN takes all the reference boxes (anchors) and outputs a good set of proposals for objects. It does this by providing two different outputs for each anchor.
The first output is the probability that the anchor contains an object. If you like, you can call it the "objectness score". Note that the RPN does not care about the class of the object, only that it actually is an object (and not background). We use this objectness score to filter out bad predictions and prepare for the second stage. The second output is the bounding box regression, which adjusts the anchor to better fit the object it is predicting.
The RPN is implemented in a fully convolutional way, taking the convolutional feature map returned by the base network as input. First, a convolutional layer with 512 channels and a 3×3 kernel is applied, followed by two parallel convolutional layers with 1×1 kernels, whose numbers of channels depend on the number of anchors per position.
Convolutional implementation of the RPN architecture, where k is the number of anchors
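A minimal TensorFlow/Keras sketch of this two-branch head (assuming k = 9 anchors per position and a 512-channel feature map; illustrative only, not Luminoth's actual code):

```python
import tensorflow as tf

k = 9  # anchors per spatial position
feature_map = tf.keras.Input(shape=(None, None, 512))

# 3x3 intermediate convolution shared by both branches.
shared = tf.keras.layers.Conv2D(512, 3, padding="same",
                                activation="relu")(feature_map)
# 1x1 classification branch: 2 scores (background/foreground) per anchor.
cls_scores = tf.keras.layers.Conv2D(2 * k, 1)(shared)
# 1x1 regression branch: 4 box deltas per anchor.
bbox_deltas = tf.keras.layers.Conv2D(4 * k, 1)(shared)

rpn_head = tf.keras.Model(feature_map, [cls_scores, bbox_deltas])
rpn_head.summary()
```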
For the classification layer, we output two predictions per anchor: the score that it is background (not an object) and the score that it is foreground (an actual object).
For the regression, or bounding box adjustment layer, we output four predictions: Δx_center, Δy_center, Δwidth, Δheight. Applying these values to the anchors gives the final proposals.
Using the final proposal coordinates and their objectness scores, we then have a good set of proposals for objects.
Training, Targets, and Loss Functions
The RPN makes two different types of predictions: the binary classification and the bounding box regression adjustment. For training, we take all the anchors and divide them into two categories: "foreground" anchors, which overlap a ground-truth object with an IoU (Intersection over Union) greater than 0.5, and "background" anchors, which do not overlap any ground-truth object or have an IoU of less than 0.1 with every ground-truth box.
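IoU is the area of the overlap between two boxes divided by the area of their union. A minimal sketch of this computation:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    # Coordinates of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    intersection = max(ix_max - ix_min, 0.0) * max(iy_max - iy_min, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```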
We then randomly sample these anchors to form a mini-batch of size 256, trying to maintain a balanced ratio between foreground and background anchors.
The RPN uses all the anchors selected for the mini-batch to calculate the classification loss using binary cross-entropy. Then it uses only the mini-batch anchors marked as foreground to calculate the regression loss. To compute the targets for the regression, we take each foreground anchor and the closest ground-truth object and calculate the correct Δ needed to transform the anchor into that object (their classes do not need to be considered at this stage).
It is recommended to use the Smooth L1 loss for the regression error instead of a simple L1 or L2 loss. Smooth L1 is basically L1, but when the L1 error is small enough, as defined by a parameter σ, the error is considered almost correct and the loss diminishes at a faster rate.
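A minimal sketch of the Smooth L1 loss with the σ parameter (σ = 1 recovers the classic Fast R-CNN definition):

```python
import numpy as np

def smooth_l1(x, sigma=1.0):
    """Quadratic for small errors (|x| < 1/sigma^2), linear otherwise."""
    x = np.abs(x)
    quadratic = 0.5 * (sigma * x) ** 2
    linear = x - 0.5 / sigma ** 2
    return np.where(x < 1.0 / sigma ** 2, quadratic, linear)

print(smooth_l1(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
# [1.5  0.125  0.  0.125  1.5]
```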
Using dynamic batches is challenging for many reasons. Even though we try to maintain a balance between foreground and background anchors, that is not always possible; depending on the ground-truth objects in the image and the sizes and ratios of the anchors, it is possible to end up with zero foreground anchors. In that case, we turn to the anchors with the largest IoU with the ground-truth boxes instead. This is far from ideal, but useful so that there are always foreground samples and targets to learn from.
Post-processing
Non-maximum suppression (NMS): since anchors usually overlap, proposals also end up overlapping over the same object. To solve the problem of duplicate proposals, we use a simple algorithm called non-maximum suppression (NMS). NMS takes the list of proposals sorted by score, iterates over it, and discards any proposal whose IoU with a higher-scoring proposal is greater than a predefined threshold.
While this may seem simple, it is important to be very careful with the IoU threshold. Too low, and you may drop proposals for real objects; too high, and you may end up with many proposals for the same object. A commonly used value is 0.6.
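A minimal sketch of greedy NMS, reusing the iou() helper from the earlier sketch:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.6):
    """Return the indices of the boxes to keep, highest scores first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # Drop every remaining box that overlaps `best` too much.
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) <= iou_threshold])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box duplicates the first
```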
Proposal selection: after applying NMS, we keep the N proposals with the highest scores. The paper uses N = 2000, but it is possible to lower this number to as little as 50 and still get quite good results.
Standalone Application
The RPN can be used by itself, without the second-stage model. In problems where there is only a single class of objects, the objectness probability can be used as the final class probability. This is because in this case, "foreground" = "object class" and "background" = "not object class".
Some examples of machine learning problems that benefit from using the RPN standalone include the popular (but still challenging) problems of face detection and text detection.
One advantage of using only the RPN is the gain in speed, both in training and in prediction. Since the RPN is a very simple network that uses only convolutional layers, prediction is faster than when using the classification base network as well.
Region of Interest Pooling
After the RPN step, we have a bunch of object proposals with no class assigned to them. The next problem to solve is how to classify these bounding boxes into the categories we want.
The simplest approach would be to take each proposal, crop it out of the image, and pass it through the pre-trained base network, then use the extracted features as input to an image classifier. The main problem with this approach is that running all 2000 proposals through the network is computationally very inefficient and slow.
Faster R-CNN tries to solve, or at least mitigate, this problem by reusing the existing convolutional feature map. This is done by extracting a fixed-size feature map for each proposal with Region of Interest Pooling. R-CNN needs fixed-size feature maps in order to classify them into a fixed number of categories.
Region of Interest Pooling
A simpler approach, widely used by object detection implementations including the Luminoth version of Faster R-CNN, is to crop the convolutional feature map using each proposal and then resize each crop to a fixed size (14×14×conv_depth) using interpolation (usually bilinear). After cropping, a max pooling with a 2×2 kernel is used to get a final 7×7×conv_depth feature map for each proposal.
The reason for choosing these exact shapes is related to how the next module (R-CNN) uses them; these settings depend on the second stage.
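A minimal TensorFlow sketch of this crop-and-resize variant, using tf.image.crop_and_resize (the shapes follow the description above; the proposal values are made up for the example):

```python
import tensorflow as tf

feature_map = tf.random.uniform((1, 37, 50, 512))  # (batch, H, W, depth)
# Two proposals as normalized (y1, x1, y2, x2) boxes over the feature map.
proposals = tf.constant([[0.1, 0.2, 0.5, 0.6],
                         [0.3, 0.1, 0.9, 0.4]])
box_indices = tf.zeros(2, dtype=tf.int32)  # both come from image 0

# Bilinear crop to 14x14, then 2x2 max pooling down to 7x7.
crops = tf.image.crop_and_resize(feature_map, proposals, box_indices,
                                 crop_size=(14, 14))
rois = tf.nn.max_pool2d(crops, ksize=2, strides=2, padding="SAME")
print(rois.shape)  # (2, 7, 7, 512)
```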
Region-based Convolutional Neural Network
The region-based convolutional neural network (R-CNN) is the final step in the Faster R-CNN pipeline. After obtaining a convolutional feature map from the image, using it to generate object proposals with the RPN, and extracting features for each proposal (via RoI Pooling), we finally need to use these features for classification. R-CNN tries to mimic the final stage of classification CNNs, where a fully connected layer outputs a score for each possible object class.
R-CNN has two different goals:
1. Classify each proposal into one of the classes, plus a background class (for removing bad proposals).
2. Better adjust the proposal's bounding box according to the predicted class.
In the original Faster R-CNN paper, the R-CNN takes the feature map for each proposal, flattens it, and uses two fully connected layers of size 4096 with ReLU activation.
It then uses two different fully connected layers for the two outputs:
- A fully connected layer with N+1 units, where N is the total number of classes and the extra unit is for the background class.
- A fully connected layer with 4N units. We want a regression prediction, so we need Δcenter_x, Δcenter_y, Δwidth, Δheight for each of the N possible classes.
R-CNN Architecture
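A minimal TensorFlow/Keras sketch of this head (assuming N = 20 classes and the 7×7×512 RoI features from above; illustrative only, not Luminoth's actual code):

```python
import tensorflow as tf

num_classes = 20  # N; the classifier adds one background class
roi_features = tf.keras.Input(shape=(7, 7, 512))  # output of RoI pooling

x = tf.keras.layers.Flatten()(roi_features)
x = tf.keras.layers.Dense(4096, activation="relu")(x)
x = tf.keras.layers.Dense(4096, activation="relu")(x)

# N + 1 class scores, including the background class.
cls_scores = tf.keras.layers.Dense(num_classes + 1, activation="softmax")(x)
# 4N box deltas: one (dx, dy, dw, dh) set per non-background class.
bbox_deltas = tf.keras.layers.Dense(4 * num_classes)(x)

rcnn_head = tf.keras.Model(roi_features, [cls_scores, bbox_deltas])
rcnn_head.summary()
```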
Training and Targets
The targets for R-CNN are calculated in almost the same way as the RPN targets, but taking the different possible classes into account. We take the proposals and the ground-truth boxes and calculate the IoU between them.
Any proposal with an IoU greater than 0.5 with any ground-truth box is assigned to that ground truth. Proposals whose IoU is between 0.1 and 0.5 are labeled background. Contrary to what we did when assigning targets for the RPN, proposals with no overlap are ignored. This is because at this stage we assume we already have good proposals and are more interested in solving the harder cases. Of course, all these values are hyperparameters that can be tuned to better fit the kind of objects you are trying to find.
The targets for the bounding box regression are the offsets between the proposals and their corresponding ground-truth boxes, calculated only for the proposals that have been assigned a class based on the IoU threshold.
We randomly sample a balanced mini-batch of size 64, with up to 25% foreground proposals (with a class) and 75% background.
Following the same treatment as for the RPN losses, the classification loss is now a multi-class cross-entropy loss over all the selected proposals, and the Smooth L1 loss is applied to the (up to) 25% of proposals matched to a ground-truth box. Since the output of the R-CNN fully connected network for bounding box regression has one prediction per class, we must be careful when computing this loss: only the prediction for the correct class should be taken into account.
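A minimal TensorFlow sketch of that last point: selecting, for each proposal, only the four box deltas of its assigned class (shapes and class count are assumptions for the example):

```python
import tensorflow as tf

num_classes = 20
batch = 64
bbox_deltas = tf.random.normal((batch, 4 * num_classes))  # network output
# Assigned (foreground) class for each proposal in the mini-batch.
labels = tf.random.uniform((batch,), 0, num_classes, dtype=tf.int32)

# Reshape to one (dx, dy, dw, dh) row per class, then pick each
# proposal's own class so only those deltas enter the Smooth L1 loss.
deltas_per_class = tf.reshape(bbox_deltas, (batch, num_classes, 4))
selected = tf.gather(deltas_per_class, labels, axis=1, batch_dims=1)
print(selected.shape)  # (64, 4)
```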
Post-processing
Similar to the RPN, we end up with a bunch of objects with classes assigned that need further processing before being returned.
To apply the bounding box adjustments, we must take the class with the highest probability for each proposal, and ignore those proposals whose highest-probability class is the background class.
After getting the final objects and ignoring those predicted as background, we apply class-based NMS. This is done by grouping the objects by class, sorting them by probability, and then applying NMS to each independent group.
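A minimal sketch of class-based NMS, reusing the nms() helper from the earlier sketch (hypothetical helper, not Luminoth's API):

```python
import numpy as np

def classwise_nms(boxes, scores, labels, iou_threshold=0.6):
    """Run NMS independently within each predicted class."""
    keep = []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]  # this class's detections
        kept = nms(boxes[idx], scores[idx], iou_threshold)
        keep.extend(int(idx[k]) for k in kept)
    return keep
```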
For the final list of objects, we can also set a probability threshold and a limit on the number of objects per class.
Training
In the original paper, Faster R-CNN was trained in a multi-step way, training the parts independently and merging the trained weights before applying a final full training pass. Since then, it has been found that end-to-end joint training leads to better results.
After putting the complete model together, we end up with four different losses: two for the RPN and two for R-CNN. We have the trainable layers in the RPN and R-CNN, and we also have the base network, which we can train (fine-tune) or not.
The decision whether to train the base network depends on the nature of the objects we want to learn and the computing power available. If we want to detect objects similar to those in the dataset the base network was trained on, there is no real need to train it, other than trying to squeeze out all the performance we can get. On the other hand, training the base network is expensive in both time and the hardware needed to hold the full gradients.
The four different losses are combined with a weighted sum. This is because we may want to give the classification losses more weight than the regression losses, or give the R-CNN losses more weight than the RPN's.
Besides the regular losses, we also have regularization losses, which we skip here for brevity, but which can be defined in both the RPN and R-CNN. We use L2 regularization on some layers. Depending on which base network is used, and whether it is trained, it may carry regularization as well.
We train with stochastic gradient descent with momentum, setting the momentum value to 0.9. You can easily train Faster R-CNN with any other optimizer without running into major problems.
The learning rate starts at 0.001 and drops to 0.0001 after 50k steps. This is one of the hyperparameters that usually matter most. When training with Luminoth, we often start with the default values and tune from there.
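A minimal TensorFlow/Keras sketch of this optimizer setup (the schedule values come from the text above):

```python
import tensorflow as tf

# Learning rate: 0.001 for the first 50k steps, then 0.0001.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[50_000], values=[1e-3, 1e-4])
# Stochastic gradient descent with momentum 0.9.
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```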
Evaluation
Evaluation is done using the standard mean Average Precision (mAP) at a specific IoU threshold (for example, mAP@0.5). mAP is a metric that comes from information retrieval and is commonly used for calculating errors in ranking problems and for evaluating object detection problems.
We will not go into the details, since these kinds of metrics deserve a full blog post of their own, but the important thing is that mAP penalizes you for missing boxes you should have detected, as well as for detecting things that do not exist or detecting the same thing more than once.
Conclusion
By now, you should have a clear idea of how Faster R-CNN works, why it was designed the way it was, and how it can be adapted to specific cases. If you want to get a deeper understanding of how it works, you should check out the Luminoth implementation.
Faster R-CNN is one of the models that proved that complex computer vision problems can be solved with the same principles that showed such amazing results at the start of this new deep learning revolution.
New models are being built on this original one, not only for object detection but also for semantic segmentation, 3D object detection, and more. Some borrow the RPN, some borrow the R-CNN, and others build on both. That is why it is important to fully understand the underlying architecture, in order to tackle a wider and more complex range of problems.
Original address: tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection/