Transferred from: https://www.cnblogs.com/guoyaohua/p/8994246.html
Object detection is the foundation of many computer vision tasks, and it provides reliable information whether we want to interact with the scene or identify fine-grained categories. The first part of this article reviews detectors based on region proposals, starting from R-CNN and covering Fast R-CNN, Faster R-CNN and FPN. The second part focuses on single-shot detectors, including YOLO, SSD and RetinaNet, which are among the best-performing methods at present.
1. Detectors based on region proposals
1.1 Sliding window detector
Since AlexNet won the ILSVRC 2012 challenge, classification with CNNs has become mainstream. A brute-force approach to object detection is to slide a window over the image from left to right and top to bottom, and use classification to identify what each window contains. To detect different object types at different viewing distances, we use windows of different sizes and aspect ratios.
Sliding window (left to right, top to bottom)
We crop image patches from the image according to the sliding window. Because many classifiers only accept fixed-size images, the patches are warped to that size. The warping does not noticeably affect classification accuracy, because the classifier is trained to handle warped images.
Warp an image patch into a fixed-size image
The warped image patch is fed into a CNN classifier to extract 4,096 features. We then apply an SVM classifier to identify the category, and a separate linear regressor to refine the bounding box.
System workflow diagram of sliding window detector
Here is the pseudo-code. We create many windows to detect different objects at different locations. An obvious way to improve performance is to reduce the number of windows.
for window in windows:
    patch = get_patch(image, window)
    results = detector(patch)
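To make the idea concrete, here is a minimal runnable sketch in Python (NumPy and Pillow assumed). The window sizes and stride are illustrative, and classify_patch is a hypothetical stand-in for the CNN + SVM classifier described above:

import numpy as np
from PIL import Image

def sliding_windows(image, sizes=((64, 64), (128, 128)), stride=32):
    # Yield (x, y, w, h) boxes covering the image for each window size.
    H, W = image.shape[:2]
    for win_h, win_w in sizes:
        for y in range(0, H - win_h + 1, stride):
            for x in range(0, W - win_w + 1, stride):
                yield x, y, win_w, win_h

def detect(image, classify_patch, input_size=(224, 224)):
    results = []
    for x, y, w, h in sliding_windows(image):
        patch = image[y:y + h, x:x + w]
        # Warp the crop to the fixed size the classifier expects.
        patch = np.asarray(Image.fromarray(patch).resize(input_size))
        results.append(((x, y, w, h), classify_patch(patch)))
    return results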
1.2 Selective Search
Instead of a brute-force method, we use a region proposal method to create regions of interest (ROIs) for object detection. In Selective Search (SS), we start by treating each individual pixel as its own group. We then compute the texture of each group and merge the two groups that are closest. To avoid a single region swallowing all the others, we prefer to merge smaller groups first. We keep merging regions until everything has been combined. In the first row of the figure, we see how the regions grow; the blue rectangles in the second row show all the possible ROIs generated during the merging process.
Figure Source: Van de Sande et al., ICCV '11
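Implementing the merging procedure by hand is laborious, but OpenCV's contrib package ships a selective search implementation that can be used to generate proposals. A minimal sketch, assuming opencv-contrib-python is installed (a convenient substitute, not the authors' original code):

import cv2

def selective_search_rois(image_bgr, max_rois=2000):
    # Returns up to max_rois proposals as (x, y, w, h) rectangles.
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    ss.switchToSelectiveSearchFast()   # trades some recall for speed
    rects = ss.process()
    return rects[:max_rois]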
1.3 R-CNN
R-CNN uses a region proposal method to create about 2,000 ROIs. Each region is warped into a fixed-size image and fed into a convolutional neural network (patches are cropped from the original image according to each ROI, resized, and fed into the network). The network is followed by several fully connected layers that classify the object and refine the bounding box.
Using region proposals, a CNN, and fully connected (affine) layers to locate objects. The following is a flowchart of the whole R-CNN system:
By using fewer but higher-quality ROIs, R-CNN is faster and more accurate than the sliding window approach.
ROIs = region_proposal(image)
for ROI in ROIs:
    patch = get_patch(image, ROI)
    results = detector(patch)
Region proposal methods are computationally expensive. To speed up the process, we usually adopt a cheaper region proposal method to create the ROIs, and later refine the bounding boxes with a linear regressor (using fully connected layers).
Using regression to refine the original blue bounding box into the red one
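The refinement step usually predicts offsets relative to the proposal box rather than absolute coordinates. Below is a minimal NumPy sketch of the common parameterization; the regressor that produces the deltas (tx, ty, tw, th) is assumed to exist elsewhere:

import numpy as np

def apply_box_deltas(box, deltas):
    # Refine a proposal (x1, y1, x2, y2) with predicted deltas (tx, ty, tw, th).
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    tx, ty, tw, th = deltas
    # Shift the center proportionally to the box size, rescale width and height.
    cx, cy = cx + tx * w, cy + ty * h
    w, h = w * np.exp(tw), h * np.exp(th)
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])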
1.4 Fast R-CNN
R-CNN needs a very large number of proposals to be accurate, and many of these regions overlap each other, so R-CNN training and inference are very slow. If we have 2,000 proposals, each is processed by the CNN separately, which means we repeat feature extraction 2,000 times for the different ROIs. (Many of R-CNN's convolution computations are redundant.)
In addition, feature maps in a CNN represent spatial features densely, so can we use the feature maps instead of the original image to detect objects?
Computing ROIs directly from feature maps
Fast R-CNN uses a feature extractor (a CNN) to extract features for the whole image first, instead of extracting features for each image patch from scratch multiple times. The region proposal method can then be applied directly to the extracted feature map. For example, Fast R-CNN takes the feature map output by the conv5 layer of VGG16 to generate ROIs, which are then combined with the corresponding regions of the feature map, cropped as feature patches, and used for object detection. We use ROI pooling to convert the feature patches to a fixed size and feed them to fully connected layers for classification and localization. Because Fast R-CNN does not extract features repeatedly, processing time is reduced significantly.
Apply region proposals directly to the feature map and use ROI pooling to convert them to fixed-size feature patches
The following is a flowchart for Fast R-CNN:
In the pseudo-code below, the expensive feature extraction is moved out of the for loop, which is a significant speed-up because features are now extracted once for all 2,000 ROIs. Fast R-CNN is roughly 10 times faster than R-CNN to train and roughly 150 times faster at inference.
feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)
The most important point of Fast R-CNN is that the whole network (the feature extractor, the classifier, and the bounding box regressor) can be trained end-to-end with a multi-task loss that combines the classification loss and the localization loss, which greatly improves accuracy.
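A minimal PyTorch sketch of such a multi-task loss is shown below. It is a simplification (the actual Fast R-CNN loss uses class-specific box predictions); the variable names are illustrative and label 0 is assumed to be the background class:

import torch
import torch.nn.functional as F

def multi_task_loss(class_logits, box_preds, labels, box_targets, lam=1.0):
    # Classification loss over all ROIs plus localization loss over foreground ROIs only.
    cls_loss = F.cross_entropy(class_logits, labels)
    fg = labels > 0
    if fg.any():
        loc_loss = F.smooth_l1_loss(box_preds[fg], box_targets[fg])
    else:
        loc_loss = box_preds.sum() * 0.0   # no foreground ROI in this batch
    return cls_loss + lam * loc_loss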
Because Fast R-CNN uses fully connected layers, we apply ROI pooling to convert ROIs of different sizes to a fixed size.
For brevity, we first convert an 8x8 feature map to a predefined 2x2 size.
Top left corner: feature map.
Upper-right corner: overlay the ROI (blue region) on the feature map.
Bottom left: split the ROI into the target dimensions. For example, with a 2x2 target, we split the ROI into 4 sections of similar or equal size.
Bottom right corner: Find the maximum value for each section and get the transformed feature map.
Input feature map (top left), output feature map (bottom right), ROI (upper right, blue box)
Following the steps above, we obtain a 2x2 feature patch that can be fed to the classifier and the bounding box regressor.
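A minimal NumPy sketch of this max-pooling step for a single-channel feature map is given below. It assumes integer ROI coordinates on the feature map; real implementations pool each channel and handle fractional bin boundaries:

import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    # Max-pool the ROI (x1, y1, x2, y2 in feature-map cells) into a fixed output grid.
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split rows and columns into roughly equal sections, take the max of each cell.
    row_bins = np.array_split(np.arange(region.shape[0]), out_h)
    col_bins = np.array_split(np.arange(region.shape[1]), out_w)
    out = np.empty(output_size, dtype=feature_map.dtype)
    for i, rows in enumerate(row_bins):
        for j, cols in enumerate(col_bins):
            out[i, j] = region[np.ix_(rows, cols)].max()
    return out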
1.5 Faster R-CNN
Fast R-CNN depends on an external region proposal method such as selective search. However, those algorithms run on the CPU and are slow. At test time, Fast R-CNN takes 2.3 seconds to make a prediction, of which 2 seconds are spent generating the 2,000 ROIs.
feature_maps = process(image)
ROIs = region_proposal(feature_maps)  # expensive!
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)
Faster R-CNN uses the same design as Fast R-CNN, except that it replaces the external region proposal method with an internal deep network. The new region proposal network (RPN) is more efficient at generating ROIs and runs at about 10 milliseconds per image.
The Faster R-CNN flowchart is the same as that of Fast R-CNN
The external region proposal method is replaced by an internal deep network
- Region Proposal Network (RPN)
The region proposal network (RPN) takes the output feature map of the first convolutional network as input. It slides 3x3 convolution kernels over the feature map to build category-agnostic region proposals using a convolutional network (the ZF network shown below). Other deep networks such as VGG or ResNet can be used for more comprehensive feature extraction, at the cost of speed. The ZF network outputs 256 values at each location, which are fed to two separate fully connected layers that predict a bounding box and two objectness scores; the objectness scores measure whether the box contains an object. We could use a regressor to compute a single objectness score, but for simplicity Faster R-CNN uses a classifier with two classes: "contains an object" and "does not contain an object".
For each location in the feature map, the RPN makes k predictions, so the RPN outputs 4xk coordinates and 2xk scores per location. The figure shows an 8x8 feature map with a 3x3 convolution kernel applied to it, producing 8x8x3 ROIs (for k = 3). The right side shows the 3 proposals made at a single location.
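A minimal PyTorch sketch of such a head is shown below: a 3x3 convolution over the feature map followed by two sibling 1x1 convolutions, which play the role of the fully connected layers applied at every position and produce 2k objectness scores and 4k box offsets per location. The channel counts are illustrative (256 matches the ZF description above) and k = 3 is assumed:

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    # 3x3 conv over the feature map, then 1x1 convs for 2k scores and 4k box offsets.
    def __init__(self, in_channels=256, mid_channels=256, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2 * k, 1)   # object / not-object per anchor
        self.reg = nn.Conv2d(mid_channels, 4 * k, 1)   # box offsets per anchor
    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# Example: an 8x8 feature map gives 8x8x(2k) scores and 8x8x(4k) offsets.
scores, offsets = RPNHead()(torch.randn(1, 256, 8, 8))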
Here we have 3 guesses that we will refine later. Since we only need one correct guess, our initial guesses should cover different shapes and sizes. Therefore, Faster R-CNN does not create random bounding boxes. Instead, it predicts offsets (such as in x and y) relative to reference boxes called anchors. Because we constrain the values of those offsets, the refined guesses still resemble the anchors.
To make k predictions per location, we need k anchors centered at that location. Each prediction is associated with a specific anchor, and anchors of the same shape are shared across locations.
These anchors are carefully chosen so that they are diverse and cover realistic objects of different scales and aspect ratios. This lets the initial training start from better guesses and allows each prediction to specialize in a particular shape, which makes early training more stable and easier.
Figure Source: Https://arxiv.org/pdf/1506.01497.pdf
Faster R-CNN uses more anchors: it deploys 9 anchor boxes, i.e. 3 anchor boxes of different aspect ratios at each of 3 sizes. With 9 anchors per location, it generates 2x9 objectness scores and 4x9 coordinates per location.
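A minimal NumPy sketch of generating those 9 anchor boxes at a single location is shown below, using the commonly quoted Faster R-CNN defaults (scales of 128, 256, 512 pixels and aspect ratios 0.5, 1, 2); treat the exact values as illustrative:

import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Return 9 anchor boxes (x1, y1, x2, y2) centered at (cx, cy).
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the anchor area near s*s while varying the aspect ratio w/h = r.
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

print(anchors_at(100, 100).shape)   # (9, 4)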
- Performance of R-CNN methods
As the figure shows, Faster R-CNN is much faster.
1.6 Region-based Fully Convolutional Network (R-FCN)
Suppose we have only one feature map that detects the right eye. Can we use it to locate a face? Yes. Since the right eye should appear in the upper-left region of a face image, we can use that to locate the whole face.
If we have other features to detect the left eye, nose or mouth, we can combine the results to better position the face.
Now let's come back to the problem. In Faster R-CNN, the detector applies multiple fully connected layers to make predictions. With 2,000 ROIs, this is very expensive.
feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    class_scores, box = detector(patch)  # expensive!
    class_probabilities = softmax(class_scores)
R-FCN speeds this up by reducing the amount of work needed per ROI (the fully connected layers are removed). The region-based score maps described below do not depend on any particular ROI and can be computed once, outside the per-ROI loop. The remaining per-ROI work is very cheap, so R-FCN is faster than Faster R-CNN.
feature_maps = process(image)
ROIs = region_proposal(feature_maps)
score_maps = compute_score_map(feature_maps)
for ROI in ROIs:
    V = region_roi_pool(score_maps, ROI)
    class_scores, box = average(V)  # much simpler!
    class_probabilities = softmax(class_scores)
Now consider a 5x5 feature map M that contains a blue square. We divide the square evenly into 3x3 regions. We then create a new feature map from M that only detects the upper-left (TL) corner of the square. The new feature map is shown on the right: only the yellow grid cell [2, 2] is activated.
Create a new feature map on the left to detect the upper-left corner of the target
We divide the square into 9 parts and create 9 feature maps, each detecting the corresponding region of the object. These feature maps are called position-sensitive score maps, because each map detects (scores) one sub-region of the object.
Generating 9 score maps
The red dashed rectangle is the proposed ROI. We divide it into a 3x3 grid and ask how likely each cell is to contain the corresponding part of the object; for example, how likely the top-left ROI cell is to contain the left eye. We store the results in a 3x3 vote array, shown on the right. For example, vote_array[0][0] contains the score for whether the top-left cell contains the corresponding part of the object.
Applying ROI to a feature map, outputting a 3 x 3 array
The process of mapping the score maps and an ROI to the vote array is called position-sensitive ROI pooling (PS-ROI pooling). It is very similar to the ROI pooling discussed earlier.
Applying part of the ROI to the corresponding score map to compute V[i][j]
After calculating all the values of the location-sensitive ROI pooling, the category score is the average of all of its element scores.
ROI pooling
If we have C object categories to detect, we extend them to C + 1 categories by adding one class for the background (no object). Each category has its own 3x3 set of score maps, so there are (C + 1) x 3 x 3 score maps in total. The score maps of each category are used to predict that category's class score, and a softmax over these scores gives the per-class probabilities.
The following is the flowchart; in our example, k = 3.
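A minimal NumPy sketch of the vote-and-average step for one ROI is given below. It assumes score_maps is laid out as (C + 1, k, k, H, W), i.e. one map per class and per sub-region, and that the ROI spans at least k feature-map cells in each direction; the real R-FCN layout and interpolation details differ:

import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    # score_maps: (num_classes, k, k, H, W); roi: (x1, y1, x2, y2) in feature-map cells.
    # Each (i, j) bin of the ROI is averaged over its own score map, and the k*k votes
    # are then averaged into a single score per class.
    x1, y1, x2, y2 = roi
    num_classes = score_maps.shape[0]
    xs = np.linspace(x1, x2, k + 1).astype(int)   # bin boundaries along x
    ys = np.linspace(y1, y2, k + 1).astype(int)   # bin boundaries along y
    votes = np.zeros((num_classes, k, k))
    for i in range(k):
        for j in range(k):
            cell = score_maps[:, i, j, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            votes[:, i, j] = cell.reshape(num_classes, -1).mean(axis=1)
    return votes.mean(axis=(1, 2))                # one score per class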
1.7 R-CNN Series Summary
We started with the basic sliding window algorithm:
for window in windows:
    patch = get_patch(image, window)
    results = detector(patch)
Then we tried to reduce the number of windows to minimize the work inside the for loop.
ROIs = region_proposal(image)
for ROI in ROIs:
    patch = get_patch(image, ROI)
    results = detector(patch)
2. Single-shot object detectors
In the second part, we review single-shot object detectors (including SSD, YOLO, YOLOv2 and YOLOv3). We analyze FPN to understand how multi-scale feature maps improve accuracy, especially for small objects, which single-shot detectors usually handle poorly. Then we analyze focal loss and RetinaNet to see how they address class imbalance during training.
2.1 Single-shot detectors
In Faster R-CNN, a dedicated region proposal network is followed by a classifier.
Faster R-CNN Workflow
Region-based detectors are accurate, but this comes at a cost: Faster R-CNN processes about 7 frames per second (7 FPS) on the PASCAL VOC 2007 test set. As with R-FCN, researchers streamline the pipeline by reducing the amount of work done per ROI.
feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_align(feature_maps, ROI)
    results = detector2(patch)  # reduce the amount of work here!
As an alternative, do we really need a separate region proposal step? Can we obtain the bounding boxes and classes directly in a single step?
feature_maps = process(image)
results = detector3(feature_maps)  # no more separate step for ROIs
Let's look at the sliding window detector again. We can detect objects by sliding windows over the feature map, using different window types for different object types. The fatal flaw of the earlier sliding window approach is that it uses the window itself as the final bounding box, which requires very many window shapes to cover most objects. A more efficient approach is to treat the window as an initial guess, so that we have a detector that predicts both the class and the bounding box from the current sliding window.
Prediction based on sliding window
This idea is similar to the anchors in Faster R-CNN, except that a single-shot detector predicts the bounding box and the class at the same time. For example, with an 8x8 feature map we make k predictions at each location, i.e. 8x8xk predictions in total.
64 positions
At each location, we have k anchors (an anchor is a fixed initial bounding box guess), and each prediction is associated with a specific anchor. We choose the anchors carefully, and every location uses the same set of anchor shapes.
Make 4 predictions at each location using 4 anchor points
The following are 4 anchor points (green) and 4 corresponding predictions (blue), each of which corresponds to a specific anchor point.
4 predictions, one anchor per prediction
In Faster R-CNN, we use a convolution kernel to predict 5 parameters per anchor: 4 for the predicted box of the anchor and 1 for the objectness confidence score. So a 3x3xDx5 convolution kernel converts the feature map from 8x8xD to 8x8x5.
Calculate predictions using a 3x3 convolution kernel
In a single-shot detector, the convolution kernels also predict C class probabilities for classification (one probability per class). So we apply a 3x3xDx25 convolution kernel to convert the feature map from 8x8xD to 8x8x25 (with C = 20).
k predictions per location, with 25 parameters per prediction
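A minimal PyTorch sketch of such a prediction layer is shown below: one 3x3 convolution outputs, for each of the k anchors at every location, 4 box parameters, 1 confidence score and C class probabilities, i.e. k x (5 + C) channels (with C = 20 this gives the 25 parameters per prediction mentioned above). The channel counts are illustrative:

import torch
import torch.nn as nn

class SingleShotHead(nn.Module):
    # One 3x3 conv producing k * (4 box + 1 confidence + C classes) channels per location.
    def __init__(self, in_channels=512, k=3, num_classes=20):
        super().__init__()
        self.k, self.num_classes = k, num_classes
        self.pred = nn.Conv2d(in_channels, k * (5 + num_classes), 3, padding=1)
    def forward(self, feature_map):
        out = self.pred(feature_map)                    # (N, k*(5+C), H, W)
        N, _, H, W = out.shape
        return out.view(N, self.k, 5 + self.num_classes, H, W)

# Example: an 8x8 feature map gives k predictions of 25 parameters at each location.
preds = SingleShotHead()(torch.randn(1, 512, 8, 8))
print(preds.shape)   # torch.Size([1, 3, 25, 8, 8])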
Single-shot detectors typically trade some accuracy for real-time processing speed. They are prone to problems when objects are too close together or too small. In the image below, there are 9 Santas in the lower-left corner, but the single-shot detector only detects 5 of them.
2.2 SSD (Single Shot MultiBox Detector)
SSD is a single-shot detector that uses a VGG16 network as its feature extractor (just as Faster R-CNN uses a CNN as its feature extractor). We add custom convolutional layers (blue) after that network and use convolution kernels (green) to make predictions.
Predict categories and locations simultaneously in a single pass
However, convolutional layers reduce the spatial dimension and resolution, so the model above can only detect larger objects. To solve this problem, we perform independent object detection from multiple feature maps at different scales.
Use multi-scale feature maps to detect
Below is an illustration of the feature maps.
Figure Source: Https://arxiv.org/pdf/1512.02325.pdf
SSD uses layers deep in the convolutional network to detect objects. If we redrew the diagram close to its true scale, we would see that the spatial resolution of those maps has dropped significantly, and they may fail to locate small objects that are hard to detect at low resolution. If that is a problem, we need to increase the resolution of the input image.
2.3 YOLO
YOLO is another single-shot object detector.
YOLO uses DarkNet for feature extraction, followed by additional convolutional layers for detection.
However, it does not perform independent detection on multiple feature maps of different scales. Instead, it partially flattens a higher-resolution feature map and concatenates it with a lower-resolution one. For example, YOLO reshapes a 28x28x512 layer into 14x14x2048 and concatenates it with a 14x14x1024 feature map; YOLO then applies convolution kernels on the new 14x14x3072 layer to make predictions.
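A minimal NumPy sketch of that reshaping step (often called a passthrough or space-to-depth layer) is shown below: each 2x2 spatial block of the 28x28x512 map is stacked into the channel dimension, giving 14x14x2048, which is then concatenated with the 14x14x1024 map:

import numpy as np

def space_to_depth(x, block=2):
    # Rearrange (H, W, C) into (H/block, W/block, C*block*block).
    H, W, C = x.shape
    x = x.reshape(H // block, block, W // block, block, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # group each 2x2 spatial block together
    return x.reshape(H // block, W // block, C * block * block)

fine = np.random.rand(28, 28, 512)
coarse = np.random.rand(14, 14, 1024)
merged = np.concatenate([space_to_depth(fine), coarse], axis=-1)
print(merged.shape)   # (14, 14, 3072)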
YOLO (v2) introduced many implementation improvements, raising the mAP from 63.4 in the first release to 78.6. YOLO9000 can detect 9,000 different categories of objects.
Figure Source: Https://arxiv.org/pdf/1612.08242.pdf
Below are mAP and FPS comparisons of different detectors from the YOLO paper. YOLOv2 can accept input images of different resolutions: lower-resolution images give higher FPS but lower mAP.
Figure Source: Https://arxiv.org/pdf/1612.08242.pdf
YOLOv3 uses a more complex backbone network to extract features. Darknet-53 is mainly composed of 3x3 and 1x1 convolution kernels with ResNet-like skip connections. Compared to ResNet-152, Darknet-53 requires fewer BFLOPs (billions of floating-point operations) but achieves the same classification accuracy at about twice the speed.
Figure Source: Https://pjreddie.com/media/files/papers/YOLOv3.pdf
YOLOv3 also adds a feature pyramid to better detect small objects. Below is the accuracy-versus-speed trade-off for different detectors.
Figure Source: Https://pjreddie.com/media/files/papers/YOLOv3.pdf
Feature Pyramid Network (FPN)
Detecting objects at different scales is challenging, especially for small objects. The Feature Pyramid Network (FPN) is a feature extractor designed to improve both accuracy and speed. It replaces the feature extractor in detectors such as Faster R-CNN and produces a higher-quality pyramid of feature maps.
Data flow
FPN (Figure Source: https://arxiv.org/pdf/1612.03144.pdf)
FPN consists of a bottom-up and a top-down pathway. The bottom-up pathway is the usual convolutional network for feature extraction. Going up, the spatial resolution decreases while higher-level structures are detected, so the semantic value of each layer increases.
Feature Extraction in FPN (edited from the original paper)
SSD also performs detection from multiple feature maps. However, the lowest layers are not used for object detection: their resolution is high but their semantic value is too low, and using them would slow detection down significantly. So SSD uses only the upper layers for detection, and therefore performs poorly on small objects.
Image modified from the paper Https://arxiv.org/pdf/1612.03144.pdf
FPN provides a top-down path to build a high-resolution layer from a semantically rich layer.
Rebuild spatial resolution from top to bottom (edited from original paper)
Although the reconstructed layers are semantically strong, the locations of objects are imprecise after all the downsampling and upsampling. Adding lateral connections between the reconstructed layers and the corresponding feature maps makes localization more accurate.
Increased skip connection (from original paper)
The figure below details the bottom-up and top-down pathways, where P2, P3, P4 and P5 form the pyramid of feature maps used for object detection.
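A minimal PyTorch sketch of the top-down path with lateral connections is given below: each bottom-up map passes through a 1x1 convolution, is added to the upsampled coarser level, and a 3x3 convolution smooths the sum into the output pyramid level. The channel counts are illustrative (a ResNet-like backbone is assumed):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    # Build P2..P5 from bottom-up maps C2..C5 (finest to coarsest).
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])
    def forward(self, c_maps):                      # [C2, C3, C4, C5]
        laterals = [lat(c) for lat, c in zip(self.lateral, c_maps)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: add the upsampled coarser map
            laterals[i] = laterals[i] + F.interpolate(laterals[i + 1], scale_factor=2,
                                                      mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]   # [P2, P3, P4, P5]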
FPN combined with RPN
FPN is not an object detector by itself; it is a feature extractor that works together with an object detector. Detection is performed separately on each of the feature maps (P2 to P5).
FPN with Fast r-cnn or Faster r-cnn
In FPN, we generate a pyramid of feature maps and use an RPN (see above) to generate ROIs. Based on the size of each ROI, we select the feature map layer at the most appropriate scale to extract the feature patch.
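In the FPN paper this assignment follows the rule k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4, so a 224x224 ROI maps to P4 and smaller ROIs map to finer levels. A one-function sketch:

import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    # Pick the pyramid level Pk from which to pool features for a w x h ROI.
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(max(k, k_min), k_max)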
Hard negative mining
For most detection algorithms, such as SSD and YOLO, we make far more predictions than there are actual objects, so there are many more negative matches than positive ones. This creates a class imbalance that hurts training: the model learns much more about the background than about the objects. However, we do need negative samples so the model learns what a bad prediction is. So we sort the negative samples by their confidence loss and keep only the top ones, so that the ratio of negatives to positives is at most 3:1. This makes training faster and more stable.
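A minimal NumPy sketch of that selection step (in the style of SSD's hard negative mining) is shown below; it assumes per-anchor confidence losses and boolean positive-match flags have already been computed:

import numpy as np

def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    # Keep all positives plus the highest-loss negatives, at most 3 negatives per positive.
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    neg_losses = np.where(is_positive, -np.inf, conf_loss)   # mask out positives
    hardest = np.argsort(neg_losses)[::-1][:num_neg]         # hardest negatives first
    keep = np.zeros_like(is_positive)
    keep[hardest] = True
    return keep | is_positive                                # mask of samples used in the loss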
Non-maximum suppression at inference time
Detectors often produce multiple detections for the same object. We use non-maximum suppression (NMS) to remove duplicate detections with lower confidence: sort the predictions from highest to lowest confidence, and if any prediction has the same class as the current one and an IoU with it greater than 0.5, remove it from the list.
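A minimal NumPy sketch of greedy NMS for a single class is given below, with boxes given as (x1, y1, x2, y2) rows and one confidence score per box:

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]
    return keep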
Focal Loss (RetinaNet)
Class imbalance hurts performance. SSD resamples the ratio of object examples to background examples during training so that the background does not overwhelm training. Focal loss (FL) takes a different approach: it reduces the loss of well-classified examples. As long as the model already detects the background well, the background's loss contribution shrinks, which re-focuses training on the object classes. We start from the cross-entropy loss CE and add a weight that down-weights CE for high-confidence classes.
For example, with γ = 0.5, the focal loss of a well-classified example approaches 0.
Edited from the original paper
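The focal loss from the paper is FL(p_t) = -(1 - p_t)^γ · log(p_t), where p_t is the predicted probability of the true class; the paper also adds a class-balancing weight α, omitted here. A minimal binary-classification sketch in NumPy:

import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    # p: predicted foreground probability, y: 1 for foreground, 0 for background.
    p_t = np.where(y == 1, p, 1.0 - p)            # probability assigned to the true class
    p_t = np.clip(p_t, eps, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)  # well-classified samples contribute little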
RetinaNet is built on FPN and ResNet, trained with focal loss.
RetinaNet
Summary of object detection algorithms