ROI Align is a region feature aggregation method proposed in the Mask R-CNN paper. It resolves the region mismatch (misalignment) caused by the two quantization steps in the ROI Pooling operation. Experiments show that replacing ROI Pooling with ROI Align in detection tasks improves the accuracy of the detection model.
1. Analysis of the limitations of ROI Pooling
In common two-stage detection frameworks (such as Fast R-CNN, Faster R-CNN, and R-FCN), ROI Pooling takes the region of the feature map that corresponds to a candidate box's position coordinates and pools it into a fixed-size feature map, which is then used for classification and bounding-box regression. Because the position of a candidate box is usually produced by model regression, it is generally a floating-point number, whereas the pooled feature map must have a fixed size. ROI Pooling therefore involves two quantization steps. First, the boundaries of the candidate box are quantized to integer coordinates. Second, the quantized region is divided evenly into K x K cells (bins), and the boundaries of each cell are quantized as well.
After these two quantization steps, the candidate box has drifted from the position that was originally regressed, and this deviation affects the accuracy of detection or segmentation. The authors of the paper describe this as the misalignment problem.
Here is an intuitive example of this region mismatch. Figure 1 shows a Faster R-CNN detection framework. The input is an 800*800 image containing a 665*665 bounding box (framing a dog). After the backbone network extracts features, the feature map has a stride of 32, so the side lengths of both the image and the bounding box become 1/32 of their input values. 800 divides evenly by 32 to give 25, but 665 divided by 32 gives 20.78, which has a fractional part, so ROI Pooling quantizes it directly to 20. The next step pools the box to a 7*7 output, so the bounding box is divided evenly into 7*7 rectangular cells. Each cell then has a side length of 2.86, again with a fractional part, and ROI Pooling quantizes it to 2. After these two quantization steps, the candidate region shows a clear deviation (the green part of the figure). More importantly, a deviation of 0.1 pixels on this feature map corresponds to 3.2 pixels in the original image, so the 0.78 lost in the first step alone amounts to about 25 pixels in the original image. This difference should not be underestimated.
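The arithmetic of this example can be traced in a few lines of Python. This is only a sketch of the simplified description above (floor the box side, then floor the cell side); the function name and return values are made up for illustration, and real ROI Pooling implementations may round individual bin boundaries differently.

```python
import math

def roi_pool_quantization(box_side_px, stride, bins):
    """Trace the two rounding steps described in the text for one box side."""
    exact_side = box_side_px / stride      # side length on the feature map
    quant_side = math.floor(exact_side)    # first quantization
    exact_bin = quant_side / bins          # side length of one cell
    quant_bin = math.floor(exact_bin)      # second quantization
    covered = quant_bin * bins             # feature-map cells actually pooled
    return exact_side, quant_side, exact_bin, quant_bin, covered

exact_side, quant_side, exact_bin, quant_bin, covered = roi_pool_quantization(665, 32, 7)
# Deviation introduced by the first step, mapped back to input pixels.
first_error_px = (exact_side - quant_side) * 32
print(exact_side, quant_side, round(exact_bin, 2), quant_bin, covered, first_error_px)
```

Under these assumptions the 20.78-cell box ends up covered by only 7 * 2 = 14 cells per side, which is the shrinkage shown in the figure.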
Figure 1
2. The main ideas and specific methods of ROI Align
To address these drawbacks of ROI Pooling, the authors propose an improved method, ROI Align (Figure 2). The idea is simple: remove the quantization and use bilinear interpolation to obtain feature values at floating-point coordinates, which turns the entire feature aggregation process into a continuous operation. Note that the actual algorithm does not simply fill in coordinate points on the boundary of the candidate region and then pool them. It uses a more elegant, redesigned procedure, shown in Figure 3. First, traverse each candidate region and keep its floating-point boundaries unquantized. Second, divide the candidate region into K x K cells, again without quantizing the boundaries of each cell. Third, compute four fixed sampling positions in each cell, obtain the values at these four positions by bilinear interpolation, and then apply max pooling.
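The three steps can be sketched in NumPy for a single-channel feature map and a single candidate region. This is a minimal illustration, not the paper's or any library's implementation: the names `bilinear` and `roi_align` are made up, the ROI is assumed to be given already in feature-map coordinates, integer coordinates are assumed to sit exactly on feature-map pixels, and samples outside the map are clamped to its border.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at a floating-point (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)   # clamp to the map (an assumption)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out_size=7, samples=2):
    """roi = (y1, x1, y2, x2) in feature-map coordinates, kept as floats."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size    # cell sizes stay fractional
    bin_w = (x2 - x1) / out_size
    out = np.empty((out_size, out_size), dtype=feat.dtype)
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            # samples x samples regularly spaced points inside the cell
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            out[i, j] = max(vals)   # max pooling over the sampled values
    return out

feat = np.tile(np.arange(25.0), (25, 1))   # 25x25 map whose value equals x
pooled = roi_align(feat, (0.0, 0.0, 20.78, 20.78))
print(pooled.shape)   # (7, 7)
```

With `samples=2` each cell gets the four sampling points described above, and the 20.78-wide region from the earlier example is used without any rounding.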
To elaborate on the third step: the fixed positions are positions determined by a fixed rule within each rectangular cell (bin). For example, with one sampling point, it is the center of the cell. With four sampling points, they are the centers of the four equal sub-squares of the cell. The coordinates of these sampling points are usually floating-point numbers, so interpolation is needed to obtain their values. In the related experiments, the authors found that four sampling points gave the best performance, and that even a single sampling point performed comparably. In fact, ROI Align traverses fewer sampling points than ROI Pooling yet achieves better performance, largely because it solves the misalignment problem. It is worth mentioning that in my own experiments the benefit of ROI Align was less pronounced on the VOC2007 dataset than on COCO. My analysis is that COCO contains more small objects, and small objects are more affected by misalignment (for example, the same 0.5-pixel offset is negligible for a large object, but the resulting error is much larger for a small one).
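The small-object argument can be made concrete by measuring how much the same offset reduces the overlap (IoU) of a small box and a large box. The numbers below are assumptions chosen for illustration, not results from the experiments mentioned above: the 0.5 offset is taken to be on a stride-32 feature map (16 input pixels), and the box sizes of 32 and 512 pixels are arbitrary.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def iou_after_shift(side, shift):
    """IoU between a square box and the same box shifted horizontally."""
    box = (0.0, 0.0, side, side)
    moved = (shift, 0.0, side + shift, side)
    return iou(box, moved)

shift_px = 0.5 * 32                        # assumed: 0.5 cells at stride 32
small = iou_after_shift(32.0, shift_px)    # small object
large = iou_after_shift(512.0, shift_px)   # large object
print(round(small, 3), round(large, 3))
```

Under these assumptions the small box keeps only about a third of its overlap while the large box stays above 0.9, which matches the intuition in the text.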
Figure 2
Figure 3
3. Backpropagation of ROI Align
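The backpropagation formula for standard ROI Pooling is as follows:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \left[\, i = i^*(r,j) \,\right] \frac{\partial L}{\partial y_{rj}}$$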
Here, x_i is a pixel on the feature map before pooling; y_rj is the j-th point of the r-th candidate region after pooling; and i*(r,j) is the source of the value of y_rj (the coordinates of the point whose value was selected as the maximum during max pooling). As the formula shows, a gradient is propagated back to x_i only if some pooled point took its value from x_i in the forward pass (that is, i = i*(r,j)).
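The argmax routing can be sketched for a single cell as follows. The function names are hypothetical and the cell is assumed to be an integer-aligned window of a 2-D NumPy array; the point is only that the whole gradient of y_rj goes to the one location i*(r,j).

```python
import numpy as np

def max_pool_bin_forward(feat, y0, y1, x0, x1):
    """Max over feat[y0:y1, x0:x1]; also return the argmax location i*(r,j)."""
    window = feat[y0:y1, x0:x1]
    flat = int(np.argmax(window))
    dy, dx = divmod(flat, window.shape[1])
    return window[dy, dx], (y0 + dy, x0 + dx)

def max_pool_bin_backward(grad_feat, argmax, grad_out):
    """Accumulate dL/dy_rj into the single location i*(r,j)."""
    grad_feat[argmax] += grad_out

feat = np.arange(16.0).reshape(4, 4)
value, argmax = max_pool_bin_forward(feat, 0, 2, 1, 3)
grad_feat = np.zeros_like(feat)
max_pool_bin_backward(grad_feat, argmax, 2.5)
print(value, argmax)
```

Every other position in the window receives zero gradient from this cell, which corresponds to the indicator [i = i*(r,j)] in the formula.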
By analogy with ROI Pooling, backpropagation for ROI Align needs a slight modification. In ROI Align, x_{i*(r,j)} is a floating-point coordinate position (the sampling point computed during forward propagation). In the feature map before pooling, every point whose horizontal and vertical distances to x_{i*(r,j)} are both less than 1 should receive the gradient passed back from the corresponding point y_rj. The backpropagation formula for ROI Align is therefore:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \left[\, d\big(i,\, i^*(r,j)\big) < 1 \,\right] (1 - \Delta h)(1 - \Delta w)\, \frac{\partial L}{\partial y_{rj}}$$
In the formula, d(·) denotes the distance between two points, and Δh and Δw denote the differences between x_i and x_{i*(r,j)} in the vertical and horizontal coordinates. They form the bilinear interpolation coefficients that multiply the original gradient.
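For a single sampling point, this backward rule amounts to scattering the incoming gradient onto the (up to four) neighboring pixels with the bilinear weights (1 - Δh)(1 - Δw). The sketch below assumes a 2-D NumPy gradient buffer and a hypothetical function name; with max pooling over several sampling points, only the point that was selected in the forward pass would be passed to it.

```python
import numpy as np

def bilinear_backward(grad_feat, y, x, grad_out):
    """Scatter grad_out from the sample point (y, x) to its integer neighbors."""
    h, w = grad_feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            dh, dw = abs(y - yi), abs(x - xi)
            # Only pixels closer than 1 in both directions receive a gradient.
            if 0 <= yi < h and 0 <= xi < w and dh < 1 and dw < 1:
                grad_feat[yi, xi] += (1 - dh) * (1 - dw) * grad_out

grad_feat = np.zeros((4, 4))
bilinear_backward(grad_feat, 1.25, 2.5, 1.0)
print(grad_feat)
```

For a sample point inside the map the four weights sum to 1, so the total gradient is preserved and merely redistributed according to distance.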