This is an original work. Reprinting is permitted, but when reprinting please be sure to indicate the original source of the article, the author information, and this statement in the form of a hyperlink. Otherwise, legal liability will be pursued. http://underthehood.blog.51cto.com/2531780/658350
by Raysaint 2011/09/05
1 Overview
I studied the SIFT algorithm by combining paper [1] with Rob Hess's open-source SIFT code (I found that the source of OpenCV 2.3 also uses Rob Hess's SIFT code). The following is a summary:
In the field of computer vision, image matching is a key component of many problems, including object and scene recognition, 3D reconstruction from multiple images, stereo matching, and motion tracking. SIFT features are invariant to image rotation and scale changes, and are partially invariant to illumination changes and changes in camera viewpoint. The main steps of the SIFT algorithm for generating image features are as follows:
(1) Scale-space extremum detection: search over all scales and image locations. A difference-of-Gaussian function is used to identify potential interest points that are invariant to scale and rotation.
(2) Keypoint localization: at each candidate location, the position and scale are determined by fitting a detailed model. Keypoints are selected based on measures of their stability.
(3) Orientation assignment: one or more orientations are assigned to each keypoint location based on the local gradient directions of the image. All subsequent operations on the image data are performed relative to the orientation, scale, and location of the keypoints, thereby providing invariance to these transformations.
(4) Keypoint description: the local image gradients are measured at the selected scale in a neighborhood around each keypoint. These gradients are transformed into a representation that allows for significant local shape deformation and changes in illumination.
2 Detailed procedures
First, based on Rob Hess's code, here is a simplified structure describing a SIFT feature:
struct feature
{
    double x;                     /*!< x coordinate on the input image */
    double y;                     /*!< y coordinate on the input image */
    double scl;                   /*!< scale of the keypoint */
    double ori;                   /*!< orientation of the keypoint */
    int d;                        /*!< length of the descriptor */
    double descr[FEATURE_MAX_D];  /*!< descriptor */
    int r;                        /*!< row, in the pyramid layer image, at which the keypoint was detected */
    int c;                        /*!< column, in the pyramid layer image, at which the keypoint was detected */
    int octv;                     /*!< index of the group (octave) in which the keypoint was detected */
    int intvl;                    /*!< index of the layer (interval) in which the keypoint was detected */
    double subintvl;              /*!< interpolated offset of the interval at which the keypoint was detected */
    double scl_octv;              /*!< scale of the keypoint within its octave */
};
2.1 Scale space representation
The scale space of an image is represented as a function $L(x, y, \sigma)$, generated by convolving a variable-scale Gaussian function $G(x, y, \sigma)$ with the image $I(x, y)$; see T. Lindeberg's paper [2]. That is,
$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$
where $*$ denotes the convolution operation, and $G(x, y, \sigma)$ is:
$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}$
The SIFT algorithm proposes that keypoints at a certain scale can be detected by subtracting two adjacent Gaussian scale-space images, which yields a DoG (difference of Gaussian) response image $D(x, y, \sigma)$. Keypoints are then localized in both location space and scale space by searching for local extrema of the response image $D(x, y, \sigma)$, where
$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$
and k is the constant multiplicative factor between two adjacent scale spaces.
In fact, DoG is an approximation of LoG (the Laplacian of Gaussian). According to the heat conduction equation (note: here $t = \sigma^2$):
$\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G$
Applying a finite-difference approximation to the above equation gives:
$\sigma \nabla^2 G = \frac{\partial G}{\partial \sigma} \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}$
that is,
$G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\,\sigma^2 \nabla^2 G$
Since the constant factor $(k - 1)$ does not affect the positions of the extremum points, i.e., the locations of the keypoints, $D(x, y, \sigma)$ is an approximate representation of $\sigma^2 \nabla^2 G$; in other words, DoG is an approximation of LoG.
Next is how to compute the DoG. The first step is to construct the Gaussian pyramid of the image. The image pyramid is divided into O groups (a group is called an octave), and each group is divided into multiple layers. If the number of intervals is S, each group contains S + 3 layer images (S + 1 + 2: the extra 2 are one layer above and one below, because extrema are searched only in the middle S + 1 layers). The first layer of the next group is obtained by downsampling the third layer from the top of the previous group (the layer with index S, if layer indices start at 0); this is done to reduce the amount of convolution computation. The DoG images are then obtained by subtracting adjacent Gaussian scale-space images within each group of the Gaussian pyramid.
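As a rough sketch (not Rob Hess's actual code), the DoG pyramid can be computed from the Gaussian pyramid as follows; the image layout (row-major float images) and the names gauss_pyr and dog_pyr are assumptions made here for illustration:

#include <stdlib.h>

/* Sketch: build the DoG pyramid by subtracting adjacent layers of the Gaussian
   pyramid. gauss_pyr[o][i] is assumed to be a widths[o] x heights[o] float image
   stored row-major; octvs groups, intvls + 3 Gaussian layers per group, hence
   intvls + 2 DoG layers per group. */
float*** build_dog_pyr(float*** gauss_pyr, int octvs, int intvls,
                       const int* widths, const int* heights)
{
    float*** dog_pyr = malloc(octvs * sizeof(float**));
    for (int o = 0; o < octvs; o++) {
        int n = widths[o] * heights[o];
        dog_pyr[o] = malloc((intvls + 2) * sizeof(float*));
        for (int i = 0; i < intvls + 2; i++) {
            dog_pyr[o][i] = malloc(n * sizeof(float));
            for (int p = 0; p < n; p++)   /* D(x,y,sigma) = L(x,y,k*sigma) - L(x,y,sigma) */
                dog_pyr[o][i][p] = gauss_pyr[o][i + 1][p] - gauss_pyr[o][i][p];
        }
    }
    return dog_pyr;
}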
In the Gaussian pyramid, the relationship between σ and (o, s) is as follows:
$\sigma(o, s) = \sigma_0 \cdot 2^{\,o + s/S}$
where $\sigma_0$ is the base scale, o is the index of the group (octave), and s is the layer index of the image within the group.
The initial image is assumed to already carry a Gaussian blur of σ = 0.5 (in order to prevent significant aliasing). From the above formula it can be seen that the constant k appearing in the DoG expression is $k = 2^{1/S}$.
Before the Gaussian pyramid is built, the input image must be pre-blurred to serve as layer 0 of group 0; this effectively discards the highest spatial sampling rate. Therefore, the usual practice is to first enlarge the image to twice its original size to create group 0. We assume the initial input image already carries a Gaussian blur of $\sigma_n = 0.5$ in order to counter aliasing; if the input image is enlarged to twice its size by bilinear interpolation, this is equivalent to $\sigma_n = 1.0$. Therefore, the additional blur that actually needs to be applied to the doubled image to reach the base scale $\sigma_0$ is
$\sqrt{\sigma_0^2 - (2\sigma_n)^2}$
This uses the semigroup property of Gaussian convolution, namely:
$g(x, y; t_1) * g(x, y; t_2) = g(x, y; t_1 + t_2)$
where $t = \sigma^2$.
In the actual programming implementation, the layers of a group are computed incrementally: layer s is obtained by blurring layer s - 1 with an incremental Gaussian of standard deviation
$\sigma_{inc}(s) = \sqrt{(\sigma_0 k^{s})^2 - (\sigma_0 k^{s-1})^2}$
so that, by the semigroup property above, the total blur of layer s equals $\sigma_0 k^{s}$.
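A minimal sketch of this incremental computation, in the spirit of Rob Hess's pyramid-building code (the function and array names here are my own):

#include <math.h>

/* Sketch: per-layer incremental Gaussian sigmas for one octave.
   Layer i is obtained by blurring layer i-1 with sig[i], so that the total
   blur of layer i equals sigma0 * k^i (Gaussian semigroup property). */
void compute_interval_sigmas(double sigma0, int intvls, double* sig /* size intvls+3 */)
{
    double k = pow(2.0, 1.0 / intvls);
    sig[0] = sigma0;
    for (int i = 1; i < intvls + 3; i++) {
        double sig_prev  = sigma0 * pow(k, i - 1);  /* total blur of previous layer  */
        double sig_total = sig_prev * k;            /* desired total blur of layer i */
        sig[i] = sqrt(sig_total * sig_total - sig_prev * sig_prev);
    }
}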
In D. Lowe's paper and Rob Hess's code, the default parameters are:
$\sigma_n = 0.5$, $\sigma_0 = 1.6$, $S = 3$; the number of groups O is obtained by taking the base-2 logarithm of the smaller of the image width and height and subtracting 2 (this ensures that the images of the last group still have at least a few pixels along their smaller side).
Here is an example of a scale space representation:
2.2 Local extremum detection in scale space (initial detection of keypoints)
The initial detection of interest points is accomplished by comparing adjacent layers of the DoG images within the same group. To find the extremum points of the scale space, each sample point is compared with all of its neighbors to see whether it is larger or smaller than its neighbors in both the image domain and the scale domain. As shown in the following:
The detection point (marked X) is compared with its 8 neighbors at the same scale and the 9 × 2 points at the corresponding positions in the adjacent scales above and below (26 neighbors in total), to ensure that extremum points are detected in both scale space and the two-dimensional image location space. The search starts at the second layer of each group, i.e., the layer with index 1, and ends at the layer with index S.
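The 26-neighbor comparison can be expressed roughly as in the sketch below; pixval() is a hypothetical accessor that returns the DoG value at the given octave, layer, row, and column:

/* Sketch of the 26-neighbor extremum test. pixval() is a hypothetical accessor
   returning the DoG value at (octv, intvl, row, col). */
extern double pixval(int octv, int intvl, int r, int c);

int is_extremum(int octv, int intvl, int r, int c)
{
    double val = pixval(octv, intvl, r, c);
    int is_max = 1, is_min = 1;

    for (int i = -1; i <= 1; i++)          /* adjacent scales  */
        for (int j = -1; j <= 1; j++)      /* adjacent rows    */
            for (int k = -1; k <= 1; k++)  /* adjacent columns */
            {
                if (i == 0 && j == 0 && k == 0)
                    continue;
                double v = pixval(octv, intvl + i, r + j, c + k);
                if (v >= val) is_max = 0;
                if (v <= val) is_min = 0;
            }
    return is_max || is_min;
}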
Here is the result of the local extremum test:
2.3 Interpolation of local extremum positions (precise positioning of key points)
The extremum search above is carried out in a discrete space, so a detected extremum point is not a true extremum in the continuous sense. The figure demonstrates the difference between the extremum points of a two-dimensional function sampled in discrete space and its extremum points in continuous space. Interpolating the continuous-space extremum from the known discrete-space extremum points is called sub-pixel interpolation.
The interpolation method is based on the Taylor series expansion of the DoG function; the detailed derivation is given in [3], and only the conclusions are stated here:
The offset of the extremum position (with position vector $\mathbf{x} = (x, y, \sigma)^T$) is:
$\hat{\mathbf{x}} = -\left(\frac{\partial^2 D}{\partial \mathbf{x}^2}\right)^{-1} \frac{\partial D}{\partial \mathbf{x}}$
and the value of D at the extremum is
$D(\hat{\mathbf{x}}) = D + \frac{1}{2}\left(\frac{\partial D}{\partial \mathbf{x}}\right)^{T}\hat{\mathbf{x}}$
Here $\hat{\mathbf{x}}$ represents an offset relative to the interpolation center. When the offset is greater than 0.5 in any dimension (x, y, or σ), it means the interpolation center has shifted to a neighboring sample point, so the position of the current keypoint must be changed (in practice the offset is added and rounded to the nearest integer), and the interpolation is repeated at the new position until it converges. The iteration may also exceed the preset maximum number of iterations or go beyond the image boundary, in which case the point should be deleted. In addition, when the absolute value $|D(\hat{\mathbf{x}})|$ is less than 0.03 (assuming the image grayscale values are normalized to the range 0 to 1.0), the response value is too small; such points are easily disturbed by noise and become unstable, so they are also deleted.
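As an illustration (not the exact code), one interpolation-and-rejection pass might look like the sketch below; interp_step() and interp_contr() are hypothetical helpers that solve for the offset and evaluate the interpolated contrast from finite differences of the DoG pyramid, and the image-border check is omitted:

#include <math.h>

#define MAX_INTERP_STEPS 5     /* maximum number of interpolation iterations */
#define CONTR_THR        0.03  /* contrast threshold on |D(x_hat)|           */

/* Hypothetical helpers (assumptions for this sketch):
   interp_step()  solves x_hat = -H^-1 * dD at (octv, intvl, r, c) and returns the
                  offsets along scale (xi), row (xr) and column (xc);
   interp_contr() returns D + 0.5 * dD^T * x_hat, the interpolated contrast. */
extern void   interp_step(int octv, int intvl, int r, int c,
                          double* xi, double* xr, double* xc);
extern double interp_contr(int octv, int intvl, int r, int c,
                           double xi, double xr, double xc);

/* Sketch: refine an extremum; returns 1 if the point is kept, 0 if discarded.
   (The check that r and c stay inside the image is omitted here.) */
int refine_extremum(int octv, int* intvl, int* r, int* c, int intvls)
{
    double xi = 0, xr = 0, xc = 0;
    int i = 0;

    while (i < MAX_INTERP_STEPS) {
        interp_step(octv, *intvl, *r, *c, &xi, &xr, &xc);
        if (fabs(xi) < 0.5 && fabs(xr) < 0.5 && fabs(xc) < 0.5)
            break;                       /* converged                     */
        *intvl += (int)round(xi);        /* move the interpolation center */
        *r     += (int)round(xr);
        *c     += (int)round(xc);
        if (*intvl < 1 || *intvl > intvls)
            return 0;                    /* left the searchable layers    */
        i++;
    }
    if (i >= MAX_INTERP_STEPS)
        return 0;                        /* did not converge              */

    /* reject low-contrast (unstable) points */
    if (fabs(interp_contr(octv, *intvl, *r, *c, xi, xr, xc)) < CONTR_THR)
        return 0;
    return 1;
}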
The following is the result after removing keypoints with small absolute response values:
2.4 Eliminating edge responses (further filtering of keypoints)
Because the DoG has a strong response to edges in the image, once a feature falls on an edge of the image it is an unstable point. From the Harris corner detector [4] we know that for a corner point, shifting a local window in any direction should cause a drastic change of the pixel values inside the window, whereas for a point on an edge, moving along the edge direction causes essentially no change of the pixel values inside the local window. As shown in the following:
Similarly, a poorly defined (flat) peak of the DoG response tends to have a large principal curvature across the edge and a small principal curvature in the perpendicular direction. The principal curvatures can be obtained from the 2 × 2 Hessian matrix H:
$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$
The eigenvalues of H are proportional to the principal curvatures of D. We can avoid computing the eigenvalues explicitly, because we only care about their ratio. Let α be the larger eigenvalue and β the smaller one.
Then
$\mathrm{Tr}(H) = D_{xx} + D_{yy} = \alpha + \beta$
$\mathrm{Det}(H) = D_{xx} D_{yy} - (D_{xy})^2 = \alpha \beta$
If γ is the ratio of the larger eigenvalue to the smaller one, i.e. $\alpha = \gamma \beta$, then
$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} = \frac{(\alpha + \beta)^2}{\alpha\beta} = \frac{(\gamma + 1)^2}{\gamma}$
The result of this formula depends only on the ratio of the two eigenvalues, not on their specific values. It is smallest when the two eigenvalues are equal, and it increases as γ increases. Therefore, to check whether the ratio of the principal curvatures is below some threshold γ, we only need to check whether the following holds:
$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(\gamma + 1)^2}{\gamma}$
D. Lowe uses γ = 10 in his paper; feature points whose ratio of principal curvatures is greater than 10 are deleted, otherwise they are retained. The following is the result after removing edge responses:
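A minimal sketch of this edge test (again using the hypothetical pixval() accessor); it computes the 2 × 2 Hessian by finite differences and applies the trace/determinant check:

/* Sketch: reject edge-like points using the ratio of principal curvatures.
   pixval() is a hypothetical accessor returning the DoG value at (octv, intvl, row, col).
   Returns 1 if the point looks like an edge and should be discarded. */
extern double pixval(int octv, int intvl, int r, int c);

int is_too_edge_like(int octv, int intvl, int r, int c, double curv_thr /* gamma, e.g. 10 */)
{
    /* 2x2 Hessian of D computed with finite differences */
    double d   = pixval(octv, intvl, r, c);
    double dxx = pixval(octv, intvl, r, c + 1) + pixval(octv, intvl, r, c - 1) - 2 * d;
    double dyy = pixval(octv, intvl, r + 1, c) + pixval(octv, intvl, r - 1, c) - 2 * d;
    double dxy = (pixval(octv, intvl, r + 1, c + 1) - pixval(octv, intvl, r + 1, c - 1)
                - pixval(octv, intvl, r - 1, c + 1) + pixval(octv, intvl, r - 1, c - 1)) / 4.0;

    double tr  = dxx + dyy;              /* alpha + beta */
    double det = dxx * dyy - dxy * dxy;  /* alpha * beta */

    if (det <= 0)                        /* curvatures have different signs: edge-like */
        return 1;
    return tr * tr / det >= (curv_thr + 1) * (curv_thr + 1) / curv_thr;
}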
At this point, the positions of the feature points have been determined. Suppose a feature point has been determined at position $(x_k, y_k)$ on $D(x, y, \sigma(o, s))$, with interpolated offset $\hat{\mathbf{x}} = (x_o, y_o, s_o)$ (note: the derivative with respect to the scale σ is evaluated in the DoG pyramid by a central difference between layers of the same group, i.e., the layer s + 1 image minus the layer s - 1 image, divided by 2, so $s_o$ is used in place of $\sigma_o$). The fields of the SIFT feature structure described earlier are then populated as follows:
feat.x = (x_k + x_o) × 2^o
feat.y = (y_k + y_o) × 2^o
feat.r = y_k
feat.c = x_k
feat.octv = o
feat.intvl = s
feat.subintvl = s_o
(Note: if the size of the initial image was doubled, then feat.x and feat.y must additionally be divided by 2.)
2.5 Scale calculation of keypoints
This step calculates the scale of the keypoints according to the σ(o, s) relationship given in section 2.1, as follows:
feat.scl = σ0 × 2^(feat.octv + (feat.intvl + feat.subintvl)/S)
feat.scl_octv = σ0 × 2^((feat.intvl + feat.subintvl)/S)
(Note: if the size of the initial image was doubled, then feat.scl is additionally divided by 2.)
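For illustration, the scale fields might be filled in as follows, assuming the struct feature given earlier; sigma0, intvls, and img_dbl are parameters introduced here only for this sketch:

#include <math.h>

/* Sketch: fill in the scale fields from the formula above (sigma0 = 1.6, intvls = S).
   img_dbl indicates whether the input image was doubled before building the pyramid. */
void calc_feature_scale(struct feature* feat, double sigma0, int intvls, int img_dbl)
{
    double intvl = feat->intvl + feat->subintvl;
    feat->scl      = sigma0 * pow(2.0, feat->octv + intvl / intvls);
    feat->scl_octv = sigma0 * pow(2.0, intvl / intvls);
    if (img_dbl)
        feat->scl /= 2.0;    /* compensate for the initial doubling */
}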
2.6 Orientation assignment of feature points
To achieve invariance to image rotation, an orientation reference is obtained from the local image structure around each detected feature point. The image gradient is used to find a stable direction of the local structure. For a detected feature point, we know its exact position and scale in the DoG pyramid (given by feat.r, feat.c, feat.octv, feat.intvl, and feat.scl_octv). On the corresponding Gaussian image L(x, y, σ(feat.octv, feat.intvl)), finite differences are used to compute the angle and magnitude (modulus) of the image gradient in a region centered on the feature point with radius 3 × 1.5 × feat.scl_octv. As shown, the gradient angle and magnitude are computed as follows:
$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$
$\theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)$
After the gradients of the Gaussian image in the neighborhood of the feature point have been computed, the gradient orientations and magnitudes of the pixels in the neighborhood are accumulated into a histogram. The horizontal axis of the gradient orientation histogram is the gradient orientation angle, and the vertical axis is the sum of the gradient magnitudes falling into each orientation bin. The histogram divides the range of 0-360 degrees into 36 bins, one bin per 10 degrees. The peak of the histogram represents the dominant direction of the image gradients within the neighborhood, which is taken as the main orientation of the feature point, as shown:
The gradient magnitude of each sample point accumulated into the orientation histogram is weighted with a circular Gaussian weighting function whose standard deviation σ is 1.5 × feat.scl_octv (the neighborhood radius given above follows from the 3σ rule). As shown, because the SIFT algorithm only considers scale and rotation invariance and not affine invariance, this Gaussian weighting gives larger weight to the gradient magnitudes near the feature point, which partially compensates for the instability caused by the lack of affine invariance.
In addition, when another histogram peak reaches 80% of the energy of the main peak, its direction is taken as a secondary orientation of the feature point. A feature point may therefore be assigned multiple orientations (one main orientation and one or more secondary orientations), which enhances the robustness of matching, as shown. In a practical implementation, the feature point is copied into multiple feature points and the different orientation values are assigned to the copies; the discrete gradient orientation histogram is also fitted by interpolation (e.g., a parabola through the peak and its two neighbors) to obtain a more accurate orientation angle.
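A sketch of how the 36-bin orientation histogram might be accumulated; calc_grad_mag_ori() is a hypothetical helper that returns the finite-difference gradient magnitude and angle of the Gaussian image (or 0 near the image border):

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
#define ORI_HIST_BINS 36

/* Hypothetical helper: gradient magnitude and orientation of the Gaussian image
   at (r, c), computed with finite differences; returns 0 near the image border. */
extern int calc_grad_mag_ori(int octv, int intvl, int r, int c,
                             double* mag, double* ori);

/* Sketch: 36-bin gradient orientation histogram around a keypoint.
   scl_octv is the keypoint's scale within its octave. */
void ori_hist(int octv, int intvl, int r, int c, double scl_octv,
              double hist[ORI_HIST_BINS])
{
    double sigma = 1.5 * scl_octv;             /* Gaussian weighting sigma      */
    int    rad   = (int)(3.0 * sigma + 0.5);   /* neighborhood radius (3-sigma) */
    double denom = 2.0 * sigma * sigma;

    for (int b = 0; b < ORI_HIST_BINS; b++)
        hist[b] = 0.0;

    for (int i = -rad; i <= rad; i++)
        for (int j = -rad; j <= rad; j++) {
            double mag, ori;
            if (!calc_grad_mag_ori(octv, intvl, r + i, c + j, &mag, &ori))
                continue;
            double w   = exp(-(i * i + j * j) / denom);      /* Gaussian weight */
            int    bin = (int)(ORI_HIST_BINS * (ori + M_PI) / (2.0 * M_PI));
            if (bin >= ORI_HIST_BINS) bin = 0;               /* wrap ori = +pi  */
            hist[bin] += w * mag;
        }
    /* The histogram peak gives the dominant orientation; any other bin above 80%
       of the peak yields an additional (secondary) orientation, and the peak
       position is refined by parabolic interpolation over adjacent bins. */
}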
The following is the result of orientation and scale assignment for the feature points (the arrow direction indicates the orientation, and its length represents the scale):
2.7 Feature descriptor representation (feature vector generation for feature points)
The SIFT descriptor h(x, y, θ) is a representation of the statistics of the Gaussian image gradients in the neighborhood of a feature point. It is a three-dimensional array, but is usually represented as a (one-dimensional) vector in a programming implementation. The descriptor is related to the scale of the feature point, so the gradients should be evaluated on the Gaussian image corresponding to the feature point's scale. The neighborhood around the feature point is divided into bp × bp sub-regions, each of size m × feat.scl_octv pixels, where m = 3 and bp = 4. Because bilinear interpolation is required in the actual computation, the computed image area has side length m × feat.scl_octv × (bp + 1). If rotation is also taken into account, the side length of the actually computed image area should be m × feat.scl_octv × (bp + 1) × √2. As shown in the following:
To ensure rotation invariance of the feature vector, the positions and directions of the image gradients within the neighborhood of the feature point (of size m × feat.scl_octv × (bp + 1) × √2 by m × feat.scl_octv × (bp + 1) × √2) are rotated by the feature point's orientation feat.ori about the feature point's center. As shown in the following:
After the positions and angles of the neighborhood image gradients have been rotated, an image area of size (m × feat.scl_octv × bp) by (m × feat.scl_octv × bp) centered on the feature point is taken and divided into bp × bp sub-regions, each spanning m × feat.scl_octv pixels. Within each sub-region, an 8-direction gradient orientation histogram is computed, and the accumulated value of each gradient direction forms one seed point. Unlike the histogram used for the feature point's main orientation, the gradient orientation histogram of each sub-region divides the range of 0-360 degrees into 8 direction bins of 45 degrees each, so each seed point carries gradient strength information for 8 directions. Since there are 4 × 4 (bp × bp) sub-regions, there are 4 × 4 × 8 = 128 values in total, i.e., the length of the feature vector is feat.d = 128. As shown in the following:
As before, the contributions to the feature vector are Gaussian weighted, using a Gaussian function with standard deviation m × feat.scl_octv × bp / 2. After the feature vector has been formed, it must be processed to remove the influence of illumination changes: the vector is normalized to unit length; after normalization, values of the feature vector greater than 0.2 are truncated (i.e., any value greater than 0.2 is set to 0.2), and the vector is normalized again. The purpose is to improve the distinctiveness of the features. A minimal sketch of this step is given below.
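This is the normalize-truncate-renormalize step, sketched here under the assumption of a 128-element descriptor array:

#include <math.h>

/* Sketch: normalize the 128-dimensional descriptor, clip entries above 0.2 to
   reduce the influence of large gradient magnitudes (illumination changes),
   then renormalize. descr has length d (here d = 128). */
void normalize_descr(double* descr, int d)
{
    double len_sq = 0.0;
    for (int i = 0; i < d; i++)
        len_sq += descr[i] * descr[i];
    double len_inv = 1.0 / sqrt(len_sq);
    for (int i = 0; i < d; i++)
        descr[i] *= len_inv;                 /* first normalization       */

    for (int i = 0; i < d; i++)
        if (descr[i] > 0.2)
            descr[i] = 0.2;                  /* truncate large components */

    len_sq = 0.0;
    for (int i = 0; i < d; i++)
        len_sq += descr[i] * descr[i];
    len_inv = 1.0 / sqrt(len_sq);
    for (int i = 0; i < d; i++)
        descr[i] *= len_inv;                 /* renormalize to unit length */
}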
3 Questions
(1) Whenever the neighborhood radius of a feature point needs to be determined, the scale used is not the full scale σ(o, s) (i.e., feat.scl), but the scale computed with the group index o removed (i.e., feat.scl_octv). Why is this?
(2) Points whose absolute value |D(x̂)| is less than 0.03 are supposed to be discarded, but in Rob Hess's SIFT code points with absolute value less than 0.03/S are in fact discarded. Why is this contrast threshold divided by S?
4 References
[1] D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", January 5, 2004.
[2] T. Lindeberg, "Scale-space theory: a basic tool for analysing structures at different scales".
[3] Wang Yongming, Wang Guijin, "Image Local Invariant Features and Descriptors", National Defense Industry Press.
[4] C. Harris and M. Stephens (1988). "A combined corner and edge detector". Proceedings of the 4th Alvey Vision Conference, pp. 147-151.
This article is from the "Underthehood" blog; please be sure to keep this source when reprinting: http://underthehood.blog.51cto.com/2531780/658350