Transferred from: http://blog.csdn.net/zddmail/article/details/7521424
When I first studied SIFT, I felt this article truly deserved the words "detailed explanation", so I am reposting it here: a detailed analysis of the scale-invariant feature transform matching algorithm.
Scale-Invariant Feature Transform (SIFT)
Just for Fun
zdd zddmail@gmail.com
For beginners, there are many gaps between David G. Lowe's paper and a working implementation; this article is meant to help you across them.
1. SIFT Overview
The scale-invariant feature transform (SIFT) is a computer vision algorithm used to detect and describe local features in images. It looks for extreme points in scale space and extracts their position, scale, and rotation invariants. The algorithm was published by David Lowe in 1999 and refined in 2004.
Its applications include object recognition, robotic mapping and navigation, image stitching, 3D model building, gesture recognition, video tracking, and match moving.
The algorithm is patented; the patent holder is the University of British Columbia.
The description and detection of local image features help with object recognition. SIFT features are based on points of interest in the local appearance of an object and are independent of the image's size and rotation. Their tolerance to changes in lighting, noise, and small changes in viewpoint is also quite high. Thanks to these properties they are highly distinctive, relatively easy to extract, and allow objects to be identified with a low probability of mismatch even in a feature database containing a large number of entries. The detection rate under partial object occlusion is also quite high; as few as three SIFT features can be enough to compute an object's position and orientation. With current computer hardware and a small feature database, recognition speed can approach real time. SIFT features also carry a large amount of information, which makes them suitable for fast and accurate matching against massive databases.
The main characteristics of the SIFT algorithm are:
1. SIFT features are local features of an image; they are invariant to rotation, scale change, and brightness change, and remain stable to a certain degree under viewpoint change, affine transformation, and noise;
2. Distinctiveness: the features are rich in information and suited to fast, accurate matching in massive feature databases;
3. Quantity: even a few objects can produce a large number of SIFT feature vectors;
4. Speed: an optimized SIFT matching algorithm can even meet real-time requirements;
5. Extensibility: they can easily be combined with other kinds of feature vectors.
Problems the SIFT algorithm can help solve:
The performance of image registration and target recognition/tracking is affected by factors such as the state of the target, the environment of the scene, and the imaging characteristics of the imaging equipment. The SIFT algorithm can, to some extent, cope with:
1. Rotation, scaling, panning (RST) for the target
2. Affine/projective transformations of the image (viewpoint changes)
3. Lighting effects (illumination)
4. Target occlusion (occlusion)
5. Cluttered scenes (clutter)
6. Noise
The essence of the SIFT algorithm is to find key points (feature points) in different scale spaces and to compute the orientation of each key point. The key points SIFT finds are highly prominent points that do not change due to factors such as lighting, affine transformation, or noise, for example corner points, edge points, bright points in dark regions, and dark points in bright regions.
Lowe decomposes the SIFT algorithm into the following four steps:
1. Scale-space extremum detection: search over image locations at all scales. Potential interest points that are invariant to scale and orientation are identified using a difference-of-Gaussian function.
2. Key point localization: at each candidate location, a finely fitted model is used to determine position and scale. Key points are selected according to their degree of stability.
3. Orientation assignment: one or more orientations are assigned to each key point location based on the local gradient directions of the image. All subsequent operations on the image data are performed relative to the key point's orientation, scale, and location, which provides invariance to these transformations.
4. Key point description: within a neighborhood around each key point, the local gradients of the image are measured at the selected scale. These gradients are transformed into a representation that tolerates fairly large local shape deformations and illumination changes.
This article follows Lowe's steps and, referring to the source code of Rob Hess and Andrea Vedaldi, describes the implementation process of the SIFT algorithm in detail.
2. Gaussian Blur
The SIFT algorithm looks for key points in different scale spaces, and the scale space is obtained through Gaussian blur. Lindeberg and others have proved that the Gaussian convolution kernel is the only transformation kernel that realizes scale transformation, and the only linear kernel. This section first describes the Gaussian blur algorithm.
2.1 The Two-Dimensional Gaussian Function
Gaussian blur is an image filter that uses the normal distribution (Gaussian function) to compute a blur template and convolves this template with the original image in order to blur the image.
The N-dimensional normal distribution equation is:
G(r) = 1 / ((2πσ²)^(N/2)) · e^(−r² / (2σ²))    (1-1)
Here σ is the standard deviation of the normal distribution: the larger its value, the more blurred (smoother) the image. r is the blur radius, i.e. the distance from a template element to the center of the template. If the two-dimensional template has size m*n, the Gaussian formula for element (x, y) on the template is:
G(x, y) = 1 / (2πσ²) · e^(−((x − m/2)² + (y − n/2)²) / (2σ²))    (1-2)
In two-dimensional space, the contours of the surface generated by this formula are concentric circles whose values follow a normal distribution outward from the center, as shown in Figure 2.1. A convolution matrix built from the elements with nonzero values is convolved with the original image, so that the value of each pixel becomes a weighted average of its neighboring pixel values. The original pixel has the largest Gaussian value and therefore the largest weight, and neighboring pixels receive smaller and smaller weights the farther they are from the original pixel. Blurring in this way preserves edges better than other uniform (averaging) blur filters.
In theory the Gaussian distribution is nonzero at every point of the image, which would mean that the computation for each pixel must include the entire image. In practice, when computing a discrete approximation of the Gaussian function, pixels beyond a distance of about 3σ can be treated as having no effect and can be ignored. Usually, an image-processing program only needs to compute a matrix of about (6σ+1)×(6σ+1) to cover the relevant pixel influence.
2.2 Two-Dimensional Gaussian Blur of an Image
According to the value of σ, the size of the Gaussian template matrix (typically (6σ+1)×(6σ+1)) is determined, formula (1-2) is used to compute the values of the Gaussian template matrix, and the template is convolved with the original image to obtain a smoothed (Gaussian-blurred) image. To ensure that the elements of the template matrix lie in [0,1] and sum to 1, the template matrix must be normalized. A 5*5 Gaussian template is shown in Table 2.1.
The figure below is a schematic of the convolution calculation with a 5*5 Gaussian template. Note that the Gaussian template is centrally symmetric.
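As a small illustration of this step (this is not the code from the appendix, and it assumes OpenCV is available; the file name and function name are placeholders of mine), the following sketch builds a normalized Gaussian template per formula (1-2), taking the template center at ((m−1)/2, (n−1)/2), and convolves it with an image:

    // Sketch: build a normalized Gaussian template per formula (1-2) and convolve it
    // with an image. Assumes OpenCV; names are illustrative only.
    #include <opencv2/opencv.hpp>
    #include <cmath>

    cv::Mat gaussianTemplate(int m, int n, double sigma)
    {
        cv::Mat kernel(m, n, CV_64F);
        for (int x = 0; x < m; ++x)
            for (int y = 0; y < n; ++y)
            {
                // distance of element (x, y) from the template center
                double dx = x - (m - 1) / 2.0, dy = y - (n - 1) / 2.0;
                kernel.at<double>(x, y) =
                    std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) /
                    (2.0 * CV_PI * sigma * sigma);
            }
        kernel = kernel / cv::sum(kernel)[0];   // normalize so the weights sum to 1
        return kernel;
    }

    int main()
    {
        cv::Mat src = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);
        if (src.empty()) return -1;
        cv::Mat dst;
        cv::filter2D(src, dst, -1, gaussianTemplate(5, 5, 1.0));  // 5*5 template as in Table 2.1
        cv::imwrite("blurred.jpg", dst);
        return 0;
    }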
2.3 Separable Gaussian Blur
As shown in Figure 2.3, a two-dimensional Gaussian template achieves the desired blurring effect, but because the template matrix extends beyond the image border, pixels are lost at the image edges (Figure 2.3 b, c), and the larger the template, the more pixels are lost; dropping part of the template at the border instead produces a black edge (Figure 2.3 d). More importantly, as the Gaussian template (Gaussian kernel) grows, the cost of the convolution operation increases dramatically. The two-dimensional Gaussian blur can be improved by exploiting the separability of the Gaussian function.
Separability of the Gaussian function means that the effect obtained with the two-dimensional matrix transform can also be obtained by applying a one-dimensional Gaussian matrix transform in the horizontal direction followed by a one-dimensional Gaussian matrix transform in the vertical direction. From a computational point of view this is a very useful property, because it requires only O((m + n)·M·N) computations, whereas the non-separable two-dimensional matrix requires O(m·n·M·N) computations, where m, n are the dimensions of the Gaussian matrix and M, N are the dimensions of the two-dimensional image.
In addition, the two one-dimensional Gaussian convolutions eliminate the edge artifacts produced by the two-dimensional Gaussian matrix.
Appendix 1 gives a two-dimensional Gaussian blur and a separable Gaussian blur implemented with OpenCV 2.2. Table 2.2 compares these two methods with the Gaussian blur routine of the OpenCV 2.3 open-source library.
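By way of illustration (this is not the code from Appendix 1, and it again assumes OpenCV), a separable Gaussian blur can be sketched as two one-dimensional passes and compared against the full two-dimensional convolution and OpenCV's own cv::GaussianBlur:

    // Sketch of separable Gaussian blur: a horizontal 1-D pass followed by a vertical
    // 1-D pass gives the same result as the full 2-D convolution at much lower cost.
    #include <opencv2/opencv.hpp>
    #include <iostream>

    int main()
    {
        cv::Mat src = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);
        if (src.empty()) return -1;

        double sigma = 1.6;
        int ksize = 2 * cvRound(3 * sigma) + 1;                    // cover about 3*sigma on each side

        cv::Mat k1d = cv::getGaussianKernel(ksize, sigma, CV_64F); // 1-D Gaussian kernel

        cv::Mat sep, full, ref;
        cv::sepFilter2D(src, sep, -1, k1d, k1d);                   // two 1-D convolutions
        cv::filter2D(src, full, -1, k1d * k1d.t());                // equivalent 2-D convolution
        cv::GaussianBlur(src, ref, cv::Size(ksize, ksize), sigma); // OpenCV's built-in routine

        std::cout << "max |separable - 2D| = "
                  << cv::norm(sep, full, cv::NORM_INF) << std::endl;
        return 0;
    }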
3. Scale-Space Extremum Detection
The scale space is represented with a Gaussian pyramid. Tony Lindeberg pointed out that the scale-normalized LoG (Laplacian of Gaussian) operator has true scale invariance; Lowe uses the difference-of-Gaussian pyramid to approximate the LoG operator and thereby detect stable key points in scale space.
3.1 Scale-Space Theory
The idea of scale space was first proposed by Iijima in 1962; after promotion by Witkin, Koenderink, and others it gradually gained attention and is now widely used in the field of computer vision.
The basic idea of scale-space theory is to introduce a parameter regarded as scale into the image information processing model, obtain a sequence of scale-space representations at multiple scales by continuously varying the scale parameter, extract the principal contours of the scale space, and use these principal contours as a feature vector to realize edge and corner detection and feature extraction at different resolutions.
The scale-space method places traditional single-scale image information processing into a dynamic-analysis framework, which makes it easier to obtain the essential features of an image. In scale space, the degree of blur of each image increases with scale, which simulates how a target forms on the retina as it moves from near to far.
The scale space satisfies visual invariance. The visual interpretation of this invariance is as follows. When we look at an object with our eyes, the brightness level and contrast of the image perceived by the retina differ when the illumination of the object's background changes, so the analysis of the image by scale-space operators is required to be unaffected by the gray level and contrast of the image, i.e. to satisfy gray-level invariance and contrast invariance. On the other hand, relative to a fixed coordinate system, when the relative position of the observer and the object changes, the position, size, angle, and shape of the image perceived by the retina differ, so scale-space operators are also required to be independent of the image's position, size, angle, and affine transformation, i.e. to satisfy translation invariance, scale invariance, Euclidean invariance, and affine invariance.
3.2 Representation of the Scale Space
The scale space L(x, y, σ) of an image is defined as the convolution of a variable-scale Gaussian function G(x, y, σ) with the original image I(x, y):
L(x, y, σ) = G(x, y, σ) * I(x, y)    (3-1)
where * denotes the convolution operation, and
G(x, y, σ) = 1 / (2πσ²) · e^(−((x − m/2)² + (y − n/2)²) / (2σ²))    (3-2)
As in formula (1-2), m, n represent the dimensions of the Gaussian template (determined by σ), and (x, y) is the pixel position in the image. σ is the scale-space factor: the smaller its value, the less the image is smoothed and the smaller the corresponding scale. A large scale corresponds to the coarse, overall features of an image, and a small scale corresponds to its fine details.
3.3 Construction of the Gaussian Pyramid
When the scale space is implemented, it is represented with a Gaussian pyramid. Building the Gaussian pyramid consists of two parts:
1. Applying Gaussian blur at different scales to the image;
2. Downsampling the image (reducing its size).
The image pyramid model refers to repeatedly downsampling the original image to obtain a series of images of different sizes, arranged from large to small, bottom to top, into a tower-shaped model. The original image is the first level of the pyramid, each new downsampled image forms one level of the pyramid (one image per level), and the pyramid has n levels in total. The number of levels is determined by the original size of the image and the size of the image at the top of the tower, and is calculated as follows:
n = log₂(min(M, N)) − t,  t ∈ [0, log₂(min(M, N))]    (3-3)
where M, N are the dimensions of the original image and t is the base-2 logarithm of the smallest dimension of the image at the top of the tower. For example, for an image of size 512*512, the sizes of the images in the pyramid are shown in Table 3.1: when the top image is 4*4, n = 7; when the top image is 2*2, n = 8.
In order for the scale to reflect its continuity, Gaussian filtering is added on top of simple downsampling. As shown in Figure 3.1, each image of the image pyramid is Gaussian-blurred with different parameters, so that each level of the pyramid contains multiple Gaussian-blurred images. The multiple images at one level of the pyramid are called a group (octave); each level of the pyramid has only one group of images, the number of groups equals the number of pyramid levels and is computed with formula (3-3), and each group contains multiple (also called interval) images. In addition, when downsampling, the initial (bottom) image of a group of the Gaussian pyramid is obtained by sampling the third-from-last image of the previous group.
Note: because multiple images are stacked within a group, the images within a group are also referred to as layers. To avoid confusion with the concept of pyramid levels, in this article "level" refers to a pyramid level unless otherwise stated, and "layer" generally refers to an image within a group.
Note: as explained in Section 3.4, in order to detect extrema at S scales in each group, the DoG pyramid needs S+2 layers of images per group; since the DoG pyramid is obtained by subtracting adjacent layers of the Gaussian pyramid, the Gaussian pyramid needs S+3 layers of images per group. In practice S is between 3 and 5. Taking S = 3, the Gaussian pyramid storage indices are as follows:
Group 0 (i.e. the 1st group): 0 1 2 3 4 5
Group 1: 6 7 8 9 10 11
Group 2: ...
The first image of group 2 is then obtained by downsampling the image with index 9 in group 1, and the other groups follow analogously.
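The following sketch (an illustration under these assumptions, not Lowe's or Rob Hess's code) builds such a Gaussian pyramid with S+3 images per group, starting each new group from the third-from-last image of the previous group downsampled by a factor of two. For simplicity it blurs each layer from the group's base image, rather than incrementally from the previous layer as formula (3-7) in Section 3.6 does:

    // Sketch of Gaussian pyramid construction with S + 3 images per group (octave).
    // Assumes OpenCV; the base image is taken to already carry a blur of sigma0.
    #include <opencv2/opencv.hpp>
    #include <cmath>
    #include <vector>

    std::vector<std::vector<cv::Mat>> buildGaussianPyramid(const cv::Mat& base,
                                                           int S = 3, double sigma0 = 1.6)
    {
        // Number of groups per formula (3-3); keep the top image at least 4*4 (t = 2).
        int nOctaves = cvRound(std::log2((double)std::min(base.cols, base.rows))) - 2;
        double k = std::pow(2.0, 1.0 / S);

        std::vector<std::vector<cv::Mat>> pyr(nOctaves);
        cv::Mat octaveBase = base.clone();
        for (int o = 0; o < nOctaves; ++o)
        {
            pyr[o].resize(S + 3);
            pyr[o][0] = octaveBase;
            for (int s = 1; s < S + 3; ++s)
            {
                double sigTotal = sigma0 * std::pow(k, s);            // scale of layer s
                double sigExtra = std::sqrt(sigTotal * sigTotal - sigma0 * sigma0);
                cv::GaussianBlur(octaveBase, pyr[o][s], cv::Size(), sigExtra, sigExtra);
            }
            // The next group starts from the third-from-last image (index S), at half size.
            cv::resize(pyr[o][S], octaveBase,
                       cv::Size(octaveBase.cols / 2, octaveBase.rows / 2),
                       0, 0, cv::INTER_NEAREST);
        }
        return pyr;
    }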
3.4 The Difference-of-Gaussian (DoG) Pyramid
In 2002, Mikolajczyk showed in a detailed experimental comparison that the maxima and minima of the scale-normalized Laplacian of Gaussian produce the most stable image features compared with other feature-extraction functions such as the gradient, the Hessian, or Harris corner features.
As early as 1994, Lindeberg found that the difference-of-Gaussian function (difference of Gaussian, the DoG operator for short) is very similar to the scale-normalized Laplacian of Gaussian. Their relationship can be derived from the following formula:
∂G/∂σ = σ∇²G
Using a finite difference to approximate the derivative:
σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)
Therefore:
G(x, y, kσ) − G(x, y, σ) ≈ (k − 1)·σ²∇²G
where (k − 1) is a constant and does not affect the location of the extremum points.
As shown in Figure 3.2, the red curve represents the difference-of-Gaussian operator and the blue curve the Laplacian of Gaussian operator. Lowe uses the more efficient difference-of-Gaussian operator in place of the Laplacian operator for extremum detection, defined as follows:
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y) = L(x, y, kσ) − L(x, y, σ)    (3-4)
In the actual computation, adjacent upper and lower layers within each group of the Gaussian pyramid are subtracted to obtain the difference-of-Gaussian images, as shown in Figure 3.3, and extremum detection is then performed on them.
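A minimal sketch of this step, continuing from the pyramid sketch in Section 3.3 (again an illustration, not the reference implementation), subtracts adjacent layers and stores the result as floating point so that negative differences are preserved:

    // Sketch: build the DoG pyramid by subtracting adjacent layers within each group,
    // per formula (3-4). S + 3 Gaussian images yield S + 2 DoG images per group.
    #include <opencv2/opencv.hpp>
    #include <vector>

    std::vector<std::vector<cv::Mat>> buildDoGPyramid(
        const std::vector<std::vector<cv::Mat>>& gauss)
    {
        std::vector<std::vector<cv::Mat>> dog(gauss.size());
        for (size_t o = 0; o < gauss.size(); ++o)
        {
            dog[o].resize(gauss[o].size() - 1);
            for (size_t s = 0; s + 1 < gauss[o].size(); ++s)
                cv::subtract(gauss[o][s + 1], gauss[o][s], dog[o][s],
                             cv::noArray(), CV_32F);   // float output keeps negative values
        }
        return dog;
    }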
3.5 Spatial Extremum Detection (Preliminary Detection of Key Points)
Key points are formed by the local extremum points of the DoG space, and the preliminary detection of key points is accomplished by comparing adjacent layers of the DoG images within the same group. To find the extremum points of the DoG function, each pixel is compared with all of its neighbors to check whether it is larger or smaller than its neighbors in both the image domain and the scale domain. As shown in Figure 3.4, the middle detection point is compared with its 8 neighbors at the same scale and the 9×2 corresponding points at the adjacent scales above and below, 26 points in total, which ensures that extremum points are detected in both scale space and the two-dimensional image space.
Because the comparison is made across adjacent scales, as shown in Figure 3.3, with 4 layers of difference-of-Gaussian images per group only the two middle layers allow extremum detection, at two scales; detection at other scales can only take place in other groups. Therefore, in order to detect extrema at S scales in each group, the DoG pyramid needs S+2 layers of images per group, and since the DoG pyramid is obtained by subtracting adjacent layers of the Gaussian pyramid, each group of the Gaussian pyramid needs S+3 layers of images. In practice S is between 3 and 5.
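As a sketch of this 26-neighbour test (illustrative only; it assumes CV_32F DoG images of equal size within a group, as produced by the sketch in Section 3.4):

    // A pixel at (r, c) in the middle DoG layer is a candidate key point only if it is
    // strictly greater than, or strictly less than, all 26 neighbours: 8 in its own
    // layer plus 9 each in the layers directly above and below.
    #include <opencv2/opencv.hpp>

    bool isScaleSpaceExtremum(const cv::Mat& below, const cv::Mat& cur,
                              const cv::Mat& above, int r, int c)
    {
        float v = cur.at<float>(r, c);
        bool isMax = true, isMin = true;
        const cv::Mat* layers[3] = { &below, &cur, &above };
        for (int i = 0; i < 3; ++i)
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                {
                    if (i == 1 && dr == 0 && dc == 0) continue;   // skip the pixel itself
                    float nb = layers[i]->at<float>(r + dr, c + dc);
                    isMax = isMax && v > nb;
                    isMin = isMin && v < nb;
                }
        return isMax || isMin;
    }
    // Usage: for each group, scan layers s = 1 .. S and pixels with r in [1, rows-2],
    // c in [1, cols-2], calling isScaleSpaceExtremum(dog[o][s-1], dog[o][s], dog[o][s+1], r, c).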
Of course, the extremum points produced this way are not all stable feature points, because some extremum points have weak responses, and the DoG operator produces strong edge responses.
3.6 Parameters to Be Determined When Building the Scale Space
σ - scale-space coordinate
O - number of groups (octaves)
S - number of layers within a group
In the above scale space (represented by the Gaussian pyramid), the relationship between σ and O, S is as follows:
σ(o, s) = σ₀ · 2^(o + s/S),  o ∈ [0, …, O−1],  s ∈ [0, …, S+2]    (3-5)
where σ₀ is the base-layer scale, o is the index of the group (octave), and s is the layer index within the group. The scale coordinate of a key point is computed with formula (3-5) from the group and the within-group layer in which the key point was detected.
When the Gaussian pyramid is first built, the input image has to be pre-blurred to serve as image 0 of group 0, which is equivalent to discarding the highest spatial sampling rate. The usual practice is therefore to first double the size of the image to generate group −1. We assume that the initial input image has already been Gaussian-blurred with σ = 0.5 to combat aliasing; if the input image is enlarged to twice its size by bilinear interpolation, this is equivalent to a blur of σ = 1.0.
The k in formula (3-4) is 2 raised to the power of the reciprocal of the number of layers per group, i.e.:
k = 2^(1/S)    (3-6)
When constructing the Gaussian pyramid, the scale coordinate of each layer within a group is calculated as follows:
σ(s) = √((k^s·σ₀)² − (k^(s−1)·σ₀)²)    (3-7)
where σ₀ is the initial scale (Lowe takes σ₀ = 1.6 and S = 3) and s is the layer index within the group; the within-group scale coordinates of the same layer are the same across different groups. The next layer of an image within a group is obtained by Gaussian-blurring the previous layer with σ(s).
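As a small worked example of formulas (3-5) to (3-7) (σ₀ = 1.6 and S = 3 are Lowe's choices; the code itself is only an illustration):

    // Prints, for each layer of a group, the absolute scale k^s * sigma0 and the
    // incremental blur sigma(s) from formula (3-7) that takes layer s-1 to layer s.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double sigma0 = 1.6;
        const int S = 3;
        const double k = std::pow(2.0, 1.0 / S);                   // formula (3-6)

        for (int s = 1; s < S + 3; ++s)
        {
            double sigCur  = sigma0 * std::pow(k, s);
            double sigPrev = sigma0 * std::pow(k, s - 1);
            double sigInc  = std::sqrt(sigCur * sigCur - sigPrev * sigPrev);  // formula (3-7)
            std::printf("layer %d: scale %.3f, incremental blur %.3f\n", s, sigCur, sigInc);
        }
        // The absolute scale including the group index o is sigma0 * 2^(o + s/S), formula (3-5).
        return 0;
    }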