Knowledge and principles of the computer vision technology behind panorama stitching

Many readers are probably curious about the mysteries behind panorama stitching: why can pictures taken from different angles be aligned automatically (more precisely, how are the affine and perspective distortions between images handled)? How can the software automatically find the parts of the pictures that fit together and stitch them correctly? How are the differences in brightness and tone between pictures balanced? Behind each step lies technology that is complex but quite elegant, so this article attempts to explain the working principle of each step in an intuitive way, and to push back against the view that cutting-edge technology is "impressive but incomprehensible".

Let us first outline the whole process:

--- Between pictures ---
1. Feature point matching : find the parts of the pictures that match one another.
2. Image matching : use the matched feature points to estimate the geometric transformation between images.

--- Global optimization and seamless connection ---
3. Panorama straightening : correct the relative 3D rotation of the camera between shots. The camera is usually not held perfectly level, and each picture is tilted to a different degree; skipping this step can leave the panorama wavy.
4. Image gain compensation : balance the brightness and tone of all the pictures globally.
5. Multi-band image blending : even after step 4 there will still be visible seams between images, vignetting (the brightness or saturation at the periphery of an image is lower than at the center), and parallax effects (due to the movement of the camera lens); this step blends them away. For a quick preview, the whole pipeline is sketched in code right after this list.
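Before going into each step, here is a minimal sketch showing how the entire pipeline can be run end to end with OpenCV's high-level stitcher (assuming the opencv-python package; the file names are placeholders):

```python
# Minimal sketch: OpenCV's Stitcher runs essentially this whole pipeline
# (feature matching, homography estimation, straightening, gain compensation,
# multi-band blending) behind a single call.
import cv2

paths = ["left.jpg", "middle.jpg", "right.jpg"]   # placeholder file names
images = [cv2.imread(p) for p in paths]

stitcher = cv2.Stitcher_create()                  # OpenCV 4.x API
status, panorama = stitcher.stitch(images)
if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", panorama)
else:
    print("stitching failed with status", status)
```

The rest of the article unpacks what happens inside that one stitch() call.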

---------------- (i) Feature point matching ----------------

The first step in synthesizing a panorama is to extract and match local feature points across all of the source images.

1. What are feature points?

Generally speaking, a feature point in a picture is a point whose surroundings carry a large amount of information, so that from these distinctive local regions alone one can basically infer the whole picture. Examples are the edges and corners of objects, stars shining in the night sky, or the patterns and textures in a picture.


The boxed regions in the picture are the parts that determine its characteristics.

In fact, when a person identifies an object or a pattern, they also do it by observing what characteristics the object has and matching them against experience and memory. For example, suppose we see a fruit in the supermarket and observe that it is green (a color feature), spherical (a shape feature), and covered in dark stripes (a pattern feature). From experience we can then tell that this is a watermelon. Of course, people use many other characteristics to reinforce such judgments, and the whole process from receiving the information to making the judgment flows so smoothly that we are not even aware of it. This is the subtlety of biological vision; fully unraveling the biological visual system still seems far away, so until neural networks can crack and simulate the workings of the biological brain, computer vision has to carry the burden.

That was a slight digression; before continuing, two questions: which features are extracted, and how are they extracted?

As mentioned above, a picture has many kinds of characteristics. Simple ones include color, shape, and pattern; more complex ones include the self-similarity of a pattern (whether a similar motif repeats) or the other objects in the scene (for example, on a supermarket fruit shelf we can be more confident that a watermelon is a watermelon; the fruit shelf in the background is itself a feature). Extracting such features occupies a very important position in computer vision; after all, good feature selection is the main prerequisite for an entire recognition system to work.

[Theoretical part]

In panorama synthesis, "point matching" is used, i.e., matching local image information. Because the features that can be extracted from a small local patch are limited (shape features, for instance, are unsuitable here), the usual approach is to use the entire local pattern as the feature. The most common choice is the SIFT feature (scale-invariant feature transform). Its applications include object recognition, robot mapping and navigation, image stitching, 3D model building, gesture recognition, image tracking, and motion matching. The algorithm was published by David Lowe in 1999 and, in the rapidly evolving field of computer vision, is still widely regarded as the best feature detection and description algorithm (i.e., the state of the art).

Extracting SIFT features is divided into two steps: detection and description (forming the feature vector).
In simple terms, detection scans every position of the image at every scale (scale can be understood as local zooming). To judge whether a point is a feature point, the local extremum of the difference of Gaussians (DoG) is computed at that scale. As the first figure shows, the local image is measured at different scales (the convolution of the local image with Gaussian filters of different widths), all at the same size. The local extremum computation, shown in the second figure, compares the DoG value of the pixel under test with its 26 neighbors (8 at the same scale plus 9 in each of the two adjacent scales); whether it is a maximum or minimum extremum determines the "feature strength" of that pixel.
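As a concrete illustration, here is a minimal sketch using OpenCV's built-in SIFT implementation, which constructs the Gaussian/DoG scale space and performs this 26-neighbor extremum test internally (assuming opencv-python; the file name is a placeholder):

```python
# Minimal sketch: detect SIFT keypoints and compute their descriptors.
# OpenCV builds the DoG pyramid and finds the scale-space extrema internally.
import cv2

img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), "keypoints, descriptor shape:", descriptors.shape)  # (N, 128)
```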

Even once the feature points are found, directly matching the raw local images around them is not feasible. First, the greatest advantage of the SIFT feature itself is that it is "position, scale, and rotation invariant": within a certain range, rotating the camera or pulling it closer still detects the same feature point locations (repeatability) and extracts the same feature vectors (scale and rotation invariance), so images taken under different viewing conditions can be matched by their SIFT features. Raw pixel values have none of these properties. In addition, the pixel information of a local image is large, which makes matching inefficient. Suppose the local patch is 16*16 pixels and each pixel carries 3 RGB values; then each feature point corresponds to a 768-dimensional vector (that is, each point is represented by 768 numbers). Considering that covering the scale and rotation variations would multiply this vector ten or even dozens of times over, and that a picture may contain thousands of feature points while a database contains thousands of images, the sheer amount of computation in matching would be a disaster for the CPU.

To ensure that the extracted feature vector is "rotation invariant", the first thing to do is to assign a "direction" to each feature point, and this direction must be consistent with the local image content: ideally, if the local image is rotated, the re-extracted "direction" rotates by the same angle. The way SIFT chooses this angle is to compute the gradient "angle" and "magnitude" of every pixel in the Gaussian-blurred local image, then let the pixels vote and take the angle with the most votes as the dominant direction. The formulas for the angle and magnitude look long but actually just compute differences between adjacent pixels, giving the magnitude (m) and the angle (θ).
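For reference, the standard SIFT formulas for the magnitude and angle (where L denotes the Gaussian-blurred image) are:

```latex
m(x, y) = \sqrt{\bigl(L(x+1, y) - L(x-1, y)\bigr)^{2} + \bigl(L(x, y+1) - L(x, y-1)\bigr)^{2}}
\qquad
\theta(x, y) = \tan^{-1}\!\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}
```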

The magnitudes and angles computed over the local image can then be assembled into the feature vector (also called the keypoint descriptor).

The way to form this vector is to "quantize" the angles (for example, divide 360 degrees evenly into 8 bins) and then accumulate them by location. The figure shows a 2x2 matrix of histograms computed from an 8x8 local sample; in practice a 16x16 local sample is used to form a 4x4 matrix. Flattening this matrix into a vector gives 8*4*4 = 128 elements, as the sketch below illustrates.
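The following toy sketch shows just this quantize-and-accumulate step (it deliberately omits the Gaussian weighting, interpolation, rotation normalization, and vector normalization that real SIFT adds):

```python
# Toy sketch of the descriptor layout: quantize gradient orientations into
# 8 bins and accumulate them over a 4x4 grid of cells covering a 16x16 patch.
import numpy as np

def toy_descriptor(patch):                      # patch: 18x18 grayscale floats
    dy = patch[2:, 1:-1] - patch[:-2, 1:-1]     # central differences -> 16x16
    dx = patch[1:-1, 2:] - patch[1:-1, :-2]
    mag = np.hypot(dx, dy)                      # gradient magnitude m
    ang = np.arctan2(dy, dx) % (2 * np.pi)      # gradient angle theta in [0, 2pi)
    bins = (ang / (2 * np.pi) * 8).astype(int) % 8
    desc = np.zeros((4, 4, 8))                  # 4x4 cells, 8 orientation bins
    for i in range(16):
        for j in range(16):
            desc[i // 4, j // 4, bins[i, j]] += mag[i, j]
    return desc.ravel()                         # 8*4*4 = 128 elements

print(toy_descriptor(np.random.rand(18, 18)).shape)   # (128,)
```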

After the feature vectors are formed, the next question is how to match them. The most basic method can be called nearest-neighbour search: for each feature vector, find the feature vector at the smallest straight-line (Euclidean) distance in the 128-dimensional space, computed the same way as in 2 dimensions, and treat the nearest pair as a match. The original author additionally uses the k-d tree algorithm to perform this nearest-neighbour search efficiently in high dimensions.
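A minimal sketch of this matching, assuming scipy is available; the random descriptors are placeholders, and the 0.75 ratio test is a common heuristic (Lowe's ratio test) rather than something specified above:

```python
# Nearest-neighbour matching of 128-D descriptors with a k-d tree.
import numpy as np
from scipy.spatial import cKDTree

desc1 = np.random.rand(500, 128)    # placeholder descriptors from image 1
desc2 = np.random.rand(600, 128)    # placeholder descriptors from image 2

tree = cKDTree(desc2)
dist, idx = tree.query(desc1, k=2)              # two nearest neighbours each
good = dist[:, 0] < 0.75 * dist[:, 1]           # ratio test filters ambiguous matches
matches = [(i, idx[i, 0]) for i in np.where(good)[0]]
print(len(matches), "tentative matches")
```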

[end of theoretical part]


---------------- (ii) Picture matching ----------------

The next goal is to find all the parts of the pictures that match (that is, overlap) and then assemble a basic panorama from all the pictures. Because every picture may potentially overlap with every other picture, matching them all naively requires a number of comparisons quadratic in the number of pictures. In practice, however, the relationship between two pictures can be estimated from just a few relatively accurate matching points between them.

The most common tool is RANSAC (random sample consensus), whose purpose here is to exclude the matches that do not conform to the geometric transformation obeyed by the majority. The remaining matching points are then used to estimate the homography (homography estimation), i.e., the transformation that maps one picture onto the other; both steps are described in detail below.

RANSAC is used to find the feature point matches that conform to the geometric constraint, and the homography matrix then aligns the contents of the two pictures.

"RANSAC"

[Theoretical part]

First, the principle of RANSAC. RANSAC is an iterative algorithm used to estimate the parameters of a mathematical model from observed data while separating the data into inliers and outliers. In short, observed data often contain a lot of noise; SIFT matches, for example, can sometimes be wrong because similar patterns appear in different places. RANSAC works by repeated sampling: it draws a small random subset from the observations, estimates model parameters from it, measures how large the error is over all the data, keeps the estimate with the smallest error as the best one, and uses it to separate the inliers from the outliers.

Here is a simple example: finding the best-fitting line in a set of points. Suppose the set contains both inliers and outliers, where the inliers are points that can be fitted by a line and the outliers are points that cannot. If we fit this line by simple least squares, we will not get a line that fits the inliers, because least squares is skewed by the outliers. RANSAC, by contrast, can compute the model from the inliers alone, with sufficiently high probability. However, RANSAC cannot guarantee that its result is optimal, so the parameters must be chosen carefully to keep that probability high enough.

The figure is a simple example of using RANSAC to find the parameters of the line that fits the inliers (which can be seen as finding a and b in y = ax + b), as well as the inlier data itself.

The reason RANSAC remains accurate under heavy noise is mainly that estimating from a small random sample avoids letting the outliers influence the result. The process (from the Wikipedia description) is:

1. Randomly select the minimum number of data points needed to determine the model parameters.
2. Fit the model to this sample.
3. Check how many of all the data points fit the model within a tolerance; these are the candidate inliers.
4. If the number of inliers is large enough, re-estimate the model from all of them and record its error.
5. Repeat steps 1-4 a fixed number of times and keep the model with the most inliers (or smallest error).

Two things in this process must be chosen by hand: the threshold for deciding whether a data point fits the model, and the number of iterations; together they determine the accuracy and the computation time. The sketch below puts these steps into code.
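A minimal sketch of these steps, fitting the line y = ax + b from the example above (the tolerance, iteration count, and minimum inlier count are illustrative values):

```python
# RANSAC line fitting: repeatedly fit a line to 2 random points, count the
# points within tolerance, and keep the model with the most inliers.
import numpy as np

def ransac_line(x, y, n_iters=200, tol=0.1, min_inliers=30, seed=0):
    rng = np.random.default_rng(seed)
    best_params, best_inliers = None, None
    for _ in range(n_iters):
        i, j = rng.choice(len(x), size=2, replace=False)   # minimal sample: 2 points
        if x[i] == x[j]:
            continue                                        # degenerate sample
        a = (y[j] - y[i]) / (x[j] - x[i])                   # candidate model
        b = y[i] - a * x[i]
        mask = np.abs(y - (a * x + b)) < tol                # points within tolerance
        if mask.sum() >= min_inliers and (
            best_inliers is None or mask.sum() > best_inliers.sum()
        ):
            a, b = np.polyfit(x[mask], y[mask], deg=1)      # re-fit on all inliers
            best_params, best_inliers = (a, b), mask
    return best_params, best_inliers   # (None, None) if no model found

# Synthetic demo: 70 points on y = 2x + 1 plus 30 corrupted outliers.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = 2 * x + 1 + rng.normal(0, 0.02, 100)
y[:30] += rng.uniform(-3, 3, 30)
params, inliers = ransac_line(x, y)
print("estimated (a, b):", params)
```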

To summarize, the inputs of the RANSAC algorithm are:

1. The observed data (containing both inliers and outliers)
2. A model that can be fitted to part of the observed data (a model consistent with the inliers)
3. The minimum number of inliers required to accept a model
4. The threshold for deciding whether a data point fits the model (the tolerance between data and model)
5. The number of iterations (how many random samples are drawn)

The outputs are:
1. The model parameters that best fit the data (if the number of inliers is below input 3, no such model exists for this data)
2. The inlier set (the data that fit the model)

Advantages: the model parameters can be found accurately even in data containing a large number of outliers, and the parameters are not affected by the outliers.
Disadvantages: there is no upper bound on the time needed to compute the parameters exactly; when the number of iterations is capped, the resulting parameters may not be optimal and may not even fit the true inliers. So when setting the RANSAC parameters one must weigh accuracy against efficiency for the application at hand, and decide accordingly how many iterations to run. The maximum error threshold of the model is likewise tuned per application. Another limitation is that RANSAC can estimate only one model at a time.

[end of theoretical part]

Applying RANSAC to the point matches from the previous diagram gives:

Matches that do not conform to the geometric transformation obeyed by the majority are judged to be mismatches and excluded.


"homography Estimation"

After the correct matches between two pictures are found, these points can be used to estimate the geometric transformation between the two pictures. Simply put: suppose one of the pictures is fixed to the table; how should the other be positioned, rotated, and stretched so that it coincides with the first? A sketch of this estimation follows.
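A minimal sketch of this step, assuming opencv-python; the point coordinates are placeholders standing in for the RANSAC-filtered matches, and 3.0 pixels is an illustrative reprojection threshold:

```python
# Estimate the homography between two sets of matched point coordinates with
# RANSAC, then (commented out) warp one image onto the other.
import cv2
import numpy as np

pts1 = np.float32([[10, 10], [200, 15], [190, 180], [12, 170], [100, 90]]).reshape(-1, 1, 2)
pts2 = np.float32([[52, 31], [241, 40], [224, 206], [55, 196], [140, 115]]).reshape(-1, 1, 2)

H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
print("H =\n", H)                                    # 3x3 matrix, defined up to scale
# warped = cv2.warpPerspective(img1, H, (width, height))   # apply to the actual image
```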


"Waiting to be added"
Suppose there is a pair of matching points, and the homography matrix between them is the H in the formula.
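In homogeneous coordinates, the standard form of this relation, for a matching pair (x, y) and (x', y'), is:

```latex
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}
\sim
H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix},
\qquad
H =
\begin{pmatrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & h_{33}
\end{pmatrix}
```

Since H is defined only up to scale, it has 8 degrees of freedom, so at least 4 point correspondences are needed to estimate it.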

"Waiting to be added"
