Virtual View Synthesis Method and Self-Evaluation Metrics for Free Viewpoint Television and 3D Video

This is a well-regarded paper on virtual view rendering, so it has been roughly translated here for reference.

Virtual view synthesis method and self-evaluation metrics for free viewpoint TV and 3D video

Abstract

Virtual view synthesis is one of the most important techniques for realizing free viewpoint TV and three-dimensional video. In this paper, we propose a virtual view synthesis method that produces high-quality intermediate views for these applications, and we propose new evaluation metrics, called SPSNR and TPSNR, to measure spatial and temporal consistency, respectively. The proposed view synthesis method consists of five steps: depth preprocessing, depth-based 3D warping, depth-based histogram matching, main-plus-auxiliary view blending, and depth-based hole filling. The effectiveness of the proposed method is validated by evaluating the quality of the synthesized views with various metrics, including peak signal-to-noise ratio (PSNR), structural similarity (SSIM), the DCT-based video quality metric (VQM), and the newly proposed metrics. We also show that the synthesized images are natural in both objective and subjective terms.

1. Introduction

3D video gives users an immersive feeling and is now regarded as a key technology driving the next wave of multimedia experiences, such as 3D movies, 3D TV, 3D displays, and 3D mobile services.

The most critical technology modules in the 3D production chain are coding and rendering. Due to the massive amount of data involved, effective coding is becoming increasingly important for 3D systems. In the past, a number of research institutions and standardization organizations have worked to address this problem, including the MPEG-2 Multi-View Profile (MVP), the MPEG-4 Multiple Auxiliary Components (MAC), and MPEG/JVT Multi-view Video Coding (MVC).

Recently, MPEG launched a major effort for 3D video applications. The previous MPEG/JVT work on MVC focused mainly on improving the coding efficiency of general multi-view coding scenarios, whereas the new work targets a wider range of technologies, including depth estimation, coding, and rendering. One of the key design ideas is to use depth images together with camera parameters to render intermediate views for free-viewpoint navigation or 3D displays.

On the other hand, due to the growing diversity of 3D services and 3D displays, proper rendering of 3D views is essential. In other words, the views must be resampled, or each view adjusted separately, according to the number of views and the resolution required by the display. For applications such as free-viewpoint video, where more views need to be rendered on the display than are actually encoded, resampling means generating virtual views from the real ones. Generating arbitrary views of a 3D scene is a hot research topic in computer graphics. Among the many rendering techniques, image-based rendering (IBR) has received much attention in recent years; these techniques use images rather than geometry as primitives to render virtual views. According to the amount of geometric information used, they are usually divided into three categories: rendering without geometric information, rendering with explicit geometric information, and rendering with implicit geometric information. Plenoptic modeling, light field rendering, the lumigraph, ray-space and related techniques use no geometric information; in these methods, the quality of the synthesized virtual view depends heavily on the baseline distance and on the number of available views within the limited viewing range, with smaller baselines and more views giving better quality. On the other hand, depth-image-based 3D warping and layered depth images belong to the second category, while view morphing and view interpolation belong to the third. Clearly, in these explicit or implicit geometry-based rendering methods, the quality of the synthesized view depends to a large extent on the accuracy of the geometric information.

In this article, we propose a new view synthesis algorithm for free-viewpoint video and 3D video, together with new evaluation metrics that measure the spatio-temporal consistency of the synthesized views. The proposed view synthesis method has five steps: depth preprocessing, depth-based 3D warping, depth-based histogram matching, main-plus-auxiliary view blending, and depth-based hole filling. First, the depth data of the scene is preprocessed to correct errors and enhance the temporal-spatial consistency of the depth values. Second, depth-based 3D warping is adopted to avoid the contour errors caused by directly warping the texture, which introduces rounding errors. Third, a depth-based histogram matching algorithm is applied to reduce the illumination difference between the two reference views. Fourth, main-plus-auxiliary view blending is introduced to combine the 3D-warped reference images in a way that is more robust to inaccurate depth information and camera parameters. Finally, the remaining holes are filled using a depth-based inpainting technique. The synthesized views are evaluated with the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), the DCT-based video quality metric (VQM), and the newly proposed spatial PSNR (SPSNR) and temporal PSNR (TPSNR).

The remainder of this article is organized as follows. In the second part we describe the basis of view synthesis techniques. In the third and fourth parts we present the proposed view synthesis algorithm and the evaluation metrics, respectively. In the fifth part we show and evaluate the performance of the proposed algorithm, and the sixth part concludes the article.

2. Background

This section briefly reviews the camera geometry model and the conventional depth-based view synthesis method.

A. Camera geometry model

A general pinhole camera is used as the model, with optical center C and image plane I. A 3D point W is mapped to the image point m at which the line through C and W intersects the image plane I. The line through C that is orthogonal to I is called the optical axis (Z), and its intersection with I is the principal point (p). The distance between C and I is the focal length.

Let w = [x, y, z]^T be the coordinates of W in the world reference frame (fixed arbitrarily) and m = [u, v]^T the coordinates of its projection in the image plane. The mapping from 3D coordinates to 2D coordinates is a perspective projection, which can be represented as a linear transformation in homogeneous coordinates. Let m~ = [u, v, 1]^T and w~ = [x, y, z, 1]^T be the homogeneous coordinates of m and W; the projection is then given by the 3x4 perspective projection matrix P~.
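(The equation itself is not reproduced in the translated text; in the standard pinhole formulation it refers to, the relation can be written as:)

    k · m~ = P~ · w~,    i.e.    k · [u, v, 1]^T = P~ · [x, y, z, 1]^T

where k is a non-zero scale factor.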

k is the perspective scale factor; when the perspective projection matrix is suitably normalized, k becomes the true orthogonal distance from the point to the focal plane of the camera. The camera is therefore modeled by its perspective projection matrix P~ (camera matrix for short), which can be decomposed, using QR factorization, as follows:

The matrix A depends only on the internal parameters and has the following form:

α_u = -f·k_u and α_v = -f·k_v are the focal lengths in horizontal and vertical pixels (f is the focal length in millimeters; k_u and k_v are the number of pixels per millimeter along the u and v axes), (u_0, v_0) are the coordinates of the principal point (the intersection of the optical axis with the image plane in Figure 1), and γ is the skew factor (nonzero only when the u and v axes are not orthogonal).

The position and orientation of the camera are represented by a 3x3 rotation matrix R and a translation vector t, which describe the rigid transformation between the camera reference frame and the world reference frame.
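(Neither the decomposition referred to above nor the form of A is shown in the translated text; under the usual intrinsic/extrinsic conventions, and allowing for possible sign and axis differences in the original paper, they can be written as:)

    P~ = A · [R | t]        (with R, t expressed in the world-to-camera convention)

    A = [ α_u   γ    u_0 ]
        [  0   α_v   v_0 ]
        [  0    0     1  ]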

B. Depth-based viewpoint synthesis

A typical depth-based virtual view synthesis framework is shown in Figure 2. The goal of this system is to generate a virtual view using the camera parameters, texture images, and depth images of its neighboring views.

3D image warping is the key technique in depth-image-based virtual view synthesis. In 3D warping, the pixels of the reference image are back-projected into 3D space and then re-projected onto the target view, as illustrated in Figure 3.

Equations 4 and 5 represent the back-projection and re-projection processes, respectively.

A, R, and t are the camera parameters, and d is the depth value of the point in 3D space that is back-projected and re-projected. The homogeneous coordinates (l, m, n) in Equation 5 are normalized to (l/n, m/n, 1), giving the integer pixel coordinates (u, v) in the virtual camera.
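As a rough illustration of this two-step warping, the following Python sketch back-projects a single pixel and re-projects it into the virtual camera. The function name, the per-camera (A, R, t) arguments, and the camera-to-world convention follow the description above but are otherwise assumptions, not code from the paper.

    import numpy as np

    def warp_pixel(u, v, d, A_ref, R_ref, t_ref, A_virt, R_virt, t_virt):
        """Back-project pixel (u, v) with depth d from the reference camera into
        3D space, then re-project it into the virtual camera (a sketch of the
        two-step 3D warping described above; conventions may differ per dataset)."""
        # Back-projection: pixel -> 3D point (Eq. 4, up to convention)
        p_cam = d * np.linalg.inv(A_ref) @ np.array([u, v, 1.0])   # point in reference camera coords
        p_world = R_ref @ p_cam + t_ref                            # point in world coords

        # Re-projection: 3D world point -> virtual image plane (Eq. 5)
        p_virt = A_virt @ (R_virt.T @ (p_world - t_virt))          # homogeneous coords (l, m, n)
        l, m, n = p_virt
        return int(round(l / n)), int(round(m / n))                # integer pixel coords in the virtual view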

3. Proposed view synthesis algorithm

The proposed view synthesis algorithm includes five steps: depth preprocessing, depth-based 3D warping, depth-based histogram matching, main-plus-auxiliary view blending, and depth-based hole filling. Figure 4 shows the proposed view synthesis scheme; the following subsections describe each sub-algorithm in detail.

A. Depth preprocessing

In general, depth data can be obtained from a dedicated depth-camera system, from computer graphics tools, or by mathematical computation with a depth estimation algorithm. Depth estimation is currently the most popular method and the focus of research, because depth cameras are too expensive and computer graphics tools cannot represent real-world scenes.

However, depth data computed by such algorithms often contains erroneous values in particular regions of the image, or inconsistencies across time and space, due to the local nature of the depth estimation process. Together, these problems cause various visual artifacts in the generated image. To address them, we propose preprocessing the depth data. The proposed preprocessing consists of three steps: temporal filtering, initial error compensation, and spatial filtering. In each step a median filter is applied rather than a mean filter, because a mean filter produces new pixel values that do not exist in the original depth image and thus degrades the rendering quality.

As the first step, we apply a one-dimensional median filter along consecutive depth frames at each pixel position. It aims to reduce the temporal inconsistency of depth values that belong to the same object or background. In this article, we apply the following median filter:

x_{i,j,t} is the pixel value at spatial position (i, j) at time t, J_{i,j,t} is a 3x3 group centered on the spatio-temporal position (i, j, t), and r is the threshold that determines whether the filter is applied.
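A minimal Python sketch of one plausible reading of this temporal filter is given below; the 3x3x3 neighbourhood, the threshold test, and the function name are assumptions made for illustration, since Equation (6) is not reproduced in the translation.

    import numpy as np

    def temporal_median_filter(depth_seq, r=10):
        """depth_seq is a (T, H, W) array of depth frames.  A pixel is replaced by
        the median of a small spatio-temporal neighbourhood when it deviates from
        that median by more than the threshold r (one reading of Eq. (6))."""
        T, H, W = depth_seq.shape
        out = depth_seq.copy()
        for t in range(1, T - 1):
            for i in range(1, H - 1):
                for j in range(1, W - 1):
                    # 3x3x3 spatio-temporal neighbourhood centred on (i, j, t)
                    neigh = depth_seq[t-1:t+2, i-1:i+2, j-1:j+2]
                    med = np.median(neigh)
                    if abs(float(depth_seq[t, i, j]) - med) > r:
                        out[t, i, j] = med
        return out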

The next step compensates for initial errors, which arise from the erroneous merging of foreground and background in conventional depth estimation. This usually happens when the foreground and the background have similar textures. The human eye can easily identify such errors, but it is a difficult task for an automatic algorithm. In this article, we use image dilation and erosion (Equations 7 and 8, respectively) to correct these errors. Because assigning a background depth value to foreground pixels degrades the generated image more than the opposite error, the dilation operation is applied before the erosion operation in the proposed scheme.

Here A is the image, B is the structuring element, A_B denotes the region of A covered by B, and (x, y) is a pixel of A. In this article we use a circular structuring element whose radius is set to 5.
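The following OpenCV-based Python sketch illustrates this dilation-then-erosion compensation with a circular structuring element of radius 5; applying it to the whole depth map, rather than only to detected error regions, is a simplification made for illustration.

    import cv2

    # Circular structuring element of radius 5 (diameter 11), as described above
    radius = 5
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * radius + 1, 2 * radius + 1))

    def compensate_depth_errors(depth_map):
        dilated = cv2.dilate(depth_map, kernel)   # dilation first: pushes foreground depth outwards
        closed = cv2.erode(dilated, kernel)       # erosion afterwards restores object size
        return closed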

The final step smooths outliers in the estimated depth image using a 2-D median filter. It smooths the outliers around objects in the depth image and removes unwanted noise. In this article, we use a 5x5 median filter for each pixel at position (i, j), as follows:

J_{i,j} is the set of pixels in a 5x5 window around position (i, j).
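A one-line SciPy sketch of this spatial step is shown below (for illustration only; the function name is an assumption):

    from scipy.ndimage import median_filter

    def spatial_median(depth_map):
        # 5x5 spatial median filter over the depth map, matching the window in Eq. (9)
        return median_filter(depth_map, size=5)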

Figure 5 shows the result of each step of the proposed depth map preprocessing applied to the Breakdancers video sequence provided by Microsoft Research Asia (MSR). The effect of the proposed scheme is especially evident on the faces of the two men standing behind the dancer on the left and along the dancer's boundary on the floor. The depth preprocessing method not only compensates for the initial depth errors but also restores temporal-spatial consistency. Therefore, the preprocessed depth maps lead to a significant improvement in both the subjective and objective quality of the generated images.

B. Depth-based 3D mapping

Most previous view synthesis algorithms warp the texture image directly using the corresponding depth image. However, direct 3D warping of the texture image from a neighboring view to the virtual image plane often produces erroneous black contours in the generated virtual view, as shown in Figure 6. These contours are caused by rounding errors, due to the integer representation of the virtual view coordinates, and by errors in the original depth values.

However, once we have the depth image corresponding to the virtual view, we can usually obtain accurate texture values from the neighboring view by inverse warping, without producing erroneous black contours in the virtual view. To obtain the virtual-view depth image, we first warp the depth values of the reference view; note that erroneous black contours appear on the warped depth image for the same reason as in the texture warping. To remove these erroneous contours, we apply a median filter. Figure 7 illustrates this procedure.

C. Depth-based histogram matching

Suppose we have two reference views, as shown in Figure 2, and want to generate a virtual view between them. After 3D warping we obtain two images at the virtual viewpoint, one generated from each reference view. Before blending the two warped images, we apply histogram matching to reduce the illumination and color differences between them. In the proposed histogram matching algorithm, the mapping function is built from the cumulative histogram distributions, and this modified histogram matching is then applied region by region using depth-based segmentation.

The histograms of the two virtual-view images generated from the reference views are analyzed, and the 3D-warped images are adjusted so that they have the same distribution. The whole process is as follows. The first step is to modify the two 3D-warped images so that they have the same holes, and then to apply median filtering to remove noise, as shown in Figure 8. Using these modified images instead of the original 3D-warped images further improves the accuracy of the histogram matching.

The second step is to calculate the histograms of the left and right images. Let Y_L[m,n] denote the luminance of the left image; its histogram is then calculated as follows:

In Equation (10), W is the width of the image, H is the height, and v ranges from 0 to 255. Histogram matching is performed on the left and right images mapped to the virtual viewpoint; both histograms are needed to build the mapping function M. First, the cumulative histogram of the left image, C_L[v], is constructed:

The histogram h_R[v] and the cumulative histogram of the right image are calculated in the same way. The left and right images mapped to the virtual viewpoint have the same holes and are median-filtered, as shown in Figures 8(c) and 8(d), so that, apart from their illumination, the two images have almost identical textures.

Based on these cumulative histograms, we construct the cumulative histogram of the virtual view, C_V[v]:

Here C_L and C_R are the cumulative histograms of the left and right images, respectively. In general, the weighting factor α is calculated from the baseline distances, as follows:

where t is the translation vector of each viewpoint.
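(Equation (13) itself is not reproduced in the translation; one common form of such a baseline-distance weight, stated here as an assumed reconstruction using the translation vectors t_L, t_R, and t_V of the left, right, and virtual cameras, is:)

    α = ||t_V - t_R|| / (||t_V - t_L|| + ||t_V - t_R||)

so that the left view receives more weight when the virtual camera is closer to it.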

The mapping function between the left and virtual images is obtained by matching the cumulative histogram of the left reference image to that of the virtual image, as in Equation (14); Figure 9 shows an example.

The mapping function is applied to the left image Y_L[m,n] to obtain the histogram-matched image Y_HM,L[m,n] according to Equation (15); the histogram-matched right image Y_HM,R[m,n] is computed with the same formula.
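A global (non-region-wise) Python sketch of this cumulative-histogram matching is shown below; the function and variable names are hypothetical, uint8 luminance images and a common hole mask are assumed, and the depth-based segmentation described next is omitted.

    import numpy as np

    def histogram_match_to_virtual(Y_left, Y_right, alpha=0.5, mask=None):
        """Adjust both warped images towards the virtual-view cumulative histogram,
        which is the baseline-weighted mix of their own cumulative histograms
        (a sketch of Eqs. (10)-(15); mask can exclude hole pixels)."""
        def cum_hist(img):
            vals = img[mask] if mask is not None else img
            h, _ = np.histogram(vals, bins=256, range=(0, 256))
            c = np.cumsum(h).astype(np.float64)
            return c / c[-1]                      # normalised cumulative histogram

        c_l, c_r = cum_hist(Y_left), cum_hist(Y_right)
        c_v = alpha * c_l + (1.0 - alpha) * c_r   # Eq. (12): target cumulative histogram

        def build_lut(c_src):
            # Eq. (14): map each source level to the level whose target cumulative
            # count first reaches the source cumulative count
            return np.searchsorted(c_v, c_src).clip(0, 255).astype(np.uint8)

        lut_l, lut_r = build_lut(c_l), build_lut(c_r)
        return lut_l[Y_left], lut_r[Y_right]      # Eq. (15): apply the mapping functions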

In general, we assume that the characteristics of each camera cause differences in illumination and color, and that these affect each object and each color component differently. Based on this assumption, we apply region-wise histogram matching, using the depth values to segment the regions. Figure 10 shows a rough segmentation of the image in Figure 8(d).

Whereas conventional histogram matching transfers the histogram of one view onto the other so that both share the same histogram, the proposed method modifies both views to have the same distribution as the virtual view, taking the baseline distance into account when generating the virtual-view image; the mapping functions of the proposed histogram matching are illustrated in Figure 9.

Figure 11 shows an example of the proposed histogram matching. In this case the histograms of the left and right 3D-warped views have the same shape but different distributions, due to the illumination and color differences. By mapping both reference views so that they have the same cumulative histogram as the virtual view, we can reduce the illumination difference between the two views. The proposed histogram matching is applied independently to each color component in RGB format.

D. Blending of the main and auxiliary views

Because the camera parameters are inaccurate and the boundaries of the depth and texture images do not match exactly, visible errors appear around large holes. To remove these visible errors, we expand the hole boundaries using image dilation, as shown in Figure 12. These expanded holes can then be filled from the other 3D-warped view. By removing these errors, the generated view becomes more natural.

The next step is view blending, which combines the two 3D-warped views into the virtual view. The simplest method is to use a weighted sum of the two images, as follows:

Here I_L and I_R are the 3D-warped texture images of the reference views and I_V is the blending result; in general, the weighting factor α is calculated from the baseline distances as in Equation (13).

However, the disadvantage of this method is that inconsistent pixel values between the two views (caused, for example, by inaccurate camera parameters or inconsistent depth values) both contribute to the blended image, often leading to artifacts such as double edges and blurring, as shown in Figure 13. To avoid these errors, we define a main view and an auxiliary view for image blending. The main view is the primary reference: most pixel values are taken from it, while the auxiliary view serves as a complementary source for filling holes. Equation (16) is then rewritten as Equation (17): in the 3D-warped main view, α equals 1 in non-hole regions and 0 in hole regions. In other words, most of the blended view comes from the main view, and the remaining holes are filled from the auxiliary view. We choose as the main view the reference view closer to the virtual viewpoint.

Here I_B is the main view and I_A is the auxiliary view.
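A minimal Python sketch of this main-plus-auxiliary blending is given below; it assumes boolean hole masks for the two warped views (hypothetical names) and leaves pixels that are holes in both views for the inpainting stage.

    import numpy as np

    def blend_main_auxiliary(I_main, I_aux, hole_main, hole_aux):
        """Non-hole pixels come from the main (closer) reference view, holes are
        filled from the auxiliary view (a sketch of Eq. (17)), and pixels that are
        holes in both views are returned for the depth-based inpainting stage."""
        blended = I_main.copy()
        fill_from_aux = hole_main & ~hole_aux      # alpha = 0 only inside holes of the main view
        blended[fill_from_aux] = I_aux[fill_from_aux]
        remaining_holes = hole_main & hole_aux     # still empty -> depth-based inpainting
        return blended, remaining_holes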

E. Hole filling based on depth-based inpainting

The final step of the proposed virtual view synthesis method is depth-based hole filling. Although the view blending effectively fills most of the empty areas, some holes remain. In general these are caused by occlusion regions and by erroneous depth values; occlusion regions are defined as areas that are not visible in the reference images but are exposed in the synthesized view. Most existing hole-filling methods are image interpolation or inpainting techniques that fill the remaining areas from neighboring pixels according to geometric distance. However, we observe that filling holes with background pixels rather than foreground pixels looks more realistic, since these holes clearly belong to the background. We therefore propose a hole-filling algorithm that favors background pixels over foreground pixels, building on existing inpainting techniques.

The general inpainting problem is formulated as follows: the region Ω to be repaired and its boundary ∂Ω are defined, and a pixel p belonging to Ω is repaired using its neighborhood B_ε(p), as shown in Figure 14.

This idea is quite reasonable for general image inpainting, but we modify it when applying it to hole filling in view synthesis. Because the boundary of a particular hole may contain both foreground and background, in that case we replace the foreground boundary pixels with their background counterparts on the opposite side, as described in Equation (18). That is, we deliberately manipulate the hole so that its neighboring pixels belong to the background, as shown in Figure 15.

where FG and BG denote pixels belonging to the foreground and background, respectively.

To distinguish foreground from background, we use the corresponding depth data. In other words, for a hole whose horizontal boundary pixels lie on opposite sides, we consider the boundary pixel with the greater depth value to be foreground, and vice versa. Figure 16 shows the results of conventional inpainting and of the proposed depth-based inpainting technique.
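The following simplified Python sketch illustrates only the background-first idea with a row-wise fill, rather than the full inpainting formulation of Equation (18); it assumes that larger depth values correspond to the foreground, as stated above, and that all names are hypothetical.

    import numpy as np

    def fill_holes_background_first(color, depth, hole_mask):
        """Fill each horizontal run of hole pixels from the boundary pixel that
        belongs to the background (the side with the smaller depth value)."""
        out = color.copy()
        H, W = hole_mask.shape
        for y in range(H):
            x = 0
            while x < W:
                if hole_mask[y, x]:
                    x0 = x
                    while x < W and hole_mask[y, x]:
                        x += 1
                    left, right = x0 - 1, x            # boundary pixels on both sides
                    if left < 0 and right >= W:
                        continue                       # whole row is a hole; leave it
                    if left < 0:
                        src = right
                    elif right >= W:
                        src = left
                    else:
                        # prefer the background side (smaller depth value)
                        src = left if depth[y, left] <= depth[y, right] else right
                    out[y, x0:x] = color[y, src]
                else:
                    x += 1
        return out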

4. Self-evaluation metrics

To evaluate the performance of a virtual view synthesis algorithm, we generally measure the similarity between the generated view and an existing original view; PSNR, SSIM, and VQM are widely used for this. However, these metrics are only applicable when an original view is available at the virtual viewpoint position. Furthermore, they cannot measure temporal consistency, which is strongly affected by illumination changes and focus mismatches and to which the human eye is quite sensitive.

To overcome the limitations of the existing evaluation methods, we propose new evaluation metrics called spatial PSNR (SPSNR) and temporal PSNR (TPSNR). SPSNR measures spatial consistency by examining the spatial noise of the synthesized view. In general, view synthesis increases the high-frequency content, because the 3D-warped image and the holes introduce many high-frequency components. Therefore, we can evaluate spatial consistency by examining the amount of high-frequency content. From this point of view, SPSNR is defined as follows:

Here H and W are the height and width of the image. We use a 5x5 median filter to remove spatial noise, so that the difference from the unfiltered image contains only the high-frequency components. Analogously to the MSE used in PSNR, we define the amount of high-frequency content as the SMSE.

TPSNR measures temporal consistency and is defined similarly to SPSNR, except that the input image in Equation (20) is replaced by the difference between successive frames, so that TPSNR measures the temporal high-frequency components. The main advantage of the proposed metrics is that they use only the generated view itself.
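A Python sketch of one plausible reading of these definitions is given below; grayscale frames and a peak value of 255 are assumed, and the exact form of Equations (19)-(21) is not reproduced in the translation.

    import numpy as np
    from scipy.ndimage import median_filter

    def spsnr(frame, peak=255.0):
        """Spatial PSNR sketch: the high-frequency component is the difference
        between the frame and its 5x5-median-filtered version; SMSE is its mean
        square, used in place of the MSE of ordinary PSNR."""
        smooth = median_filter(frame, size=5).astype(np.float64)
        high_freq = frame.astype(np.float64) - smooth
        smse = np.mean(high_freq ** 2)
        return 10.0 * np.log10(peak ** 2 / smse) if smse > 0 else float('inf')

    def tpsnr(frame_t, frame_t1, peak=255.0):
        """Temporal PSNR sketch: the same measure applied to the difference between
        successive frames, so it reacts to temporal high-frequency components."""
        diff = frame_t1.astype(np.float64) - frame_t.astype(np.float64)
        return spsnr(diff, peak)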

5. Test results and analysis

We tested the proposed algorithm on two test sequences, Breakdancers and Ballet. Of the 8 views, views 3 and 5 are selected as reference views, and view 4 is chosen as the virtual view to be generated. Each sub-algorithm of the proposed method is evaluated with the existing objective metrics PSNR, SSIM, and VQM, and with the proposed SPSNR and TPSNR. Larger PSNR, SPSNR, and TPSNR values indicate better quality; for VQM the opposite holds; for SSIM, the closer the value is to 1, the better the quality. Version 2.3 of the view synthesis software released by Nagoya University, currently used as reference software in the MPEG FTV/3D video standardization activities, is compared with our proposed method; its default view blending is replaced by the main-plus-auxiliary blending, which makes the comparison more meaningful.

A. Experimental results of depth preprocessing

The results of the depth preprocessing are given in Table 1, and they correspond to the generated images shown in Figure 17. Under the existing evaluation methods, depth preprocessing does not noticeably improve the image quality, but it does improve SPSNR and TPSNR. In particular, we can see that the temporal consistency of the Ballet sequence has improved. Moreover, we can identify some subjective improvements, such as the more natural, smoother boundary between the dancer and the floor and around the head of the person standing on the left.

B. Experimental results of depth-based histogram matching

As shown in Table 2 and Figure 18, the proposed histogram matching improves the subjective quality by reducing the illumination and color differences, although the objective quality is slightly lower.

C. Experimental results of depth-based inpainting

Table 3 shows the experimental results of depth-based inpainting, which correspond to the synthesized sample images in Figure 19. When a hole boundary contains both foreground and background, the proposed depth-based inpainting fills the hole only from the background pixels. We can confirm that the proposed method improves both the subjective and the objective quality.

D. Experimental results of the proposed view synthesis method

The proposed view synthesis method includes the various sub-algorithms described above: depth preprocessing, depth-based 3D warping, depth-based histogram matching, main-plus-auxiliary view blending, and depth-based inpainting for hole filling. In this part, the proposed method is compared with the reference view synthesis software. The main tools of the reference software are depth-based 3D warping, inpainting-based hole filling, and weighted-sum view blending. In this experiment, the reference software's view blending is replaced by the main-plus-auxiliary method. Table 4 shows the experimental results, which correspond to the generated sample images in Figures 20 and 21. We can see that the images synthesized by the proposed method are better than those of the reference software, both subjectively and objectively.

6. Summary

In this article, we have proposed a virtual view synthesis method and self-evaluation metrics for free-viewpoint video and 3D video. The proposed method consists of depth preprocessing, depth-based 3D warping, compensation of illumination and color differences by depth-based histogram matching, and hole filling by a depth-based inpainting technique. Moreover, the main-plus-auxiliary view blending is more robust than weighted-sum blending and gives better subjective quality. The effectiveness of the proposed method was verified by evaluating the synthesized images with various quality metrics, including the proposed self-evaluation metrics SPSNR and TPSNR. We found that the results of the proposed method are subjectively and objectively better than those of the reference software currently used in the MPEG FTV/3D video activities.

