Monocular Image 3D Human Pose Estimation Under Self-Occlusion (ICCV 2013)


Abstract: This paper proposes a method for automatic 3D human pose reconstruction from a single image. Occluded joints, ambiguous body parts, and cluttered backgrounds make pose estimation from a single view an inherently ambiguous, non-trivial problem. Researchers have studied many motion- and shading-based cues to reduce this ambiguity and reconstruct the 3D pose. The key idea of our algorithm is to add kinematic and orientation constraints. The first set of constraints is applied when the 3D model is fitted to the input image, pruning configurations that do not fit a plausible human body. The second set is applied when synthesizing new viewpoints from the input viewpoint, so that the input view can be recovered from multiple vantage points. After these constraints are applied, the 3D model is fitted to both the original and the synthesized viewpoints, which further reduces ambiguity. Finally, the final 3D pose is generated by using the unambiguous parts in the synthesized viewpoints to correct the ambiguous parts in the initial viewpoint. We report quantitative experiments on the HumanEva-I dataset and qualitative analysis on unconstrained images from the Image Parse dataset. The results show that the proposed method is robust and can accurately reconstruct a 3D pose from a single image.

1. Introduction

Automatically recovering the 3D pose of the human body from a single image is a very challenging problem in computer vision. Because of the deformation of the articulated human body, the presence of self-occlusion, and the many degrees of freedom, the same person can take a vast number of poses under different environmental constraints, so estimating the articulated structure of the human body from a single image is inherently ambiguous. Solutions would find application in pedestrian detection and tracking, vehicle safety, video tagging, human behavior recognition, and related image-understanding tasks.

Recent methods for recovering 3D pose from 2D images fall into two categories: (1) data-driven techniques and (2) structure-from-motion (SfM) techniques. Data-driven methods predict the 3D pose from observed image values or from detected 2D joint locations [1, 5, 6]. In contrast, SfM methods take 2D joints of the same target across different images and lift them to 3D joints, estimating the camera parameters, bone lengths, and part orientations along the way. Here we integrate the two techniques to combine their strengths and offset their weaknesses. For an input image, we use an existing 2D body part detector to estimate the 2D joint positions. Because such detectors struggle with self-occlusion, we add a prediction step that handles self-occlusion and improves the initial input before 3D pose estimation. Next, we fit the 3D model to the 2D joints, obtaining a highly ambiguous 3D pose, and reduce the ambiguity by adding kinematic and geometric constraints. To resolve the remaining ambiguities, we use Twin-GP regression to predict new viewpoints from the first viewpoint, then fit the 3D model to the original and synthesized viewpoints to estimate the relative depth of each part. Finally, to resolve part-orientation ambiguity, we correct the ambiguous parts in the initial viewpoint using the unambiguous parts in the synthesized viewpoints.
Main contributions of the paper:
1. A 3D pose reconstruction framework built on 2D images for challenging human pose scenarios;
2. A self-occlusion reasoning method that improves the initialization step and the accuracy of 2D pose detection, evaluated on a public dataset;
3. An automatic solution for part-orientation ambiguity, instead of relying on user input as in [18].

2. Background

There is a large body of work on 3D pose reconstruction from 2D images. Our attention is focused on 3D pose prediction using data-driven methods and on structure-from-motion (SfM).
The key components of a data-driven approach are the choice of image descriptor, the output representation, and the prediction stage. In general the procedure is: (1) extract features from the 2D image; (2) estimate the 3D pose with a pre-trained predictor. Descriptors combined with sparse regression, nearest neighbors, and local features such as SIFT have been used to automatically reconstruct 3D poses from 2D images [1, 3]. For example, Agarwal and Triggs use silhouette information as the image descriptor, followed by a sparse regression method (the Relevance Vector Machine), to map extracted silhouettes to 3D poses, and apply this to human body tracking [2]. Bo et al. [6] use various robust descriptors (such as multi-level SIFT features over image blocks) and predict the 3D pose in a Bayesian framework, using a conditional Bayesian mixture-of-experts model to map the observed image values directly to the corresponding 3D joint locations.
Recently, Bo et al. [5] proposed Twin Gaussian Process regression to estimate the 3D pose from HOG and HMAX feature descriptors. One limitation of these methods is that they need a large amount of training data to shape the predictor and to cover the variation in appearance across people and viewpoints; typical experiments with these methods are executed only on laboratory-controlled data. In this paper, we propose a method that reconstructs 3D pose from images/frames in uncontrolled environments. In addition, holistic image descriptors, as used in [5], cannot guarantee that spatial information is captured. Our method overcomes these limitations because part locations are localized in the actual image with a pictorial-structures model [20], which adds strong spatial constraints. Like the most recent techniques, we only need a single image. However, earlier methods that map directly from observed image values to 3D lack robustness and generality: their performance degrades with dynamic backgrounds and ambiguous or occluded body parts. In contrast, our method can accurately reconstruct a 3D pose despite cluttered, changing backgrounds and unconstrained body configurations.
SfM-based methods are also popular: from a series of images/frames, they estimate the 3D pose from corresponding 2D points using factorization. Factorization was first used in [17] to recover 3D structure for rigid bodies; [8] extended it to non-rigid structure by adding constraints on the shape basis. In the interesting work of Wei and Chai [19], the 3D pose of an articulated body is recovered from multiple images of the same subject in different poses; ambiguity is reduced by imposing rigid and non-rigid structural constraints, and camera parameters and bone lengths are estimated within a nonlinear optimization framework. Valmadre and Lucey [18] extended this method, using a factorization-based formulation solved by least squares for the parameters. A basic criticism of these SfM methods is that they need multiple images. In addition, to handle ambiguous and occluded parts, they require manual input from the user. We provide a solution that automatically decodes the direction of ambiguous parts, which in our experiments has a significant positive impact on performance.
Estimating 3D pose from 2D images has also been investigated in other recent work [], where temporal continuity is added to reduce ambiguity; in contrast, we estimate the 3D pose from a single image alone. 3D pose estimation from point correspondences in a single image was investigated in the earlier literature [16]. Recently, Simo-Serra et al. [15] used a similar initialization step (noise-tolerant 2D joints) followed by a different inference scheme: they sample the 3D pose space with Covariance Matrix Adaptation (CMA), whereas our method adds kinematic and orientation constraints. CMA can fall into local minima and produce inaccurate 3D estimates, while in all test scenarios our approach delivers accurate 3D poses.

3. Proposed Method


Figure 1: Process overview (from left). An input image is processed with 2D part detection and self-occlusion inference. Next, multiple synthesized viewpoints are generated from the initial viewpoint. Then the SfM step adds kinematic constraints to reduce ambiguity. Finally, orientation constraints are propagated from the synthesized viewpoints back to the original input viewpoint to generate the 3D pose.

As shown in Figure 1, the proposed algorithm can be divided into three stages: 1) initialization; 2) viewpoint prediction and synthesis; 3) 3D pose estimation. We initialize our pipeline with the state-of-the-art mixture-of-parts detector [20]. Although such detectors are effective at localizing articulated body parts, they still fail when self-occlusion is present. In the initialization step, we therefore add a small but efficient method to overcome the self-occlusion problem (Section 3.1).
Mapping a 3D model to the initial viewpoint produces ambiguous poses. We add geometric and kinematic constraints to reduce the ambiguity of the 3D pose, pruning configurations that are anthropometrically implausible. These constraints alone, however, cannot resolve all part ambiguities, in particular the orientation of a part towards or away from the camera. To resolve the remaining ambiguity, we need additional cues about part orientation. We therefore propose a new inference method that generates synthesized (additional) viewpoints by learning the pose distribution from training data, and finally applies SfM to estimate the relative depth of each part from the corresponding joints in the initial and synthesized viewpoints. This resolves the remaining pose ambiguity, not only in laboratory-controlled settings (such as the HumanEva dataset) but also in difficult, ambiguous scenes such as the IP dataset [12].
3.1 Initialization

Considering the importance of the initialization step, we first propose a new method to handle self-occlusion and thereby improve the final pose estimation result.
Mixture of parts (MOPS): Yang and Ramanan [20] represent the body as a tree-structured mixture-of-parts model for pose estimation, in which each part is modeled by a set of types covering different orientations. Following the notation of [20], a particular pose configuration is scored as:
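The formula itself was not reproduced in the original post; for reference, the mixture-of-parts score of Yang and Ramanan [20] has the standard form

S(I, p, t) = S(t) + \sum_{i \in V} w_i^{t_i} \cdot \phi(I, p_i) + \sum_{(i,j) \in E} w_{ij}^{t_i t_j} \cdot \psi(p_i - p_j)

where S(t) collects the type-compatibility biases and \psi(p_i - p_j) = [\,dx \;\; dx^2 \;\; dy \;\; dy^2\,]^{\top} encodes the relative displacement between connected parts i and j.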

\phi(I, p_i) is the HOG descriptor extracted at location p_i of image I. The first sum scores the appearance at each part location against the pre-defined appearance templates; the second sum encodes the spring relations between adjacent parts. Inference proceeds by maximizing the score over positions p and types t.
Handling self-occlusion in MOPS: In a tree-structured model, a child's local score is passed correctly to its parent. Under (partial or full) occlusion, however, the tree effectively becomes a graph, and scores can be passed to the wrong parent, resulting in missing parts and inaccurate detections, as shown in Figure 2(a).

Figure 2: Sample results of the part-based detector of [21]: (a) without and (b) with self-occlusion handling.

In [11], we proposed a regression-based method for self-occlusion correction; we have observed that detecting occlusion is considerably more tractable than correcting it after the fact. In this paper, we perform occlusion detection within the MOPS framework. MOPS encodes the kinematic structure as a tree and implicitly assumes that non-connected parts are independent, an assumption that breaks down under self-occlusion [14]. To preserve the independence assumption, we detect occlusions from the part scores before inference: when part i is detected poorly or even lost, i.e. occluded by another part, its score at pixel p is driven to negative infinity. Under self-occlusion, the score at position p becomes:
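The equation was likewise omitted from the original post; from the surrounding description, it presumably takes the form

s_i(p) = \begin{cases} -\infty & \text{if part } i \text{ is judged occluded at pixel } p \\ w_i^{t_i} \cdot \phi(I, p_i) & \text{otherwise} \end{cases}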


To localize the occluded pixels, we proceed as follows: for each part i, select the K highest-scoring pixels; take their bounding boxes as candidate part detections; find the maximum overlap between part i and every other part; if the overlap exceeds a threshold τ and the score at position p is smaller than the scores of the pixels around the overlap region, part i is treated as occluded at pixel p. We thus relax the tree structure, allowing springs to be created between non-connected parts where self-occlusion exists, so that the local scores remain independent. We then run the inference procedure of [20] with residual belief propagation to obtain more accurate detections, as in Figure 2(b). In our experiments we empirically set K = 5 and τ = 0.15. Table 1 shows that self-occlusion reasoning improves over the existing best results; for the detailed evaluation protocol, refer to [20].
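The following is a minimal Python sketch of this heuristic, assuming hypothetical helpers for the per-part score maps and candidate bounding boxes (score_maps and box_around are illustrative names, not from the authors' code):

```python
import numpy as np

K = 5      # candidate pixels per part (set empirically in the paper)
TAU = 0.15 # overlap threshold (set empirically in the paper)

def box_overlap(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mark_occlusions(score_maps, box_around):
    """score_maps: list of HxW float score arrays, one per part.
    box_around(i, p): bounding box of part i centred at pixel p
    (hypothetical helper). Suppresses occluded pixels to -inf."""
    # K highest-scoring pixel locations per part
    cand = {i: [np.unravel_index(f, s.shape)
                for f in np.argsort(s, axis=None)[-K:]]
            for i, s in enumerate(score_maps)}
    for i, s in enumerate(score_maps):
        for p in cand[i]:
            box_i = box_around(i, p)
            # maximum overlap between part i and any other part's candidates
            ov = max((box_overlap(box_i, box_around(j, q))
                      for j in cand if j != i for q in cand[j]),
                     default=0.0)
            y, x = p
            nbhd = s[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            # occluded: strong overlap and a locally dominated score
            if ov > TAU and s[p] < nbhd.max():
                s[p] = -np.inf
    return score_maps
```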


Table 1: Effect of self-occlusion handling in MOPS: a small but consistent improvement over the default MOPS. Evaluation uses the Probability of Correct Keypoints (PCK) measure of [21].

3.2 Multi-View Synthesis

To generate a precise 3D pose, we adopt the method of Wei and Chai [19] and fit a 3D template to the 2D joint vector x obtained in the previous step. [19] assumes that at least five 2D images are available and estimates the camera parameters using SfM. Instead, we use only one 2D image, and we assume that the camera's scaling parameter is uniform. To handle the depth ambiguity of the different parts, we propose estimating multiple synthesized viewpoints from the initial viewpoint, which allows us to add new constraints on the orientation space of each bone and thereby reduces the ambiguity of the 3D pose.

3.2.1 Extracting 3D Training Data

In our experiments, all data is collected from the CMU Motion Capture database. For each viewpoint, the dataset consists of 5 frames selected at random from every video sequence. From the 3D joints extracted in each frame, we measure the orientation of the head, then rotate the 3D pose through a full 360 degrees and project the landmark joints onto the 2D plane at each orientation, obtaining the 2D positions of all joints at every viewpoint angle.
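A small Python sketch of this data-generation step, assuming an orthographic projection and a rotation about the vertical axis (both plausible readings of the description):

```python
import numpy as np

def project_viewpoints(joints3d, n_views=16):
    """joints3d: (N, 3) array of 3D joints for one mocap frame.
    Rotates the skeleton about the vertical (y) axis in 360/n_views-degree
    steps (22.5 degrees for 16 views) and orthographically drops the depth
    coordinate, returning an (n_views, N, 2) array of 2D landmarks."""
    views = []
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[ c, 0, s],
                      [ 0, 1, 0],
                      [-s, 0, c]])              # rotation about the y axis
        views.append((joints3d @ R.T)[:, :2])   # keep (x, y), drop z
    return np.stack(views)
```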


Skeleton normalization: Regressing in world coordinates often fails, because human skeletons at different scales perform different actions, and translations vary widely. To achieve invariance to translation and scale, we normalize each viewpoint against a template. The 2D input skeleton is a tree with the hip as root node; the joints are the nodes, and the edge between a parent node and its child represents a bone. Mathematically, it is a skeleton with D joints. Normalization proceeds as follows. First, the hip is taken as the reference node, and every joint x_i is translated so that the hip lies at the origin. Second, the Cartesian coordinates are converted to polar coordinates x_i = (r_i, θ_i), where r_i is the absolute length of the bone between parent node p and child node c, and θ_i is the angle of the bone relative to the horizontal. Third, with reference to a base skeleton x_0, each estimated bone length is rescaled. The benefit of the normalization step is that it removes translation and scale variation, producing input data that better fits a Gaussian distribution.
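A minimal Python sketch of the three normalization steps, assuming the skeleton is given as joint positions plus parent indices with the hip as joint 0; the global rescaling in step 3 is one plausible reading of the template scaling:

```python
import numpy as np

def normalize_skeleton(joints2d, parents, template_lengths):
    """joints2d: (D, 2) joint positions; parents[i]: parent index of joint i
    (parents[0] == -1 for the hip root); template_lengths: bone lengths of
    the base skeleton x_0. Returns per-bone (r, theta) polar coordinates."""
    template_lengths = np.asarray(template_lengths, dtype=float)
    # Step 1: use the hip as reference node, translating it to the origin.
    centred = joints2d - joints2d[0]
    r = np.zeros(len(centred))
    theta = np.zeros(len(centred))
    # Step 2: Cartesian -> polar per bone (length, angle to the horizontal).
    for i in range(1, len(centred)):
        bone = centred[i] - centred[parents[i]]
        r[i] = np.linalg.norm(bone)
        theta[i] = np.arctan2(bone[1], bone[0])
    # Step 3: one global rescaling so bone lengths match the template
    # skeleton (one plausible reading of the scale-transformation step).
    r *= template_lengths[1:].sum() / max(r[1:].sum(), 1e-9)
    return np.stack([r, theta], axis=1)
```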
3.2.2 Multi-View Extension
Normalizing all instances produces N samples at each viewpoint. In this section, we then build a viewpoint-specific regression model from viewpoint i to viewpoint j. In our experiments, we collected data at 16 viewpoints from the CMU database (0-360 degrees, at 22.5-degree intervals). The key idea is to generate a new skeleton from the input instance by regression; for this task we use a cascaded form of Twin-GPR. Finally, we use the learned models to infer new viewpoints from the input pose.
Twin-GPR has recently been shown to outperform traditional regression methods, such as Gaussian process regression and ridge regression, for structured prediction of 3D pose from image observations. Twin-GPR is a multivariate regression method that encodes the correlations in both inputs and outputs. Following [5], we build a regression model that generates a new viewpoint from the input viewpoint.
Given N normalized instances from two consecutive viewpoints i and j, the goal of regression is to predict the distribution of the output vector for a given input vector. Following [5], the prediction for a test vector is obtained by minimizing the divergence between two Gaussian distributions: one over the inputs (Eq. 3 in the paper) and one over the outputs (Eq. 4). In these distributions, z̃ is the normalized target pose estimated from the input vector z; μ_i and μ_j are the mean training poses of viewpoints i and j; a positive semi-definite covariance function encodes the correlation between the training inputs of viewpoint i and the test vector z, and between the training targets of viewpoint j and the estimated target; K is the N×N kernel matrix over the inputs i or the targets j, and k(·) is the N×1 vector of correlations between a test vector and the training set. The difficulty is that the distribution in Eq. 4 depends on the unknown estimate z̃. We therefore minimize the Kullback-Leibler divergence between Eq. 3 and Eq. 4 iteratively with quasi-Newton optimization, initializing z̃ with a ridge regressor and training each output dimension separately.
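Equations 3 and 4 were not reproduced in the original post. As a reconstruction from the Twin-GPR formulation of Bo and Sminchisescu [5] (not copied from this paper), minimizing the KL divergence between the two Gaussians reduces to an objective of roughly this form:

\tilde{z} = \arg\min_{z'} \left[ K_J(z', z') - 2\, k_J(z')^{\top} K_I^{-1} k_I(z) - \eta \log\!\left( K_J(z', z') - k_J(z')^{\top} K_J^{-1} k_J(z') \right) \right], \quad \eta = K_I(z, z) - k_I(z)^{\top} K_I^{-1} k_I(z)

where K_I and K_J are the N×N kernel matrices over the training inputs (viewpoint i) and targets (viewpoint j), and k_I(z), k_J(z') are the N×1 kernel vectors against the test input and the candidate output.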
Cascaded Twin-GPR: Dollár et al. [10] proposed an interesting regression method that approaches the true value progressively in cascaded form. In our framework, we recover multiple other viewpoints from the input viewpoint. A naive approach would be to learn a direct mapping from the input viewpoint to every other viewpoint, but the large number of models to learn increases the complexity of the system. Inspired by [10], we instead learn the viewpoint-specific regression models as a cascaded Twin-GPR problem: a Twin-GPR function Reg maps z_i → z_j, where z_i is the normalized vector of the input pose and z_j is the pose vector at the adjacent viewpoint, and the output of Reg becomes the input of the next iteration. At each step, a viewpoint-specific model is applied. Algorithm 1, run n times, outlines the steps for generating new viewpoints from the input viewpoint.
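A sketch of the cascaded synthesis loop (Algorithm 1), assuming a dictionary of pre-trained adjacent-view Twin-GPR models; reg_models and its predict interface are illustrative names:

```python
def synthesize_viewpoints(z0, view0, reg_models, n_views=16):
    """z0: normalized pose vector at the estimated initial viewpoint view0.
    reg_models[(i, j)]: pre-trained Twin-GPR regressor mapping poses from
    viewpoint i to the adjacent viewpoint j. Chains each prediction into
    the next regressor and returns the pose at every viewpoint."""
    poses = {view0: z0}
    z, view = z0, view0
    for _ in range(n_views - 1):
        nxt = (view + 1) % n_views          # adjacent viewpoint
        z = reg_models[(view, nxt)].predict(z)
        poses[nxt] = z
        view = nxt
    return poses
```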

3.2.3 Initial Viewpoint Estimation
To initialize the cascaded regression process (Alg. 1), we estimate the orientation of the initial viewpoint. Knowing the initial viewpoint of the human pose greatly reduces the ambiguity of 3D pose reconstruction [4]. We estimate the initial viewpoint with a Gaussian mixture model (GMM) [7] in a maximum a posteriori Bayesian framework. The same data used to learn the regression models is used to train the GMM. There are 16 viewpoints (classes) in total, and we model each class with a mixture of components (50 in our experiments, chosen empirically). At inference, the orientation of the initial viewpoint is given by the class with the highest posterior probability.
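A minimal sketch of the viewpoint classifier with scikit-learn's GaussianMixture, fitting one 50-component GMM per viewpoint class and picking the class with the highest likelihood (the per-class-GMM design is one plausible reading of the description):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_view_gmms(poses_by_view, n_components=50):
    """poses_by_view: dict mapping each of the 16 viewpoint labels to an
    (N, D) array of normalized training poses. Fits one GMM per view."""
    return {v: GaussianMixture(n_components=n_components).fit(X)
            for v, X in poses_by_view.items()}

def estimate_initial_view(z, gmms):
    """Return the viewpoint whose GMM gives pose z the highest likelihood."""
    z = np.asarray(z).reshape(1, -1)
    return max(gmms, key=lambda v: gmms[v].score(z))
```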

3.3 3D Pose Estimation

3.3.1 Generating Ambiguous 3D Poses
To estimate the 3D pose, we take the 2D joints of the initial viewpoint and lift them to 3D. The 3D pose is parameterized as the vector v of N 3D joints, which correspond to the 2D input joints. When multiple images are available, 3D pose reconstruction can be cast as solving a linear system; instead, we use only one image and one set of 2D joints. We assume the intrinsic camera matrix A is known. From the known A and the 2D positions u_i, we obtain a 2N×3N projection matrix M relating the 3D joints (in camera coordinates) to their 2D positions, so that all joints satisfy Mv = 0. Solving this equation requires additional constraints: following [19], we learn upper and lower bounds on the bone angles from the training data and impose these kinematic constraints. Figure 1 shows an example of the resulting ambiguous 3D poses.
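A sketch of how M can be assembled, using the standard cross-product form of the perspective constraint (a generic construction consistent with the description, not the paper's exact formulation):

```python
import numpy as np

def build_projection_matrix(uv, A):
    """uv: (N, 2) observed 2D joints; A: 3x3 intrinsic matrix.
    Each 3D joint X_i (camera coordinates) must lie on its viewing ray
    d_i = A^{-1} [u, v, 1]^T, i.e. [d_i]_x X_i = 0, which contributes two
    independent rows. Returns the 2N x 3N block-diagonal matrix M."""
    n = len(uv)
    M = np.zeros((2 * n, 3 * n))
    A_inv = np.linalg.inv(A)
    for i, (u, v) in enumerate(uv):
        d = A_inv @ np.array([u, v, 1.0])
        d_cross = np.array([[0.0, -d[2], d[1]],    # [d]_x matrix
                            [d[2], 0.0, -d[0]],
                            [-d[1], d[0], 0.0]])
        M[2 * i:2 * i + 2, 3 * i:3 * i + 3] = d_cross[:2]
    return M

# The solutions of M v = 0 (one unknown depth per joint) span the null
# space of M; e.g. _, s, Vt = np.linalg.svd(M), then take the rows of Vt
# whose singular values are numerically zero and apply the learned
# bone-angle constraints to select a plausible pose.
```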
3.3.2 Inferring an Unambiguous 3D Pose
To resolve the ambiguity and obtain an accurate 3D pose, we proceed in two steps. As mentioned earlier, SfM-based 3D pose reconstruction fits the 3D model to the corresponding 2D joints in different images by estimating the camera scale, bone lengths, and depths; using only one image implies a camera scale parameter of 1. First, with the help of the synthesized viewpoints, we remove the depth ambiguity of the individual parts. Given the joint correspondences between the input viewpoint and the synthesized viewpoints, our goal is to estimate the bone lengths and the depth of each part. The regression step creates synthesized viewpoints with differing bone scales; since our method works on a single image (a single subject), we can safely constrain the problem by rectifying the corresponding bone lengths across all viewpoints to be consistent with the original input image.
Second, we need to estimate the relative depth of each part. Valmadre and Lucey [18] compute the relative depth of each part by factorization, starting from weak-perspective projections of corresponding 2D points in different images and obtaining the required parameters by minimizing the reconstruction error. Inspired by [18], we use the same factorization over the initial and synthesized viewpoints to estimate the relative depth of each part in the initial viewpoint.
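To make the depth-recovery idea concrete: under weak perspective with scale s, a bone of known length \ell whose projected endpoints differ by (\Delta u, \Delta v) satisfies the classical foreshortening relation

(\Delta Z)^2 = \ell^2 - \frac{\Delta u^2 + \Delta v^2}{s^2}

which fixes the magnitude of the relative depth \Delta Z but not its sign (towards or away from the camera). This is an illustration of the underlying constraint, not the full factorization of [18]; the sign ambiguity is exactly what the next step resolves.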
In many cases, however, ambiguity in the sign of the joint angles remains. Valmadre's method [18] does not resolve poses containing ambiguous parts, so the direction of each ambiguous part (towards or away from the camera) must be specified manually. In our framework, we developed an effective automatic solution to this problem. We first project the 3D model onto the base viewpoint of the image and identify the remaining ambiguous parts, which may still point either forwards or backwards. Re-using the synthesized viewpoints from the first two steps, we project the 3D model onto each synthesized viewpoint; a part that is ambiguous in the 3D model of one viewpoint may be unambiguous in another. We search all unambiguous instances of each ambiguous part g across the 3D poses obtained from the synthesized viewpoints and use them as orientation constraints, iteratively borrowing directions into the 3D pose of the input image until all ambiguous parts are resolved.
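A compact sketch of the direction-borrowing loop, assuming each recovered pose exposes a per-bone depth sign and an ambiguity flag (depth_sign and ambiguous are illustrative names):

```python
def resolve_directions(initial_pose, synthetic_poses):
    """initial_pose / synthetic_poses expose, per bone g, depth_sign[g] in
    {-1, +1} and ambiguous[g] (True if the sign of the relative depth could
    not be determined in that view). Borrows the signs of unambiguous bones
    from the synthetic views, adding one view at a time until resolved."""
    for pose in synthetic_poses:
        for g, is_amb in initial_pose.ambiguous.items():
            if is_amb and not pose.ambiguous[g]:
                initial_pose.depth_sign[g] = pose.depth_sign[g]
                initial_pose.ambiguous[g] = False
        if not any(initial_pose.ambiguous.values()):
            break   # all bone directions fixed; stop adding viewpoints
    return initial_pose
```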
For some images this step needs only two or three synthesized viewpoints, while others require up to n: we keep adding viewpoints until all ambiguous parts are resolved, then stop. If a part still receives two or more conflicting directions, it remains ambiguous. A further benefit of applying SfM after recovering multiple viewpoints is that it filters out the noisy predictions introduced by the regression process, improving the final 3D pose.
4. Experiments

In different experiments, we measured the performance of the proposed method for recovering 3D pose from a single image, using both qualitative and quantitative evaluation.
4.1 Training Data

All training data, for both the cascaded Twin-GPR and the GMM, is obtained from the CMU mocap database. From all available motion sequences, we randomly select 5 frames per sequence, 14,229 frames in total. For each frame, we extract 16 viewpoints by rotating the 3D skeleton. We evaluate on two datasets: HumanEva-I for quantitative testing and the IP dataset for qualitative testing.
4.2 Quantitative Evaluation on HumanEva-I

We tested our algorithm on the walking and jogging actions of HumanEva-I [13]. Testing on the validation sequences shows the robustness of our method in recovering 3D pose; since the regression models are trained on sequences extracted from CMU, this also demonstrates the generalization ability of our algorithm.
Table 2 shows the numerical results and comparisons with existing methods. Following [15], we ran our experiments on the same sequences and evaluation protocol. Mean prediction errors and standard deviations are in mm. Our method reports absolute error, as do [4] and [9], whereas [5] and [15] report relative error. The method closest to ours is that of Simo-Serra et al. [15]; both methods are initialized from noisy 2D observations. [4] and [9] add temporal continuity constraints to remove ambiguity, which requires multiple images; in contrast, we estimate the 3D pose from a single image. Our method outperforms all others except [5], which relies on the strong assumption of background subtraction and therefore cannot easily handle changing backgrounds; on the contrary, we test on images with varied, cluttered backgrounds without prior background extraction. Furthermore, [5] draws its training, validation, and test sequences all from HumanEva-I, whereas we train the regression models on the CMU database and test on HumanEva-I, showing good generalization ability.

Table 2: Comparison of algorithms on HumanEva-I. Values are mean joint errors in mm against the ground truth, with standard deviations in parentheses. [4] and [9] provide no jogging results. [5] requires prior background subtraction.
In the initialization step, we proposed a solution to overlapping and missing parts under self-occlusion, namely breaking the springs between non-connected joints. We are aware, however, that the problem persists to some degree, and a more robust technique is needed to reduce noise in the observations. Inspired by [15], rigidly aligning the reconstructed pose with the observations further reduces the reconstruction error: in our experiments, the mean reconstruction error is about 200 mm absolute, and about 90 mm after alignment. Note that most of the error stems from deviations in the 2D joint output of the initialization step.
Regarding computation time, 3D pose estimation takes about one minute per image, including the time needed to obtain the initial 2D pose.

4.3 Qualitative Evaluation

To verify the robustness of our algorithm, we conducted two experiments on ambiguous images with high degrees of freedom and severe self-occlusion. Since the 3D ground truth of these images cannot be obtained, we show qualitative visual comparisons.
In the first experiment, we visually compared our method with Valmadre's [18]. For both techniques, the 2D joints are manually labeled for initialization. [18] uses multiple different images to recover a 3D pose, while our method uses only a single image. In addition, Valmadre's method cannot remove all ambiguity, in particular the sign of the joint angles, whose directions (positive/negative) must be specified manually. In our method, the signs of unambiguous parts shared across the synthesized viewpoints resolve this ambiguity in most cases.
Figure 3(b, c) shows the 3D output of the method in [18] and of our algorithm. The motivation for this comparison is to show the benefit of applying SfM after recovering multiple viewpoints from the initial viewpoint: the factorization filters noise from the regression predictions, reducing the ambiguity of the final joints.

Figure 3 (left): qualitative comparison. (a) Input image. (b) 3D pose recovered by Valmadre [18] using multiple different images. (c) Result of the proposed method, initialized from 2D joints in a single image. The 3D poses are normalized and centered at the origin.

Figure 4 (right): visual comparison of the final 3D pose estimate (a) without and (b) with self-occlusion handling. In (a), self-occlusion causes an incorrect initialization that propagates to the final 3D pose. In (b), the initialization is accurate, yielding a precise 3D pose estimate.

In the second experiment (Figure 4), we assessed the impact of the proposed self-occlusion handling, using images from the IP dataset [12]. Figure 4(a) shows the initialization result using the mixture-of-parts model alone; Figure 4(b) shows the output for the same image with the self-occlusion handling mechanism. Visually, self-occlusion handling improves the initialization accuracy and prevents errors from propagating to the synthesized viewpoints and on to the final 3D pose.
5. Conclusion

We propose a 3D pose reconstruction algorithm that works from a single 2D image. In the initialization phase, we use a well-known 2D part detector to generate 2D joints, and we propose a new self-occlusion handling method to improve the output of this step. To add more constraints, we generate synthesized viewpoints by regressing from the initial viewpoint to multiple viewpoints. By fitting a 3D model to the poses obtained from the initial and synthesized viewpoints, kinematic and orientation constraints are added to reduce ambiguity. Experiments show that the proposed algorithm is promising. However, noise in the observations still affects the accuracy of the final 3D pose. Future work will provide a more robust way to handle self-occlusion and testing in less constrained environments.




