Learning from simulated and unsupervised images through adversarial training: refining synthetic images for training


A summary of the paper by Ashish Shrivastava et al., "Learning from Simulated and Unsupervised Images through Adversarial Training".

Training a model on synthetic images is attractive because the annotations come for free, without expensive labeling. However, models trained on synthetic data often perform poorly because the distribution of synthetic images differs from that of real images. "Simulated + unsupervised" (S+U) learning addresses this: it keeps the annotation information provided by the simulator and uses unlabeled real data to improve the realism of the simulator's output. In the S+U learning method, the input to the adversarial network is a synthetic image rather than a random vector. The standard GAN is modified to preserve annotations, avoid artifacts, and stabilize training: (i) a self-regularization term, (ii) a local adversarial loss, and (iii) updating the discriminator using a history of refined images. Generalization to real images is demonstrated qualitatively and with a user study showing how lifelike the generated images are. Models are trained for gaze estimation and hand pose estimation to evaluate the generated images quantitatively.

1. Introduction

Labeling large datasets is expensive and time-consuming, whereas synthetic data comes with annotations automatically. Synthetic data has been used for hand pose estimation with the Kinect and in several other recent tasks.
Learning from synthetic images is problematic because of the gap between synthetic and real images: synthetic data is usually not realistic enough, so the network learns details specific to synthetic images and generalizes poorly to real images.
One solution is to improve the simulator, but increasing fidelity is computationally expensive, designing the renderer is laborious, and even a top-quality renderer may fail to model all the characteristics of real images, so the model can still overfit to "unrealistic" details of the synthetic images. S+U learning should preserve the annotation information needed to train machine learning models, e.g. the gaze direction in Figure 1.

The S+U learning method (SimGAN) refines synthetic images with a refiner network, as outlined in Figure 2: a black-box simulator generates synthetic images, which are then refined by the refiner network. (i) To increase realism, an adversarial network similar to GANs is trained with an adversarial loss so that a discriminator network cannot distinguish refined images from real images. (ii) To preserve the annotations of the synthetic images, the adversarial loss is complemented with a self-regularization loss that penalizes large changes between the synthetic image and the refined image. A fully convolutional network is used so that pixels are modified locally and the global structure is preserved (rather than the image content being changed entirely, as with a fully connected encoder network). (iii) The GAN framework trains two networks with competing objectives, which makes training unstable and tends to introduce artifacts. Therefore, the receptive field of the discriminator is restricted to local regions (not the whole image), giving several local adversarial losses per image, and training is further stabilized by updating the discriminator with a history of refined images (rather than only the current refiner outputs).

2. S+U learning with SimGAN

The goal of S+U learning is to use a set of unlabeled real images y_i ∈ Y to learn a refiner R_θ(x), with parameters θ, that refines a synthetic image x. Let x̃ denote the refined image:
x̃ := R_θ(x)
S+U learning requires that the refined image x̃ look like a real image while preserving the annotation information from the simulator.
To this end, θ is learned by minimizing a combination of two losses:
L_R(θ) = Σ_i l_real(θ; x̃_i, Y) + λ l_reg(θ; x̃_i, x_i).   (1)
Here x_i is the i-th synthetic training image and x̃_i is the corresponding refined image. The first term, l_real, adds realism to the synthetic image, while the second term, l_reg, preserves the annotation information by minimizing the difference between the synthetic and refined images.

2.1 Adversarial loss with self-regularization

An ideal refiner would make its output impossible to classify as real or refined. Therefore an adversarial discriminator network D_φ is trained to classify images as real vs. refined, where φ are the parameters of the discriminator network. The refiner network R is trained with an adversarial loss to "fool" D into misclassifying refined images as real. Following the GAN approach, this is a two-player minimax game, and the refiner network R_θ and the discriminator network D_φ are updated alternately.
The parameters of the discriminator network are updated by minimizing the following loss:
L_D(φ) = −Σ_i log(D_φ(x̃_i)) − Σ_j log(1 − D_φ(y_j)).   (2)

The discriminator should recognize real images as not synthetic: D_φ(y_j) ↓, 1 − D_φ(y_j) ↑, so −Σ_j log(1 − D_φ(y_j)) ↓;
and it should classify refined images as synthetic: D_φ(x̃_i) ↑, so −Σ_i log(D_φ(x̃_i)) ↓.

This is equivalent to the cross-entropy loss of a two-class classification problem, where D_φ(·) is the probability that the input is a (refined) synthetic image and 1 − D_φ(·) is the probability that it is a real image. D_φ is implemented as a convolutional network whose last layer outputs the probability that the sample is a refined image. When training the discriminator, each mini-batch consists of randomly sampled refined synthetic images x̃_i and real images y_j. The target label in the cross-entropy loss layer is 0 for each y_j and 1 for each x̃_i. The discriminator parameters are updated with a stochastic gradient descent (SGD) step on the mini-batch gradient of this loss.
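
As a concrete illustration, below is a minimal PyTorch sketch of this discriminator update, assuming a module D that outputs the probability that its input is a refined image (target 1 for refined, 0 for real, as above); the function name and the use of a per-batch mean inside the cross-entropy are our assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def discriminator_loss(D, refined, real):
    # L_D(phi) = -sum_i log D(x~_i) - sum_j log(1 - D(y_j)), written as
    # binary cross-entropy with target 1 for refined and 0 for real images.
    p_refined = D(refined)                                    # should go towards 1
    p_real = D(real)                                          # should go towards 0
    loss_refined = F.binary_cross_entropy(p_refined, torch.ones_like(p_refined))
    loss_real = F.binary_cross_entropy(p_real, torch.zeros_like(p_real))
    return loss_refined + loss_real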

The realism loss l_real in equation (1) uses the trained discriminator D:
l_real(θ; x̃_i, Y) = −log(1 − D_φ(x̃_i)) = −log(1 − D_φ(R_θ(x_i))).   (3)

The refiner should make it hard for the discriminator to classify refined images as synthetic: D_φ(R_θ(x_i)) ↓, so −Σ_i log(1 − D_φ(R_θ(x_i))) ↓.

This loss is minimized. Besides generating lifelike images, the refiner network should preserve the simulator's annotation information. For gaze estimation, the learned transformation should not change the gaze direction; for hand pose estimation, it should not change the joint positions.

This is what allows the refined images, together with the annotation information, to be used for training machine learning models. To this end, a self-regularization loss is proposed that minimizes the per-pixel difference between the synthetic image and the refined image. The overall loss function in equation (1) then becomes:
L_R(θ) = −Σ_i log(1 − D_φ(R_θ(x_i))) + λ ||R_θ(x_i) − x_i||_1.   (4)
where ||·||_1 is the L1 norm. R_θ is a fully convolutional neural network without striding or pooling; it modifies the synthetic image at the pixel level rather than changing the image content entirely (as a fully connected encoder network would), which preserves the global structure and the annotations.
L_R(θ) and L_D(φ) are minimized alternately to learn the refiner and discriminator parameters: when updating R_θ, φ is kept fixed; when updating D_φ, θ is kept fixed.
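
A minimal sketch of the refiner objective in equation (4) and of the alternating update, again in PyTorch; R, D, the optimizers and the default lambda_reg value are assumed names/values, and discriminator_loss is the helper from the sketch in section 2.1.

import torch
import torch.nn.functional as F

def refiner_loss(R, D, synthetic, lambda_reg=0.1):   # lambda_reg: assumed value
    refined = R(synthetic)
    p_refined = D(refined)
    # Adversarial term -log(1 - D(R(x))): push D's output on refined images towards 0 ("real").
    adv = F.binary_cross_entropy(p_refined, torch.zeros_like(p_refined))
    # Self-regularization: L1 distance between the refined output and the synthetic input.
    reg = F.l1_loss(refined, synthetic)
    return adv + lambda_reg * reg, refined

def alternating_step(R, D, opt_R, opt_D, synthetic, real, lambda_reg=0.1):
    # Update R_theta with phi fixed.
    loss_R, refined = refiner_loss(R, D, synthetic, lambda_reg)
    opt_R.zero_grad(); loss_R.backward(); opt_R.step()
    # Update D_phi with theta fixed (detach so no gradient flows into R).
    loss_D = discriminator_loss(D, refined.detach(), real)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()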

2.2 Local adversarial loss

Another requirement is that the refiner network learn to model the characteristics of real images without introducing artifacts: when a single strong discriminator network is trained, the refiner network tends to over-emphasize particular image features in order to fool the current discriminator. The key observation is that any local patch sampled from a refined image should have statistics similar to those of the corresponding patches in real images. Therefore, a discriminator network is defined that classifies all local patches of the image separately (rather than a global discriminator network). This limits the receptive field (and hence the capacity) of the discriminator network, provides many samples per image for learning the discriminator, and trains the refiner network better (multiple "realism losses" per image).

Here, the discriminator D is designed to output a w×h probability map indicating, for each local patch, whether the input patch comes from a refined image, where w×h is the number of local patches in the image. When training the refiner network, the adversarial loss is the sum of the cross-entropy losses over the w×h local patches, see Figure 3.
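
A sketch of this local adversarial loss, assuming the fully convolutional discriminator returns a (batch, 2, w, h) map of per-patch class logits (channel 0 = real, channel 1 = refined); the function name and the label convention are our assumptions.

import torch
import torch.nn.functional as F

def local_adversarial_loss(patch_logits, is_refined):
    # patch_logits: (B, 2, w, h) per-patch logits; is_refined: whether the batch
    # consists of refined images (target class 1) or real images (class 0).
    b, _, w, h = patch_logits.shape
    target = torch.full((b, w, h), 1 if is_refined else 0, dtype=torch.long)
    # Cross-entropy per local patch, summed over the w x h map.
    return F.cross_entropy(patch_logits, target, reduction='sum')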

2.3 Updating the discriminator with a history of refined images

Another problem with adversarial training is that the discriminator network only looks at the refined images from the most recent training step. This can cause (i) divergence of the adversarial training, and (ii) the refiner network re-introducing artifacts that the discriminator has forgotten about.
Any refined image generated by the refiner network at any point during training is a synthetic image from the discriminator's point of view, so the discriminator should be able to classify all such images as synthetic. Based on this observation, training stability is improved by updating the discriminator with a history of refined images rather than only those from the current mini-batch. Algorithm 1 is modified to keep a buffer of refined images generated by previous versions of the refiner network. Let B be the size of the buffer and b the mini-batch size.

At each iteration of discriminator training, the parameters φ are updated with b/2 images sampled from the current refiner network and b/2 images sampled from the buffer. The buffer size B is kept fixed; after each iteration, b/2 randomly chosen images in the buffer are replaced with newly generated refined images, as shown in Figure 4.
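
A sketch of this refined-image history buffer, assuming a fixed buffer size B and mini-batch size b; the class name and the filling policy before the buffer is full are our assumptions.

import random
import torch

class ImageHistoryBuffer:
    def __init__(self, max_size):
        self.max_size = max_size          # B: fixed buffer size
        self.images = []                  # previously generated refined images

    def sample_and_update(self, refined_batch):
        b = refined_batch.size(0)
        half = b // 2
        current = list(refined_batch.detach())
        if len(self.images) < half:
            # Buffer still filling: store images, train on the current batch only.
            self.images.extend(current[: self.max_size - len(self.images)])
            return refined_batch.detach()
        # b/2 images from the current refiner plus b/2 images from the history.
        history = random.sample(self.images, half)
        mixed = torch.stack(current[:half] + history)
        # Replace b/2 randomly chosen buffer entries with newly refined images.
        for slot, img in zip(random.sample(range(len(self.images)), half), current[half:]):
            self.images[slot] = img
        return mixed

The mixed mini-batch returned here is what the discriminator update of section 2.1 is run on.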

3. Experiments

The method is evaluated on the MPIIGaze dataset and on depth images from the NYU hand pose dataset. All experiments use a fully convolutional refiner network with ResNet blocks, see Figure 6.

3.1 Gaze estimation

Estimating gaze direction from eye images is challenging, especially with low-quality images from laptop or mobile phone cameras. Even for humans, labeling an eye image with a gaze-direction vector is difficult. To obtain large amounts of annotated data, recent work has trained models on large amounts of synthetic data. Here, refining the synthetic images with SimGAN significantly improves performance on this task.

The gaze estimation data consists of 1.2M synthetic images generated with the UnityEyes eye-gaze synthesizer and 214K real images from the MPIIGaze dataset, as shown in Figure 5.

3.1.1 Qualitative Results

SimGAN successfully captures the skin texture, sensor noise, and appearance of the iris region in real images. Note that the method improves realism while preserving the annotation information (gaze direction).

3.1.2 Visual Turing test

To quantitatively evaluate the visual quality of the refined images, a simple user study was designed in which subjects classify images as real or refined synthetic.
Each subject was shown a random selection of 50 real and 50 refined images in random order, with 20 images displayed continuously at a time. Overall, 10 subjects chose the correct label 517 times out of 1000 ((50+50) × 10) trials (p = 0.148), barely better than chance. Table 1 shows the confusion matrix.
In contrast, when each subject was shown 10 real and 10 original synthetic images, the correct label was chosen 162 times out of 200 ((10+10) × 10) trials (p ≤ 10⁻⁸), much better than chance.

H0: μ ≤ 0.5; H1: μ > 0.5. The two p-values computed below are 0.148344675387 and 9.92185044371e−20.

from scipy import stats
print(stats.binom_test(517, 1000, 0.5, alternative='greater'))
print(stats.binom_test(162, 200, 0.5, alternative='greater'))
3.1.3 Quantitative Results

A convolutional network is trained to predict the eye gaze direction (encoded as a 3-D vector [x, y, z], with an L2 loss), training on UnityEyes and testing on MPIIGaze. Figure 7 and Table 2 compare the results of a convolutional network trained on synthetic data with one trained on refined synthetic data (the SimGAN output). Training on the SimGAN output improves the result by 22.3%.

Table 3 compares with the state of the art. The convolutional network trained on refined images outperforms the previous best result on the MPIIGaze dataset by 21%.

3.1.4 Implementation details

The refiner network R_θ is a residual network (ResNet). Each ResNet block contains 2 convolutional layers with 64 feature maps each, as shown in Figure 6.
The 55×35 input image is convolved with 3×3 filters, producing 64 feature maps. The output passes through 4 ResNet blocks. The output of the last ResNet block then goes through a 1×1 convolutional layer that produces 1 feature map, the refined synthetic image.
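
A sketch of this refiner architecture in PyTorch, following the description above (3×3 convolution to 64 feature maps, 4 ResNet blocks of two 3×3 convolutions each, and a final 1×1 convolution); the padding, the ReLU placement, and the absence of an output nonlinearity are our assumptions.

import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.convs(x))      # identity shortcut

class Refiner(nn.Module):
    def __init__(self, in_channels=1, num_blocks=4):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 64, 3, padding=1)             # 3x3 conv -> 64 maps
        self.blocks = nn.Sequential(*[ResnetBlock(64) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(64, in_channels, 1)                        # 1x1 conv -> refined image

    def forward(self, x):
        return self.tail(self.blocks(self.head(x)))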

The discriminator network D_φ contains 5 convolutional layers and 1 max-pooling layer, as follows:
(1) Conv 3×3, stride=2, feature maps=96
(2) Conv 3×3, stride=2, feature maps=64
(3) MaxPool 3×3, stride=1
(4) Conv 3×3, stride=1, feature maps=32
(5) Conv 1×1, stride=1, feature maps=32
(6) Conv 1×1, stride=1, feature maps=2
(7) Softmax
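
A sketch of this layer list in PyTorch, assuming grayscale input and ReLU activations between layers (activations are not stated above); the softmax in (7) is folded into the cross-entropy of the local adversarial loss sketched in section 2.2, so the module returns per-patch logits.

import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 96, 3, stride=2), nn.ReLU(inplace=True),   # (1)
            nn.Conv2d(96, 64, 3, stride=2), nn.ReLU(inplace=True),            # (2)
            nn.MaxPool2d(3, stride=1),                                        # (3)
            nn.Conv2d(64, 32, 3, stride=1), nn.ReLU(inplace=True),            # (4)
            nn.Conv2d(32, 32, 1, stride=1), nn.ReLU(inplace=True),            # (5)
            nn.Conv2d(32, 2, 1, stride=1),                                    # (6)
        )

    def forward(self, x):
        return self.net(x)   # (B, 2, w, h) per-patch logits; softmax (7) is applied in the loss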

The adversarial networks are fully convolutional and designed so that the receptive fields of the last-layer neurons in R_θ and D_φ are similar. The R_θ network is first trained with only the self-regularization loss for 1,000 steps, and D_φ for 200 steps; then, for every update of D_φ, R_θ is updated twice, i.e. in Algorithm 1, Kd is set to 1 and Kg is set to 50.

Note: the refiner and discriminator networks are first pre-trained separately and then trained jointly. "Updating R_θ twice per D_φ update" would suggest Kg = 2, but the stated value is 50.
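
A sketch of this schedule as a training loop, reusing the helpers from the earlier sketches (refiner_loss, discriminator_loss, ImageHistoryBuffer); the pre-training phase and data-loader restarting are omitted, and the default k_g / k_d values simply mirror the ambiguity noted above.

def train_simgan(R, D, opt_R, opt_D, synth_loader, real_loader, buffer,
                 iterations=10000, k_g=2, k_d=1, lambda_reg=0.1):
    # Assumes the loaders yield plain image tensors.
    synth_iter, real_iter = iter(synth_loader), iter(real_loader)
    for _ in range(iterations):
        for _ in range(k_g):                      # Kg refiner updates (D fixed)
            synthetic = next(synth_iter)
            loss_R, refined = refiner_loss(R, D, synthetic, lambda_reg)
            opt_R.zero_grad(); loss_R.backward(); opt_R.step()
        for _ in range(k_d):                      # Kd discriminator updates (R fixed)
            real = next(real_iter)
            mixed = buffer.sample_and_update(refined.detach())   # half from history
            loss_D = discriminator_loss(D, mixed, real)
            opt_D.zero_grad(); loss_D.backward(); opt_D.step()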

The input to the eye gaze estimation network is a 35×55 grayscale image, which passes through 5 convolutional layers and 3 fully connected layers, the last of which encodes the 3-D gaze vector:
(1) Conv 3×3, feature maps=32
(2) Conv 3×3, feature maps=32
(3) Conv 3×3, feature maps=64
(4) MaxPool 3×3, stride=2
(5) Conv 3×3, feature maps=80
(6) Conv 3×3, feature maps=192
(7) MaxPool 2×2, stride=2
(8) FC 9600
(9) FC 1000
(10) FC 3
(11) Euclidean loss
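
A sketch of this gaze estimation network in PyTorch; the activations, the 'same' padding, and the use of nn.LazyLinear to infer the flattened input size of the FC 9600 layer are our assumptions, and training would minimize the Euclidean loss in (11), e.g. via nn.MSELoss(), against the 3-D gaze vector.

import torch.nn as nn

class GazeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),     # (1)
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),    # (2)
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),    # (3)
            nn.MaxPool2d(3, stride=2),                                 # (4)
            nn.Conv2d(64, 80, 3, padding=1), nn.ReLU(inplace=True),    # (5)
            nn.Conv2d(80, 192, 3, padding=1), nn.ReLU(inplace=True),   # (6)
            nn.MaxPool2d(2, stride=2),                                 # (7)
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(9600), nn.ReLU(inplace=True),                # (8) FC 9600
            nn.Linear(9600, 1000), nn.ReLU(inplace=True),              # (9) FC 1000
            nn.Linear(1000, 3),                                        # (10) FC 3: gaze vector
        )

    def forward(self, x):          # x: (B, 1, 35, 55) grayscale eye image
        return self.regressor(self.features(x))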

All networks are trained with a constant learning rate of 0.001 and a mini-batch size of 512 until the validation error converges.

3.2 Hand pose estimation from depth images

The NYU hand pose dataset consists of 72,757 training frames and 8,251 test frames captured by 3 Kinect cameras (1 frontal view, 2 side views). Each depth frame is annotated with hand pose information, which is used to create a synthetic depth image. Figure 10 shows one of the frames. For preprocessing, the hand pixel region is cropped from the real image using the synthetic image, and scaled to 224×224 for the convolutional network. The background depth values are set to 0, and the foreground depth values are set to the original depth minus 2000 (assuming the background is 2000 mm from the camera).
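
A minimal NumPy/OpenCV sketch of this preprocessing, assuming the synthetic depth frame can serve as a mask for locating the hand (its nonzero pixels) and that cv2.resize handles the scaling; the function name and the bounding-box heuristic are our assumptions.

import numpy as np
import cv2

def preprocess_depth(real_depth, synth_depth, background_mm=2000):
    # Crop the hand region from the real frame using the synthetic frame as a mask.
    ys, xs = np.nonzero(synth_depth > 0)
    crop = real_depth[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(np.float32)
    # Scale the crop to the 224x224 input size of the convolutional network.
    crop = cv2.resize(crop, (224, 224), interpolation=cv2.INTER_NEAREST)
    out = np.zeros_like(crop)
    fg = crop > 0                           # background depth values stay 0
    out[fg] = crop[fg] - background_mm      # foreground: original depth minus 2000
    return out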

3.2.1 Qualitative Results

Figure 11 shows sample SimGAN outputs on the NYU hand pose dataset. The noise in real depth images mainly comes from depth discontinuities at the edges. Without using any annotation of the real images, SimGAN learns to model this noise and makes the synthetic images considerably more lifelike.

3.2.2 Quantitative Results

A fully convolutional hand pose estimation CNN, similar to the stacked hourglass network, is trained separately on real images, synthetic images, and refined synthetic images from the NYU hand pose training set; all networks are evaluated on the NYU hand pose test set, which consists entirely of real images.
Figure 12 and Table 4 show the quantitative results on the NYU hand pose dataset.

Training on the refined synthetic data (the SimGAN output) requires no annotation of real images, yet outperforms training on supervised real images by 8.8%, and the benefit grows as the amount of synthetic training data increases.

3.2.3 Implementation details

The refiner network architecture is the same as for eye gaze estimation, except that the input images are 224×224, the first filter size is 7×7, and 10 ResNet blocks are used.

The discriminator network D_φ is:
(1) Conv 7×7, stride=4, feature maps=96
(2) Conv 5×5, stride=2, feature maps=64
(3) MaxPool 3×3, stride=2
(4) Conv 3×3, stride=2, feature maps=32
(5) Conv 1×1, stride=1, feature maps=32
(6) Conv 1×1, stride=1, feature maps=2
(7) Softmax

The R_θ network is first trained with only the self-regularization loss for 500 steps, and D_φ for 200 steps; then, for every update of D_φ, R_θ is updated twice, i.e. in Algorithm 1, Kd is set to 1 and Kg is set to 2.
The hand pose estimation network uses 2 hourglass modules and outputs 64×64 heat maps. During training, the data is augmented with random [−20, 20] degree rotations and random crops. All networks are trained until the validation error converges.

3.3 Analysis of the modifications to adversarial training

Training with the local adversarial loss is compared to training with a global adversarial loss. The local adversarial loss removes artifacts and makes the generated images more realistic, as shown in Figure 8.

Using the history of refined images is compared to standard adversarial training for gaze estimation, as shown in Figure 9. The refined-image buffer prevents the severe artifacts that standard training introduces, for example around the corner of the eye.

4. Summary

The main points of this article: synthetic images come with labels automatically, whereas annotating large numbers of real images is expensive; a simulator generates synthetic images, and a refiner network refines them so that they approach real images while preserving the annotation information; on a real-image test set, the model trained on the refined synthetic images performs better than a model trained on the original images.

This text has not been fully proofread; questions and corrections are welcome ~ (๑ ̀ㅂ ́) و✧
