0. Background
Junbo Zhao the "energy-based Gan" network, which treats the discriminant as an energy function without the need for a significant probability interpretation, the function can be a training loss function. The energy function is to treat the area close to the real data manifold as a low-energy region, and away from it as a high-energy region. Similar to "Probability Gan", in training, the generator will generate as much of the minimum energy as possible when the counterfeit samples, and the discriminant will be assigned to high-energy (because it is forged). By seeing the discriminant as an energy function, more extensive network structures and loss functions can be used, not just the two value classifier of the logistic output. Among them, Junbo Zhao and other people based on this principle, put forward one of the implementation of the case, that is, the automatic encoder structure, energy is the error of the reconstruction time, so as to replace the classifier. And at this point the training is more stable than conventional GAN. and proposes a single-scale structure that can be trained to generate high-resolution images.
As early as 2006, LeCun argued that the essence of an energy-based model is a function that maps each point of a multi-dimensional input space to a scalar, called the "energy" (see the "golf course" illustration of an energy surface). Learning is a data-driven process that shapes this energy surface so that, for a suitable parameter setting, desired regions have low energy and undesired regions have high energy. Supervised learning is one example: for each sample x in the training set, the energy of (x, y) is low when y is the correct label and high when y is incorrect. Similarly, when modeling x alone (unsupervised), the energy is low on the data manifold. A contrastive sample here denotes a data point that is pushed towards higher energy, i.e. an incorrectly labeled point in the supervised case, or a point in a low-density region of the data in the unsupervised case.
A GAN can be interpreted in two ways:
- With the generator as the main component: the discriminator plays the role of a trainable objective function. Assume the data lies on a manifold; when a generated sample is recognized as lying off the manifold, the discriminator penalizes it and provides a gradient indicating how the generator should modify its output to move closer to the manifold. In this view, the discriminator is a method for training the generator to produce reasonable outputs;
- With the discriminator as the main component: the generator is trained to produce contrastive samples. By iteratively feeding increasingly refined contrastive samples to the discriminator, the generator improves the discriminator's performance, for example in semi-supervised learning.
1. EBGAN
1.1 An energy-based GAN network structure
The discriminator's output is interpreted, via the objective function, as an energy: real data samples are assigned low energy and forged samples high energy. A margin loss is used here as one possible choice among many, and, as in a probabilistic GAN, the generator and the discriminator optimize different objectives. Given a positive margin \(m\), a data sample \(x\) and a generated sample \(G(z)\), the discriminator loss \(L_D\) and the generator loss \(L_G\) are:

\(L_D(x, z) = D(x) + [m - D(G(z))]^+\)

\(L_G(z) = D(G(z))\)

where \([\cdot]^+ = \max(0, \cdot)\).

Minimizing \(L_G\) with respect to the parameters of G is similar to maximizing the second term of \(L_D\): the two have the same minimum, but \(L_G\) still provides non-zero gradients when \(D(G(z)) \geq m\), where the hinge term \([m - D(G(z))]^+\) is zero and therefore gives no gradient.
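For concreteness, here is a minimal PyTorch sketch of the two losses, assuming `energy_real = D(x)` and `energy_fake = D(G(z))` are non-negative per-sample energies (e.g. reconstruction errors) and `margin` is \(m\); the function names are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(energy_real: torch.Tensor, energy_fake: torch.Tensor, margin: float) -> torch.Tensor:
    # L_D(x, z) = D(x) + [m - D(G(z))]^+ :
    # push the energy of real samples down, and push the energy of fake samples up,
    # but only while it is still below the margin m.
    return energy_real.mean() + F.relu(margin - energy_fake).mean()

def generator_loss(energy_fake: torch.Tensor) -> torch.Tensor:
    # L_G(z) = D(G(z)) : the generator tries to produce samples
    # to which the discriminator assigns low energy.
    return energy_fake.mean()
```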
Given a generator G, let \(p_G\) denote the density of \(G(z)\) with \(z \sim p_z\); in other words, \(p_G\) is the density of the samples produced by G. Define \(V(G, D) = \int_{x,z} L_D(x, z)\, p_{data}(x)\, p_z(z)\, dx\, dz\) and \(U(G, D) = \int_z L_G(z)\, p_z(z)\, dz\). The discriminator D is trained to minimize \(V\) and the generator G to minimize \(U\). A Nash equilibrium is then a pair of optimal solutions \((G^*, D^*)\) satisfying \(V(G^*, D^*) \leq V(G^*, D)\) for every D and \(U(G^*, D^*) \leq U(G, D^*)\) for every G.
From the above setup, two conclusions follow (stated below).
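For reference, the two results, as stated in the EBGAN paper (its Theorems 1 and 2), are:
- If \((D^*, G^*)\) is a Nash equilibrium of this system, then \(p_{G^*} = p_{data}\) almost everywhere, and \(V(D^*, G^*) = m\).
- A Nash equilibrium of this system exists, and is characterized by (a) \(p_{G^*} = p_{data}\) almost everywhere and (b) the existence of a constant \(\gamma \in [0, m]\) such that \(D^*(x) = \gamma\) almost everywhere.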
1.2 An implementation of 1.1 (a regularized auto-encoder as the discriminator)
In the paper, the authors present an implementation of the energy-based GAN framework in which the discriminator is an auto-encoder and the energy is the reconstruction error, \(D(x) = \|Dec(Enc(x)) - x\|\):
Figure 1.2.1: EBGAN implementation with an auto-encoder as the discriminator
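A minimal PyTorch sketch of such a discriminator, assuming flattened MNIST-like inputs; the layer sizes (`in_dim`, `latent_dim`) are placeholders rather than the paper's architecture, and a per-sample mean squared error stands in for the paper's \(\|Dec(Enc(x)) - x\|\):

```python
import torch
import torch.nn as nn

class AEDiscriminator(nn.Module):
    """Auto-encoder discriminator: the energy is the reconstruction error."""

    def __init__(self, in_dim: int = 784, latent_dim: int = 128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.dec = nn.Linear(latent_dim, in_dim)

    def forward(self, x: torch.Tensor):
        latent = self.enc(x)                                # Enc(x)
        recon = self.dec(latent)                            # Dec(Enc(x))
        # One scalar energy per sample: mean squared reconstruction error.
        energy = (recon - x).pow(2).flatten(start_dim=1).mean(dim=1)
        return energy, latent                               # latent is reused by the PT term below
```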
The advantages of using an auto-encoder instead of a conventional binary classifier are:
- Rather than a single bit of information, a reconstruction-based output provides a diverse set of targets for the discriminator. With the binary logistic loss there are only two targets, so within one minibatch the gradients corresponding to different samples are most likely far from orthogonal, which makes training inefficient. Reconstruction loss, on the other hand, is likely to produce gradients in many different directions within a minibatch, allowing larger minibatch sizes with essentially no loss of efficiency.
- Auto-encoders have traditionally been used as energy-based models, so they are a natural choice here. When trained with a regularization term, an auto-encoder can learn an energy manifold without supervision or negative samples. This means that even if the EBGAN auto-encoder discriminator were trained on real samples alone, it could still discover the data manifold by itself, which is impossible with a binary logistic loss.

A well-known problem when training auto-encoders is that the model may simply learn the identity function, i.e. assign zero energy to the entire input space. To avoid this, the model must be made to assign higher energy to points outside the data manifold, for example by regularizing the middle hidden (latent) layer. Such regularization limits the capacity of the auto-encoder so that it assigns low energy only to a smaller region of the input space.
In the EBGAN framework, the energy function (discriminator) can be seen as being regularized by the contrastive samples that the generator produces, since the discriminator is trained to assign them high reconstruction energy. Viewed this way, the EBGAN framework is more flexible than hand-crafted regularizers because:
- the regularizer (the generator) is fully trainable rather than manually designed;
- the adversarial training scheme directly couples the production of contrastive samples with the learning of the energy function.
Building on this, the authors propose the "repelling regularizer", which keeps the model from producing samples that are clustered in one or only a few modes of \(p_{data}\). Another technique, "minibatch discrimination" (Improved Techniques for Training GANs), follows the same principle.
The proposed regularizer is implemented as a "pulling-away term" (PT). Let \(S \in \mathbb{R}^{s \times N}\) denote the output of the discriminator's encoder layer on a minibatch, where \(N\) is the number of samples in the batch and \(s\) is the dimension of the latent code, so that \(S_i\) is the latent vector of the i-th sample. The PT term is

\(f_{PT}(S) = \frac{1}{N(N-1)} \sum_i \sum_{j \neq i} \left( \frac{S_i^{\top} S_j}{\|S_i\| \, \|S_j\|} \right)^2\)

PT operates on a minibatch and attempts to make the pairwise sample representations orthogonal. Cosine similarity is used instead of Euclidean distance so that the term is bounded below and invariant to scale.
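A minimal PyTorch sketch of \(f_{PT}\), assuming `latent` is the \((N, s)\) matrix of encoder outputs for one minibatch with \(N > 1\) (e.g. the second output of the `AEDiscriminator` sketch above):

```python
import torch

def pulling_away_term(latent: torch.Tensor) -> torch.Tensor:
    """Mean squared pairwise cosine similarity over the N(N-1) distinct pairs."""
    n = latent.size(0)
    normalized = latent / latent.norm(dim=1, keepdim=True)   # unit-length rows
    cosine = normalized @ normalized.t()                      # (N, N) cosine-similarity matrix
    # The diagonal entries are self-similarities (exactly 1), so subtracting n
    # removes them from the sum of squared similarities.
    return (cosine.pow(2).sum() - n) / (n * (n - 1))
```

In the paper the PT term is used in the generator loss only, so the generator objective becomes \(L_G + \lambda_{PT} f_{PT}(S)\), with \(S\) computed on the encoder outputs of the generated samples (configuration (c) below uses \(\lambda_{PT} = 0.1\)).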
2. Experimental Analysis
2.1 Hyper-parameter search on MNIST
The authors compare the results of EBGAN and GAN under a range of hyper-parameter settings.
Only a subset of the parameter configurations is shown in the table above (for EBGAN, for instance, only the encoder layer structure is listed). The common training settings, summarized in the sketch after this list, include:
- The margin \(m\) is set to 10 and kept fixed during training;
- a batch-normalization layer follows every weight layer, except the generator's output layer and the discriminator's input layer;
- training images are scaled to \([-1, 1]\) to match the tanh activation used at the generator's output layer;
- ReLU is used as the non-linear activation function;
- initialization: the discriminator's weights are drawn from \(N(0, 0.002)\) and the generator's from \(N(0, 0.02)\); biases are initialized to 0.
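The settings above can be collected in a small config sketch like the following (a hypothetical summary for readability, not code from the paper):

```python
# Hypothetical summary of the MNIST training settings listed above.
mnist_training_config = {
    "margin": 10,                                 # m, kept fixed during training
    "batchnorm": "all weight layers except G's output and D's input",
    "input_range": (-1.0, 1.0),                   # images rescaled to match G's tanh output
    "activation": "relu",
    "weight_init_std": {"D": 0.002, "G": 0.02},   # weights ~ N(0, std)
    "bias_init": 0.0,
}
```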
The models are evaluated with the "Inception score" (Improved Techniques for Training GANs).
The best results are obtained when the PT regularizer is added; the corresponding hyper-parameters are:
- (a): \(nLayerG=5, nLayerD=2, sizeG=1600, sizeD=1024, dropoutD=0, optimD=SGD, optimG=SGD, lr=0.01\)
- (b): \(nLayerG=5, nLayerD=2, sizeG=800, sizeD=1024, dropoutD=0, optimD=Adam, optimG=Adam, lr=0.001, margin=10\)
- (c): same as (b), with \(\lambda_{PT} = 0.1\)
2.2 Semi-supervised learning based on MNIST
(See Appendix D of the paper for details.)
References:
- Generative Adversarial Nets
- Energy-Based Generative Adversarial Networks (EBGAN)