0. Background
Tim Salimans and colleagues argue that although earlier GANs can produce good samples, training a GAN is in essence searching for the Nash equilibrium of a non-convex game over a continuous, high-dimensional parameter space. Unfortunately, finding a Nash equilibrium is a very hard problem. Algorithms exist for specific scenarios, but none of them suits the GAN game: in practice a GAN is trained by applying gradient descent to the two objective functions, which seeks a low value of each objective rather than the Nash equilibrium of the game. Moreover, the objective functions are non-convex, the parameters are continuous, and the parameter space is extremely high-dimensional, so when these algorithms are used to search for a Nash equilibrium they may fail to converge.
A Nash equilibrium is a point at which no player can reduce their own loss by changing only their own parameters. Intuitively, this suggests minimizing each player's loss simultaneously by ordinary gradient descent. Suppose the losses of the discriminator and the generator are \(J^{(D)}(\theta^{(D)},\theta^{(G)})\) and \(J^{(G)}(\theta^{(D)},\theta^{(G)})\). A Nash equilibrium is a point \((\theta^{(D)},\theta^{(G)})\) at which \(J^{(D)}\) is minimal with respect to \(\theta^{(D)}\) and \(J^{(G)}\) is minimal with respect to \(\theta^{(G)}\). This is very difficult to achieve, however, because an update of \(\theta^{(D)}\) that lowers \(J^{(D)}\) can raise \(J^{(G)}\), and an update of \(\theta^{(G)}\) that lowers \(J^{(G)}\) can raise \(J^{(D)}\), so simultaneous gradient descent may never converge. As an example:
- One player's objective function is \(xy\), and its parameter is \(x\);
- the other player's objective function is \(-xy\), and its parameter is \(y\).
Gradient descent settles into a stable orbit around the ideal equilibrium point \(x=y=0\) instead of converging to it.
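To see this concretely, here is a minimal numerical sketch (my own illustration, not from the paper): each player takes a simultaneous gradient step on its own objective, and the iterates spiral around \((0,0)\) rather than converging.

```python
# Toy minimax game: player 1 minimizes x*y over x, player 2 minimizes -x*y over y.
lr = 0.1
x, y = 1.0, 1.0
for step in range(1000):
    grad_x = y    # d(xy)/dx
    grad_y = -x   # d(-xy)/dy
    x, y = x - lr * grad_x, y - lr * grad_y  # simultaneous updates
# Each step multiplies the squared radius x^2 + y^2 by (1 + lr^2),
# so the iterates orbit outward instead of converging to (0, 0).
print(x, y, x**2 + y**2)
```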
In summary, training a GAN by simultaneously minimizing each player's loss with gradient descent comes with no guarantee of convergence. Tim Salimans and colleagues therefore proposed several techniques intended to make training converge better, including:
- Feature matching: similar in spirit to maximum mean discrepancy;
- minibatch features: borrows ideas from batch normalization (BN);
- virtual batch normalization: an extension of BN.
1. Proposed techniques
1.1 Feature matching
Feature matching counters instability by giving the generator a new objective that prevents it from overtraining on the current discriminator. Instead of directly maximizing the discriminator's output, the generator is asked to produce data that matches the statistics of the real data, and the discriminator is used only to specify which statistics are worth matching. Concretely, Salimans et al. train the generator to match the expected activations on an intermediate layer of the discriminator. That is:
Let \(f(x)\) denote the activations of an intermediate layer of the discriminator. The generator's new objective is \(\|E_{x\sim p_{data}}f(x)-E_{z\sim p_z(z)}f(G(z))\|^2_2\), while the discriminator, and hence \(f(x)\), is trained as before. The new objective has a fixed point at which the generator exactly matches the distribution of the training data. There is no guarantee of reaching that fixed point in practice, but experiments show that feature matching does make ordinary GAN training more stable and effective.
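A minimal PyTorch-style sketch of this loss (the `features` method exposing the intermediate activations \(f(x)\) is a hypothetical helper, not a standard API):

```python
import torch

def feature_matching_loss(D, real_batch, fake_batch):
    """Generator loss: match mean intermediate-layer activations of D."""
    f_real = D.features(real_batch).mean(dim=0).detach()  # E_x f(x), treated as constant
    f_fake = D.features(fake_batch).mean(dim=0)           # E_z f(G(z)), grads flow to G
    return ((f_real - f_fake) ** 2).sum()                 # squared L2 distance
```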
1.2 Minibatch discrimination
A major failure mode of GAN training is mode collapse: the generator settles into a parameter setting at which it always emits the same point. When the generator collapses in this way, the gradients of the discriminator point in similar directions for many similar samples. Because the discriminator processes each sample independently, there is no coordination between these gradients, and nothing tells the generator how different its outputs are from one another. In other words, the discriminator can tell real from fake, but it cannot tell whether the generator keeps producing the same output.
Salimans et al. avoid this problem by letting the discriminator look at multiple samples in combination, a technique they call "minibatch discrimination":
- Let the \(i\)-th input be \(x_i\);
- let \(f(x_i)\in\mathbb{R}^A\) denote the feature vector produced by an intermediate layer of the discriminator;
- multiply it by a tensor \(T\in\mathbb{R}^{A\times B\times C}\) to obtain a matrix \(M_i\in\mathbb{R}^{B\times C}\);
- compute the \(L_1\) distances between corresponding rows of the matrices \(M_i\), \(i\in\{1,2,\dots,n\}\), obtained from the different samples, and apply a negative exponential:
\(c_b(x_i,x_j)=\exp\left(-\|M_{i,b}-M_{j,b}\|_{L_1}\right)\in\mathbb{R}\)
where \(b\) indexes row \(b\) of the matrix;
- the output \(o(x_i)_b\) of the minibatch layer for sample \(x_i\) is defined as the sum of the \(c_b(x_i,x_j)\) between \(x_i\) and the other samples:
\(o(x_i)_b=\sum_{j=1}^n c_b(x_i,x_j)\in\mathbb{R}\)
\(o(x_i)=\left[o(x_i)_1,o(x_i)_2,\dots,o(x_i)_B\right]\in\mathbb{R}^B\)
\(o(\mathbf{X})\in\mathbb{R}^{n\times B}\)
Figure 1.2.1 Minibatch Discrimination structure diagram
As described above, the final step stacks the per-sample vectors \(o(x_i)\) by rows to obtain \(o(\mathbf{X})\), which is then fed into the next layer of the discriminator. In minibatch discrimination, the minibatch features of the fake data and of the real data are computed separately (i.e., a minibatch never mixes fake and real samples). The discriminator still outputs a probability that each individual sample is real, but it can now use the other samples in the minibatch as side information. This lets the generator produce visually appealing samples much faster, and in this respect minibatch discrimination beats feature matching. Interestingly, when the goal is a strong classifier for semi-supervised learning, feature matching works better.
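A minimal PyTorch sketch of such a layer (dimension names \(A\), \(B\), \(C\) follow the list above; excluding the self-similarity term is a common implementation choice):

```python
import torch
import torch.nn as nn

class MinibatchDiscrimination(nn.Module):
    """A: input feature size, B: number of kernels, C: kernel dimension."""
    def __init__(self, A, B, C):
        super().__init__()
        self.B, self.C = B, C
        self.T = nn.Parameter(torch.randn(A, B * C) * 0.1)  # tensor T in R^{A x B x C}

    def forward(self, f):                          # f: (n, A) intermediate features
        M = (f @ self.T).view(-1, self.B, self.C)  # M_i in R^{B x C}
        diff = M.unsqueeze(0) - M.unsqueeze(1)     # (n, n, B, C) pairwise row differences
        c = torch.exp(-diff.abs().sum(dim=3))      # c_b(x_i, x_j), shape (n, n, B)
        o = c.sum(dim=1) - 1.0                     # sum over j; drop the j == i term (exp(0) = 1)
        return torch.cat([f, o], dim=1)            # append o(x_i) to the features
```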
1.3 Historical averaging
With historical averaging, each player's loss includes the term \(\|\theta-\frac{1}{t}\sum_{i=1}^t\theta[i]\|^2\), where \(\theta[i]\) is the value of the parameters at past time \(i\). The historical average of the parameters can be updated online, so this learning rule scales to long time series. The approach is inspired by the fictitious play algorithm for solving games iteratively. Salimans et al. found that it can find equilibria of low-dimensional, continuous non-convex games, for example the minimax game in which one player controls \(x\), the other controls \(y\), and the value function is:
\[(f(x)-1)(y-1),\qquad f(x)=\begin{cases}x,& x<0\\x^2,& \text{otherwise}\end{cases}\]
In this toy game, plain gradient descent fails to find the equilibrium point, while historical averaging succeeds.
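A minimal sketch of the penalty term, keeping the running parameter mean up to date online (class and method names are mine, not the paper's):

```python
import torch

class HistoricalAverage:
    """Tracks the running mean of the parameters and scores ||theta - mean||^2."""
    def __init__(self, params):
        self.t = 0
        self.avg = [p.detach().clone() for p in params]

    def update(self, params):
        # Online mean update: avg += (theta - avg) / t.
        self.t += 1
        for a, p in zip(self.avg, params):
            a += (p.detach() - a) / self.t

    def penalty(self, params):
        return sum(((p - a) ** 2).sum() for p, a in zip(params, self.avg))
```

During training, one would add `weight * hist.penalty(model.parameters())` to the player's loss before backpropagation and call `hist.update(model.parameters())` after each optimizer step.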
1.4 One-sided label smoothing
Label smoothing, a technique dating back to the 1980s that has recently been revived, replaces the 0 and 1 classification targets with smoothed values such as 0.1 and 0.9. It also improves the robustness of neural networks to adversarial examples.
If the positive-class target is replaced by \(\alpha\) and the negative-class target by \(\beta\), the optimal discriminator becomes:
\[D(x)=\frac{\alpha\, p_{data}(x)+\beta\, p_{model}(x)}{p_{data}(x)+p_{model}(x)}\]
The presence of \(p_{model}\) in the numerator is problematic: in regions where \(p_{data}\) is close to 0 and \(p_{model}\) is large, erroneous samples from \(p_{model}\) feel no pressure to move closer to the data (i.e., the network is not pushed to learn the real data distribution). Therefore only the positive targets are smoothed, to \(\alpha\), while the negative targets are left at 0 (i.e., the negative class is not smoothed).
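A minimal sketch of the resulting discriminator loss (function name is mine; the discriminator is assumed to output raw logits):

```python
import torch
import torch.nn.functional as F

def d_loss_one_sided(real_logits, fake_logits, alpha=0.9):
    """Discriminator loss with one-sided label smoothing: real targets are
    smoothed to alpha (e.g. 0.9); fake targets stay at exactly 0."""
    real_targets = torch.full_like(real_logits, alpha)
    fake_targets = torch.zeros_like(fake_logits)
    return (F.binary_cross_entropy_with_logits(real_logits, real_targets)
            + F.binary_cross_entropy_with_logits(fake_logits, fake_targets))
```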
1.5 Virtual batch normalization (VBN)
BN greatly improves the optimization of neural networks, but it makes the network's output for one input sample depend heavily on the other inputs in the same minibatch. To avoid this problem, virtual batch normalization (VBN) normalizes each sample using statistics computed on a reference batch, which is chosen once at the start of training and kept fixed thereafter. (The reference batch itself is normalized using its own statistics.) VBN is computationally expensive because each forward pass must propagate two minibatches, so it is used only in the generator network.
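A simplified sketch for fully-connected activations (naming is mine; the reference-batch activations must be recomputed at each layer and each step, which is why VBN roughly doubles the cost of a forward pass):

```python
import torch
import torch.nn as nn

class VirtualBatchNorm(nn.Module):
    """Normalize x with mean/variance taken from a fixed reference batch."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x, ref):
        # ref: activations of the reference batch at this layer, obtained by
        # forward-propagating the fixed reference inputs alongside x.
        mean = ref.mean(dim=0)
        var = ref.var(dim=0, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```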
2. Image quality assessment
Because a GAN lacks a comparable objective function, its performance cannot be compared directly with that of other models. One intuitive metric is to have humans assess the visual quality of samples, but this becomes infeasible when the number of samples is large. An alternative is to use another model to evaluate the quality of the generated data: apply the Inception model to every generated sample to obtain the conditional label distribution \(p(y|x)\). We expect that:
- an image containing a meaningful object should have a conditional label distribution \(p(y|x)\) with low entropy;
- the marginal distribution over the images generated by the model, \(\int p(y|x=G(z))\,dz\), should have high entropy.
Combining the two requirements gives the metric \(\exp(E_x KL(p(y|x)\,\|\,p(y)))\); the exponentiation just makes the values easier to compare. This metric cannot be used as a training objective, but it is a good automatic substitute for manual evaluation.
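A minimal NumPy sketch of this score (obtaining the Inception predictions themselves is assumed to be done elsewhere):

```python
import numpy as np

def inception_score(p_yx):
    """p_yx: (n_samples, n_classes) array, row i holding p(y|x_i) from Inception."""
    p_y = p_yx.mean(axis=0, keepdims=True)      # marginal p(y) over the samples
    kl = (p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))             # exp(E_x KL(p(y|x) || p(y)))
```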
3. Semi-supervised learning
A standard \(K\)-class classifier outputs \(p_{model}(y=j|x)=\frac{\exp(l_j)}{\sum_{k=1}^K \exp(l_k)}\). In supervised learning, such a model is trained by minimizing the cross-entropy between the true labels and the model's predictive distribution \(p_{model}(y|x)\).
Semi-supervised learning with a standard classifier adds samples from the GAN generator to the dataset, labeling the generated data as a new "generated" class \(y=K+1\) and correspondingly expanding the classifier's output dimension from \(K\) to \(K+1\). Then \(p_{model}(y=K+1|x)\) gives the probability that the input sample is fake, corresponding to \(1-D(x)\) in the GAN framework. We can now also learn from unlabeled data, by maximizing \(\log p_{model}(y\in\{1,\dots,K\}|x)\). Assuming half of the dataset is real data and half is generated data, the loss function for training the classifier is:
\[L=-E_{x,y\sim p_{data}(x,y)}\left[\log p_{model}(y|x)\right]-E_{x\sim G}\left[\log p_{model}(y=K+1|x)\right]\]
This splits into two parts:
\(L=L_{supervised}+L_{unsupervised}\)
where
\(L_{supervised}=-E_{x,y\sim p_{data}(x,y)}\log p_{model}(y|x,y<K+1)\)
\(L_{unsupervised}=-\{E_{x\sim p_{data}(x)}\log\left[1-p_{model}(y=K+1|x)\right]+E_{x\sim G}\log\left[p_{model}(y=K+1|x)\right]\}\)
The total cross-entropy loss thus decomposes into a standard supervised loss and an unsupervised loss, and the unsupervised loss is exactly the standard GAN loss: substituting \(D(x)=1-p_{model}(y=K+1|x)\) gives:
\(L_{unsupervised}=-\{E_{x\sim p_{data}(x)}\log D(x)+E_{z\sim noise}\log(1-D(G(z)))\}\)
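A minimal PyTorch sketch of the two terms (function and variable names are mine; class index \(K\), 0-indexed, is the "generated" class):

```python
import torch
import torch.nn.functional as F

def ssl_losses(labeled_logits, labels, unlabeled_logits, fake_logits):
    """All logits have shape (n, K+1); labels take values in 0..K-1."""
    K = labeled_logits.size(1) - 1
    # Supervised term: cross-entropy over the K real classes only,
    # i.e. -log p_model(y | x, y < K+1).
    l_sup = F.cross_entropy(labeled_logits[:, :K], labels)
    # Unsupervised term: unlabeled real data should not be class K+1,
    # generated data should be class K+1.
    p_fake_u = F.softmax(unlabeled_logits, dim=1)[:, K]
    log_p_fake_g = F.log_softmax(fake_logits, dim=1)[:, K]
    l_unsup = -(torch.log(1.0 - p_fake_u + 1e-12).mean() + log_p_fake_g.mean())
    return l_sup, l_unsup
```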
Moreover, there exists a scaling function \(c(x)\) such that \(\exp[l_j(x)]=c(x)\,p(y=j,x)\ \forall j<K+1\) and \(\exp[l_{K+1}(x)]=c(x)\,p_G(x)\); with this choice, the supervised and unsupervised losses share the same optimal solution, which is reached by minimizing both losses together. In practice, \(L_{unsupervised}\) only helps if minimizing it is not trivial for the classifier, so \(G\) must be trained to approximate the real data distribution. One way to do this is to train \(G\) to minimize the GAN value function, using the classifier as the discriminator \(D\). Salimans et al. do not fully understand the interaction between \(G\) and the classifier, but experiments show that for semi-supervised learning, optimizing \(G\) with feature matching works very well, whereas minibatch discrimination has little effect.
The \(K+1\)-class classifier here is over-parameterized: subtracting a function \(f(x)\) from every output logit, \(l_j(x)\leftarrow l_j(x)-f(x)\ \forall j\), does not change the softmax output. This means we may fix \(l_{K+1}(x)=0\ \forall x\), in which case \(L_{supervised}\) becomes the standard supervised loss of a \(K\)-class classifier, and the discriminator is \(D(x)=\frac{Z(x)}{Z(x)+1}\), where \(Z(x)=\sum_{k=1}^K \exp[l_k(x)]\).
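A short worked expansion of this invariance (my own derivation, filling in the step):
\[p_{model}(y=j|x)=\frac{\exp(l_j-f)}{\sum_k \exp(l_k-f)}=\frac{e^{-f}\exp(l_j)}{e^{-f}\sum_k \exp(l_k)}=\frac{\exp(l_j)}{\sum_k \exp(l_k)}\]
so choosing \(f(x)=l_{K+1}(x)\) fixes \(l_{K+1}\equiv 0\), and then
\[D(x)=1-p_{model}(y=K+1|x)=1-\frac{1}{\sum_{k=1}^K \exp[l_k(x)]+1}=\frac{Z(x)}{Z(x)+1}\]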
4.1 The importance of labels for image quality
Reference: Tim Salimans et al., "Improved Techniques for Training GANs" [Improved GAN].