Https://www.bilibili.com/video/av9770302/?p=15
It was mentioned earlier that auto-encoders (AE) and VAEs can be used for generation.
The problem with VAE:
AE training makes the input and output as close as possible, so the generated images merely imitate the training set; the model cannot generate images it has never seen, i.e. genuinely new images.
Since VAE does not really understand or learn how to generate new images, it cannot tell which of the two cases in the following example is good and which is bad, because from the loss's point of view each differs from the target "7" by just one pixel.
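As a minimal illustration of this point (the toy images and pixel values below are made up, not from the lecture): two candidate images that each differ from the target by a single pixel receive exactly the same pixel-wise loss, even though one difference is perceptually harmless and the other is not.

    import numpy as np

    # Toy 8x8 "7": a horizontal stroke plus a diagonal stroke.
    target = np.zeros((8, 8))
    target[1, 1:7] = 1.0
    for i in range(2, 7):
        target[i, 7 - i] = 1.0

    # Candidate A: the horizontal stroke is one pixel longer (looks fine).
    cand_a = target.copy()
    cand_a[1, 7] = 1.0

    # Candidate B: one stray pixel far from the digit (looks wrong).
    cand_b = target.copy()
    cand_b[6, 6] = 1.0

    # Pixel-wise loss cannot tell the two cases apart.
    print(np.abs(cand_a - target).sum())  # 1.0
    print(np.abs(cand_b - target).sum())  # 1.0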
This is what motivates GAN.
As everyone knows, GAN is an adversarial network: a generator and a discriminator compete with each other, and this competition is a gradual process of co-evolution.
The process is:
We train the V1 discriminator on the V1 generator's output together with real images, so that the V1 discriminator can distinguish the two.
Then the V1 generator and the V1 discriminator are trained together as one combined network (here the discriminator's parameters must be fixed); the goal is to make the generator produce images that fool the V1 discriminator.
This produces a V2 generator. Repeating the process above lets the generator and the discriminator each evolve gradually.
The detailed process of training the discriminator:
The detailed process of training the generator:
You can see that the generator adjusts its parameters so that the resulting image makes the discriminator output 1, i.e. the discriminator is fooled.
Note that in the combined network, although the generator and discriminator are trained together, the discriminator's parameters must be frozen; otherwise the discriminator could simply adapt itself to accommodate the generator, and there would be no adversarial effect. A minimal sketch of this alternating training loop is given below.
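The following is a minimal sketch of the alternating procedure, assuming PyTorch; the network architectures, dimensions, and learning rates are illustrative placeholders, not taken from the lecture.

    import torch
    import torch.nn as nn

    z_dim = 100
    G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
    D = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def train_step(real_images):                      # real_images: (batch, 784)
        batch = real_images.size(0)
        ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

        # Step 1: train the discriminator -- real images -> 1, generated -> 0.
        z = torch.randn(batch, z_dim)
        fake = G(z).detach()                          # no gradients into G here
        loss_d = bce(D(real_images), ones) + bce(D(fake), zeros)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Step 2: train the generator with the discriminator effectively frozen:
        # gradients flow through D, but only opt_g updates parameters, so D stays fixed.
        z = torch.randn(batch, z_dim)
        loss_g = bce(D(G(z)), ones)                   # try to make D say "real" on fakes
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        return loss_d.item(), loss_g.item()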
The following is a theoretical view of GAN.
The purpose of GAN is to generate a distribution that is close to the target distribution (the distribution represented by the training set).
P_data is the distribution represented by the training data.
P_G is the distribution we want to generate.
So our goal is to make P_G and P_data as close as possible.
Sample m points from P_data, then evaluate these points under P_G and compute the likelihood, i.e. maximum likelihood estimation.
Making the probability of these points under P_G as large as possible pushes the P_G distribution toward P_data.
The derivation here shows that the maximum likelihood estimate above is equivalent to minimizing the KL divergence between P_data and P_G, which makes sense, since the KL divergence is itself a measure of how similar two distributions are.
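A compact version of that standard derivation (the last step holds because E_{x~P_data}[log P_data(x)] does not depend on theta, so subtracting it only shifts the objective by a constant):

    \theta^{*} = \arg\max_{\theta} \prod_{i=1}^{m} P_G(x^{i};\theta)
               = \arg\max_{\theta} \sum_{i=1}^{m} \log P_G(x^{i};\theta)
               \approx \arg\max_{\theta} \mathbb{E}_{x \sim P_{data}}\big[\log P_G(x;\theta)\big]
               = \arg\min_{\theta} \mathrm{KL}\big(P_{data} \,\|\, P_G\big)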
Here P_G can be any parametrized model; for example, you can use a Gaussian mixture model for P_G, in which case theta consists of the parameters (mean, covariance) and weight of each Gaussian component.
So given the parameters and a set of samples x, we can compute P_G with the Gaussian mixture formula, and according to the derivation above we obtain the KL divergence between the two distributions.
Of course, a Gaussian mixture model is not expressive enough to fit P_data well. A sketch of this classic approach follows.
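A small sketch of that classic maximum-likelihood approach, assuming scikit-learn; the toy data and component count are illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy "P_data": samples from a distribution we pretend not to know.
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 0.5, 500),
                        rng.normal(3, 1.0, 500)]).reshape(-1, 1)

    # Fit a Gaussian mixture as P_G; fitting maximizes the likelihood of x.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

    # Average log-likelihood of the samples under P_G (the MLE objective above).
    print(gmm.score(x))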
So here is the first advantage of GAN: we can use a neural network to represent P_G.
This figure shows GAN's generator. z follows a Gaussian distribution; exactly which distribution z follows is not critical, it could be some other distribution.
Through the function G(z), each point z sampled from the Gaussian is mapped to a point x; computing many such x gives the distribution P_G. As long as the NN is complex enough, even though z is Gaussian, x can follow an arbitrary distribution. A tiny sketch of this mapping is shown below.
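A tiny sketch of the idea, assuming PyTorch; the network here is an untrained placeholder, and in practice G would be learned.

    import torch
    import torch.nn as nn

    # Placeholder generator: maps 2-D Gaussian noise to 2-D samples.
    G = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))

    z = torch.randn(10000, 2)   # z ~ N(0, I), the prior distribution
    x = G(z)                    # samples from P_G: the Gaussian pushed through G
    # Even this random, untrained G already maps the Gaussian to a different
    # distribution; training shapes that distribution toward P_data.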
The difference from traditional methods such as the Gaussian mixture is that this likelihood, i.e. P_G(x), can no longer be written down: because G is a neural network, we have no way to directly compute the KL divergence between the two distributions.
So GAN needs a discriminator, which is also a neural network; the discriminator is used to measure the similarity between P_G and P_data indirectly, replacing the explicit KL-divergence computation.
GAN is thus divided into a generator G and a discriminator D, where D is used to measure the similarity between P_G and P_data.
The formula for the final optimization objective looks intimidating, containing both a min and a max.
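For reference, this objective, as written in the original GAN paper, is

    \min_{G} \max_{D} V(G, D), \qquad
    V(G, D) = \mathbb{E}_{x \sim P_{data}}\big[\log D(x)\big]
            + \mathbb{E}_{x \sim P_{G}}\big[\log\big(1 - D(x)\big)\big]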
It actually splits into two steps:
Given G, optimize D so that V is maximized (the red curve in the slide), which is training the discriminator to compute the difference between the two distributions; this corresponds to finding the red dot in each of the small plots.
Given D, optimize G to achieve min (max V), which is training the generator to minimize the difference between the two distributions; this corresponds to picking out G3 in the middle of the slide.
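In symbols (standard notation, not from the slide):

    D^{*} = \arg\max_{D} V(G, D), \qquad
    G^{*} = \arg\min_{G} \max_{D} V(G, D)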
Here there is a question that is not yet clear:
Why is it that, given G, optimizing D so that V is maximized makes the resulting V represent the difference between the two distributions?
If this is understood, then the next step, optimizing G to minimize this difference, is easy to understand.
Do some simple transformations: if we want the whole integral to be as large as possible, this is equivalent to maximizing, for each x, the content of the integrand.
Here, with G given, x, P_data(x), and P_G(x) are constants, so the integrand reduces to a simple function of D.
To find the maximum (an extremum), take the derivative and look for the stationary point.
Here we derive the value of D at which V is maximal; note that D's output range should be between 0 and 1.
The derivation above shows that to maximize V, D must satisfy the condition below.
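The standard calculation (as in the original GAN derivation): for a fixed x, write a = P_data(x) and b = P_G(x), so the integrand is

    f(D) = a \log D + b \log(1 - D), \qquad
    f'(D) = \frac{a}{D} - \frac{b}{1 - D} = 0
    \;\Rightarrow\; D^{*}(x) = \frac{a}{a + b} = \frac{P_{data}(x)}{P_{data}(x) + P_G(x)}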
Substituting this optimal D back into V and carrying out a series of manipulations, V turns out to be equivalent to the Jensen-Shannon divergence (up to a constant).
The definition of the Jensen-Shannon divergence is as follows.
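The standard definition is

    \mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}\big(P \,\|\, M\big)
                             + \tfrac{1}{2}\,\mathrm{KL}\big(Q \,\|\, M\big), \qquad M = \tfrac{1}{2}(P + Q)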
It behaves better than the KL divergence: KL is asymmetric, while the Jensen-Shannon divergence is symmetric and reflects the difference between two distributions more faithfully.
The derivation here then proves that, given G, when D is optimized so that V is maximal, V represents (up to a constant) the Jensen-Shannon divergence between P_data and P_G, so this max V can represent the difference between the two distributions, which answers the earlier question.
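Concretely, plugging D* back into V gives the standard result from the GAN paper:

    \max_{D} V(G, D) = -2 \log 2 + 2\,\mathrm{JSD}\big(P_{data} \,\|\, P_G\big)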
GAN (Generative Adversarial Network)