Author: Cao Gongze
Link: https://zhuanlan.zhihu.com/p/24421479
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
arXiv link
Recently, a GAN paper at NIPS has attracted a lot of attention, with plenty of discussion on Reddit, because its generated results are very impressive, as shown in the paper's comparison figure:
The bottom row is StackGAN; its generated images are not only the highest resolution but also the most realistic.
The main motivation of StackGAN is that if we can't generate high-resolution, plausible images in one shot, we can do it in two stages. The idea of generating a picture in stages is not new: Denton et al.'s LAPGAN repeatedly refines a generated low-resolution picture, and Xiaolong Wang et al.'s S²GAN divides generation into a Structure stage and a Style stage. Of course, neither of those uses a caption as the condition, so there is no direct comparison.
The text-to-image task has been tackled with GANs before, for example in "Generative Adversarial Text to Image Synthesis," whose approach is basically the same as StackGAN's: add a text embedding as a condition to both the generator and the discriminator. But StackGAN's structure is fancier, as follows.
First Stage:
Starting with the embedding: StackGAN does not use the text embedding directly as the condition. Instead, it passes the embedding through an FC layer to produce the mean and variance of a normal distribution, then samples the condition from that distribution. The reasoning is that the embedding is usually fairly high-dimensional, while the number of texts is quite small relative to that dimensionality; if the embedding were used directly as the condition, this latent variable would be very sparse in latent space, which is bad for training. My understanding is that if the number of texts is small, then even with a high-dimensional latent variable, one of its dimensions is at most a discrete text embedding; effectively there are fewer truly continuous random variables, so the generated data manifold becomes discontinuous (because of the lower dimensionality), which is not what we want. Sampling the condition from a parameterized normal distribution instead means that points around the embedding also serve as conditions, which increases both the effective number of texts and the condition's dimensionality. To prevent this distribution from degenerating or its variance from blowing up, the generator's loss adds a regularization term on it: the KL divergence D_KL( N(μ(φ_t), Σ(φ_t)) ‖ N(0, I) ) between the conditioning distribution and the standard Gaussian.
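Conceptually this Conditioning Augmentation step is a VAE-style reparameterization. A minimal PyTorch sketch, where the dimensions and names (emb_dim, cond_dim) are illustrative choices of mine, not from the paper's code:

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, emb_dim=1024, cond_dim=128):
        super().__init__()
        # One FC layer predicts both the mean and the log-variance
        # of the conditioning Gaussian.
        self.fc = nn.Linear(emb_dim, cond_dim * 2)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        # Reparameterization trick: c = mu + sigma * eps, so gradients
        # flow back through mu and logvar.
        std = torch.exp(0.5 * logvar)
        c = mu + std * torch.randn_like(std)
        # KL( N(mu, sigma^2) || N(0, I) ), added to the generator loss
        # to keep the conditioning distribution from degenerating.
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=1).mean()
        return c, kl
```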
The generator does not use the common deconvolution. Instead, it stacks several upsample-then-3×3-conv blocks, a recently proposed method for avoiding the checkerboard artifacts of deconvolution.
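A minimal sketch of one such resize-then-conv block, with layer choices (nearest-neighbor upsampling, BatchNorm, ReLU) that are illustrative rather than the paper's exact ones:

```python
import torch.nn as nn

def up_block(in_ch, out_ch):
    """Upsample by 2x, then apply a 3x3 conv, instead of a deconv."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),       # resize first...
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # ...then convolve
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Stacking e.g. four such blocks takes a 4x4 feature map to 64x64.
```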
The discriminator is a number of stride-2 convolutions, whose output feature map is concatenated with the spatially replicated embedding, followed by an FC layer; a sketch of this fusion is below.
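A rough sketch of that conditioning fusion, assuming the stride-2 convs have already reduced the image to a 4×4 feature map; the channel sizes are my guesses, not the paper's exact ones:

```python
import torch
import torch.nn as nn

class CondDiscriminatorHead(nn.Module):
    def __init__(self, feat_ch=512, cond_dim=128):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Conv2d(feat_ch + cond_dim, feat_ch, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.fc = nn.Linear(feat_ch * 4 * 4, 1)  # assumes a 4x4 feature map

    def forward(self, img_feat, cond):
        # img_feat: (B, feat_ch, 4, 4); cond: (B, cond_dim)
        B, _, H, W = img_feat.shape
        # Replicate the embedding over the spatial grid, then concatenate
        # it with the image features along the channel dimension.
        cond_map = cond.view(B, -1, 1, 1).expand(-1, -1, H, W)
        x = self.joint(torch.cat([img_feat, cond_map], dim=1))
        return self.fc(x.flatten(1))  # real/fake logit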
Second Stage:
The second-stage generator has no noise input. Instead, it downsamples the first stage's sample and combines it with the augmented embedding (sampled from the Gaussian) as input. After a number of residual blocks, it goes through the same upsampling process as the first stage to produce the picture.
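The data flow of the second-stage generator might look like the following sketch; the channel counts, number of residual blocks, and spatial sizes are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

class StageIIGenerator(nn.Module):
    def __init__(self, cond_dim=128, ch=256, n_res=4):
        super().__init__()
        # Downsample the 64x64 Stage-I image to a 16x16 feature map.
        self.down = nn.Sequential(
            nn.Conv2d(3, ch // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch // 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 2, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(ch + cond_dim, ch, 3, padding=1)
        self.res = nn.Sequential(*[ResBlock(ch) for _ in range(n_res)])
        # Four resize+conv blocks: 16 -> 32 -> 64 -> 128 -> 256.
        ups, c = [], ch
        for _ in range(4):
            ups += [nn.Upsample(scale_factor=2, mode='nearest'),
                    nn.Conv2d(c, c // 2, 3, padding=1), nn.ReLU(inplace=True)]
            c //= 2
        self.up = nn.Sequential(*ups, nn.Conv2d(c, 3, 3, padding=1), nn.Tanh())

    def forward(self, stage1_img, cond):
        f = self.down(stage1_img)                   # (B, ch, 16, 16)
        cmap = cond.view(cond.size(0), -1, 1, 1).expand(-1, -1, 16, 16)
        f = self.fuse(torch.cat([f, cmap], dim=1))  # note: no extra noise input
        return self.up(self.res(f))                 # (B, 3, 256, 256)
```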
The second-stage discriminator is roughly the same as in the first stage.
My thoughts:
To be honest, StackGAN does not have many eye-opening new ideas, but it puts two-stage generation, sentence embedding, and semi-supervised learning together nicely, and the experiments are done well (I imagine this structure is hard to train). Someone on Reddit remarked that unsupervised learning was originally developed so that we would not have to painstakingly label data, yet somewhat ironically, the more labels you feed it, the better it works. This is of course inevitable: with more labels, the generator can decompose a complex distribution (such as ImageNet) into several simpler, lower-dimensional distributions to model, and the discriminator can likewise judge each of them separately.
In addition, the paper only shows some failure samples at the end: some are inconsistent with the text, and some are simply bad samples. For the former, I tend to think it may be an error in the embedding. Given the complexity of the structure, if there were no open-source implementation (that is, if the authors didn't release the source code), I would really wonder how reproducible the results are.
The code has been released:
GitHub