Just finished running an experiment, so let me write an answer on GAN applications in natural language processing.
Directly applying GANs to NLP (mainly sequence generation) runs into two problems:
1. GANs were originally designed to generate continuous data, but in NLP we usually generate sequences of discrete tokens. The generator (G) needs the gradient from the discriminator (D) to train, which requires G and D to be fully differentiable; once discrete variables are involved this breaks down, and backpropagation alone cannot provide a training gradient for G. In a GAN we nudge the parameters of G slightly so that the data it produces looks more "lifelike". If the generated data consists of discrete tokens, such a small nudge is usually meaningless: unlike images, which are continuous so that a small change shows up in the pixels, a small change applied to a token may not correspond to any token at all in the dictionary space (see the short sketch after this list).
2. A GAN can only score a complete generated sequence; how to judge the quality of a partially generated sequence, and how that relates to the quality of the full sequence it eventually becomes, is also a problem.
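To make the first problem concrete, here is a minimal PyTorch sketch (a toy example of my own, not taken from any of the papers below) showing where the gradient chain gets cut once a discrete token is sampled:

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 1000, 128
output_layer = torch.nn.Linear(hidden, vocab_size)

h = torch.randn(1, hidden, requires_grad=True)   # some generator hidden state
logits = output_layer(h)                         # still differentiable
probs = F.softmax(logits, dim=-1)

# Sampling a discrete token id yields an integer index, so no gradient can
# flow from the discriminator's score back through this step to G's parameters.
token_id = torch.multinomial(probs, num_samples=1)
print(token_id.requires_grad)  # False -- the gradient chain is cut here
```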
A few important works:
1. To address these two issues, the earliest work is this AAAI 2017 paper: SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. It was put on arXiv in September 2016, and the source code has also been released.
It uses reinforcement learning to solve the problems above. For the first problem, the output of D is used as a reward and the policy gradient method is used to train G. For the second problem, a Monte Carlo search is used: for a partially generated sequence, a roll-out policy (also an LSTM) samples completions of the full sequence, each completed sequence is scored by D, and the rewards are averaged.
The complete algorithm is shown in the figure from the paper.
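As a rough illustration of the reward computation and the policy-gradient loss, here is a sketch in PyTorch-style Python; `rollout_policy` and `discriminator` are assumed helper callables of mine, not the interfaces of the released SeqGAN code:

```python
import torch

def mc_rollout_reward(prefix, rollout_policy, discriminator, seq_len, n_rollouts=16):
    """Estimate the reward of a partial sequence by completing it several times
    with a roll-out policy and averaging the discriminator scores.
    `rollout_policy(prefix, seq_len)` and `discriminator(seq)` are assumed helpers."""
    scores = []
    for _ in range(n_rollouts):
        full_seq = rollout_policy(prefix, seq_len)   # sample one completion
        scores.append(discriminator(full_seq))       # D's probability of "real"
    return torch.stack(scores).mean()

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss: -sum_t reward_t * log pi(y_t | y_<t)."""
    return -(rewards * log_probs).sum()
```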
Original link: https://arxiv.org/pdf/1609.05473v5.pdf
GitHub Link: Lantaoyu/seqgan
2. The second paper is by Jiwei Li from Chris Manning's group: Adversarial Learning for Neural Dialogue Generation, which uses GANs and reinforcement learning for dialogue systems. If I remember correctly, this paper was the earliest to cite SeqGAN, and some people also say it was the first to apply RL to GANs. Mainly because Jiwei is so well known, the moment it appeared on arXiv it attracted a lot of attention.
As shown in the figure, this paper also uses the policy gradient method to train the GAN, and the approach is not very different from SeqGAN; the main point is that it is applied to a task as hard as dialogue generation. Two points stand out. First, besides using Monte Carlo search to handle partially generated sequences, since MC search is rather time-consuming, one can also train a separate D to score partially generated sequences. In their experiments, though, MC search performs a bit better.
Second, when training G they also use the teacher forcing (MLE) method, which has something in common with MaliGAN below.
The reason is that during adversarial training G never directly sees the real target sequence (the gold-standard target sequence). When G generates a poor-quality sequence (and it is actually quite difficult to generate a good one) and D is well trained, the reward tells G that the sequence is bad, but not how to generate a better one, and training collapses. Therefore, in addition to updating G's parameters through adversarial training, G is also updated with the traditional MLE objective on real sequences. It is like having a "teacher" correcting the deviations during G's training, similar to a regularizer.
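A minimal sketch of what such a mixed update could look like, assuming a hypothetical `generator` object with `sample_with_log_probs()` and `nll()` methods (these names are mine, not from the released code):

```python
import torch

def train_generator_step(generator, discriminator, real_batch, optimizer):
    """One hypothetical G update mixing an adversarial (policy-gradient) term
    with a teacher-forcing (MLE) term, in the spirit described above."""
    # 1) Adversarial part: sample a sequence, use D's score as the reward.
    sampled_seq, log_probs = generator.sample_with_log_probs()
    reward = discriminator(sampled_seq).detach()      # no gradient through D
    pg_loss = -(reward * log_probs).sum()

    # 2) Teacher forcing: maximum-likelihood loss on a real (gold) sequence,
    #    acting like a regularizer that keeps G close to the data.
    mle_loss = generator.nll(real_batch)

    loss = pg_loss + mle_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```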
Original link: https://arxiv.org/pdf/1701.06547.pdf
GitHub Link: jiweil/neural-dialogue-generation
3. At the end of February, Yoshua Bengio's group put three GAN-related papers on arXiv in a row. The one we paid the most attention to is by Tong Che and Yanran Li: Maximum-Likelihood Augmented Discrete Generative Adversarial Networks (MaliGAN). The abbreviation reads a bit strangely...
The work of this paper is mainly in two aspects:
1. It constructs a brand-new objective function for G that uses importance sampling combined with the output of D, making training more stable and the gradient variance lower. Although this objective is similar to the RL approach, it can reduce the variance of the estimator more effectively (I strongly recommend reading the analysis in Section 3.2 of the paper, which argues that the new objective still works both when D is optimal and when D is trained but not optimal). A sketch of the importance-weighted loss is given after this list.
2. Generating a longer sequence requires multiple rounds of random sampling, so the paper also proposes two variance-reduction techniques. The first is Monte Carlo search, which everyone is familiar with by now. The second, called Mixed MLE-Mali training, samples a sequence from the real data; if the sequence length is greater than N, the first N words are fixed and G then runs freely, conditioned on those first N words, to produce m samples, each run to the end of the sequence.
The rationale for generating conditioned on the first N words is that the conditional distribution is simpler than the full distribution, and a strong training signal can be obtained from the real samples. N is then gradually reduced (in experiment 3, N=30 and K=5, where K is the step size: N is reduced by K after each training iteration).
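Here is a sketch of how the importance-weighted generator loss could be implemented, based on my reading of the paper (the normalized weights r(x) = D(x)/(1-D(x)); the function names and the eps constant are mine):

```python
import torch

def maligan_weights(d_scores, eps=1e-8):
    """Normalized importance weights r(x) = D(x) / (1 - D(x)) over a batch
    of m generator samples (my reading of the MaliGAN objective)."""
    r = d_scores / (1.0 - d_scores + eps)
    return r / (r.sum() + eps)

def maligan_generator_loss(log_probs, d_scores):
    """Weighted maximum-likelihood style loss: each sample's log-likelihood is
    scaled by its normalized weight; the weights are treated as constants,
    so no gradient flows back through D."""
    w = maligan_weights(d_scores).detach()
    return -(w * log_probs).sum()
```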
The complete MaliGAN algorithm with Mixed MLE training is as follows (figure from the paper):
In line 12, where the gradient is updated, the second (highlighted) term looks like it should be log p (a senior I greatly admire emailed the authors to ask about this). As for why the gradient of the first term can be approximated in this form, you can refer to another paper from Bengio's group: Boundary-Seeking Generative Adversarial Networks.
The intuition of this BGAN is that G should learn to generate samples lying on D's decision boundary, hence the name boundary-seeking. The authors use a nice trick: as shown in the figure, when D is optimal it satisfies the following condition, where p_data is the true distribution and p_G is the distribution generated by G.
With a small rearrangement we get the form shown in the figure, which is remarkable: even though we do not have a perfect G, we can still recover the true distribution by reweighting p_G with the ratio D/(1-D) of the optimal D. Of course we can hardly obtain the optimal D, but the closer the trained D is to the optimal one, the lower the bias. And training D (a standard binary classifier) is much simpler than training G, because G's objective is a moving target that changes with D.
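For reference, the two identities being referred to (standard results for the optimal discriminator of a vanilla GAN) are:

```latex
% Optimal discriminator of a standard GAN:
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}

% Rearranged: the true distribution can be recovered by reweighting p_G:
p_{\text{data}}(x) = p_G(x)\,\frac{D^*(x)}{1 - D^*(x)}
```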
The paper then gives the mathematical derivation of the gradient, which I will not paste here.
Back to MaliGAN: the authors report experimental results that are better than SeqGAN's, as measured by BLEU score.
Original link: https://arxiv.org/pdf/1702.07983v1.pdf
4. Using SeqGAN for machine translation: in mid-March, the Institute of Automation of the Chinese Academy of Sciences released this paper: Improving Neural Machine Translation with Conditional Sequence Adversarial Nets. The main contribution is applying GANs to a traditional NLP task for the first time, with an improvement of 2 BLEU points.
They call their model CSGAN-NMT. The G is a traditional attention-based NMT model, while D has two variants, one CNN-based and one RNN-based; experiments show the CNN-based one works better. The reason is that an RNN-based classifier reaches very high classification accuracy early in training, so it separates G's outputs from real data too easily, which makes G hard to train (it only ever receives negative signal).
What I think is the key part of the paper is Section 4, the training strategy. GANs are very hard to train: they first pretrain G with MLE, then pretrain D on samples from G and real samples, and once D reaches a certain accuracy they enter the adversarial training stage. The GAN part is basically the same as SeqGAN, with the policy gradient method plus MC search, which I have already covered and will not repeat. However, because G never directly sees the golden target sentence during adversarial training, after every policy-gradient update of G they run one round of professor forcing. Here I am a bit puzzled: I think it is the same as Jiwei's paper, i.e., using the reward from D to update G's parameters and also updating G with MLE (so that G still sees real samples, here the target-language sequences), but that method is teacher forcing, not professor forcing.
Finally, there is no end to training tricks. This paper tried many settings: for example, pretraining D to an accuracy of 0.82 gives the best results, Adam is used for pretraining but RMSProp for adversarial training, and, as in WGAN, D's weights are clipped to a fixed range after every update.
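Putting the training strategy together, here is a hypothetical outline; all the helper callables are assumptions of mine, and only the 0.82 accuracy threshold comes from the paper:

```python
def csgan_nmt_schedule(pretrain_g, pretrain_d_step, d_accuracy,
                       pg_update, mle_update, batches, target_acc=0.82):
    """Hypothetical outline of the training schedule described above; the
    callables are user-supplied helpers, not the authors' code."""
    # 1) Pretrain the generator (attention-based NMT model) with plain MLE.
    pretrain_g()

    # 2) Pretrain the discriminator on real vs. generated translations
    #    until it reaches the target accuracy (0.82 reported as best).
    while d_accuracy() < target_acc:
        pretrain_d_step()

    # 3) Adversarial phase: each policy-gradient update (reward from D via
    #    Monte Carlo search) is followed by one teacher-forcing MLE update
    #    so that G still sees the gold target sentence.
    for batch in batches:
        pg_update(batch)
        mle_update(batch)
```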
Original link: https://arxiv.org/pdf/1703.04887.pdf
5. The last one is a paper put on arXiv on March 31: Improved Training of Wasserstein GANs. WGAN caused a sensation when it was released (Ian Goodfellow even commented on it on Reddit), and now NYU has followed up with this paper, so that WGAN can also show its power in NLP.
In WGAN, the improvements they propose are:
Remove the sigmoid from the last layer of the discriminator.
Do not take the log in the losses of the generator and the discriminator.
After each parameter update, clip the absolute values of the discriminator's weights to no more than a fixed constant c.
Do not use momentum-based optimization algorithms (including momentum and Adam); RMSProp or SGD is recommended.
The above points are quoted from the Zhihu column article "The Astounding Wasserstein GAN".
That article is written in plain language and is highly recommended.
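A minimal sketch of a critic update that follows these rules, assuming a PyTorch critic whose output is an unbounded scalar score (this is my own illustration, not the official code):

```python
import torch

def wgan_critic_step(critic, real, fake, optimizer, clip_c=0.01):
    """One WGAN critic update: no sigmoid/log in the loss, and weights are
    clipped to [-c, c] after the update."""
    loss = critic(fake.detach()).mean() - critic(real).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    for p in critic.parameters():
        p.data.clamp_(-clip_c, clip_c)

# A non-momentum optimizer such as RMSprop is recommended:
# optimizer = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
```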
The third point is weight clipping, which is also used in the machine translation paper above. In this new paper, the authors find that enforcing the Lipschitz constraint (needed to approximate the Wasserstein distance, which is hard to compute directly) via weight clipping is the culprit behind unstable training and the failure to capture complex probability distributions. They therefore propose applying a gradient penalty to the critic (i.e., D; the WGAN line of work calls D the critic) to enforce the Lipschitz constraint.
As shown in the figure, the loss function is the original part plus a gradient penalty. Weight clipping is no longer needed, and momentum-based optimizers can be used; they use Adam here. Batch normalization can also be dropped from the critic.
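A sketch of the gradient penalty term, assuming PyTorch and the paper's default lambda of 10 (my own illustration, not the authors' released code):

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP style penalty: sample points on lines between real and generated
    data and push the critic's gradient norm toward 1."""
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Critic loss: critic(fake).mean() - critic(real).mean() + gradient_penalty(...)
# With the penalty, weight clipping is unnecessary and Adam can be used.
```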
As the figures show, the experimental results are astonishing: this WGAN-GP setup trains more stably, converges faster, and produces higher-quality samples, and it can be used to train many different GAN architectures, even a 101-layer deep residual network.
It can also be used for generation tasks in NLP, here a character-level language model (whereas the MaliGAN experiments are at the sentence level). Moreover, papers 2, 3, and 4 above all make more or less use of MLE during adversarial training so that G gets more contact with the ground truth, but WGAN-GP needs no MLE component at all.
Original link: https://arxiv.org/pdf/1704.00028.pdf
GitHub Address: https://github.com/igul222/improved_wgan_training
Releasing the code together with the paper: the conscience of the field.
6. Also released on March 31 was BEGAN: Boundary Equilibrium Generative Adversarial Networks; carpedm20 has already written a PyTorch implementation of it, and he really does reproduce things fast...
Finally, GANs are progressing very quickly. The first and second authors of several of the important works mentioned above all seem to be on Zhihu; my deep respect to them.