Since it was proposed, the GAN has attracted wide attention, especially in computer vision, where it has made a great impact. "Deep interpretation: GAN models and their progress in 2016" [1] gives a detailed introduction to the progress of GANs over the past year and is highly recommended for beginners. This article mainly introduces applications of GANs in NLP (it can be read as paper interpretation or paper notes) and does not cover the basics of GANs (readers without that background are advised to read [1] first; being lazy, I will not repeat the basics here). I have not written an article in Chinese for a long time, so please bear with any shortcomings and feel free to offer corrections and advice.

Although GANs have achieved good results in image generation, they have not achieved comparably striking results in natural language processing (NLP) tasks. The reasons can be summed up as follows. The original GAN is designed for real-valued (continuous) data and does not work for generating discrete data such as text. Dr. Ian Goodfellow, the author of the GAN theory, answered the question this way: "GANs are not currently applied to natural language processing (NLP) because the original GANs are defined only for real-valued data. GANs generate synthetic data by training a generator and then run the discriminator on the synthetic data; the gradient of the discriminator's output tells you how to make the synthetic data more realistic by changing it slightly. In general, you can slightly change the synthetic data only when it is continuous; if the data is discrete, you cannot simply make a slight change. For example, if you output a picture with a pixel value of 1.0, you can change that value to 1.0001. If you output the word "penguin", you cannot change it to "penguin + .001", because there is no such word as "penguin + .001". Since all of NLP is based on discrete values such as words, letters, or syllables, it is very difficult to apply GANs to NLP. In general, reinforcement learning algorithms would be used. As far as I know, no one has really begun to study using reinforcement learning algorithms to solve NLP problems."
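The point of the quote can be illustrated with a small sketch. The three-word vocabulary and the scores below are made up for illustration: a tiny nudge to a continuous value changes the output, while the same nudge to the scores behind a discrete word choice leaves the chosen word unchanged, so the discriminator has nothing useful to report back.

```python
# Toy illustration only: hypothetical vocabulary and scores, not from any paper.
VOCAB = ["penguin", "ostrich", "sparrow"]

def pick_token(scores):
    """Return the word with the highest score (a discrete, argmax choice)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return VOCAB[best]

def nudge(values, index, eps=1e-3):
    """Return a copy of values with a tiny change applied at one position."""
    out = list(values)
    out[index] += eps
    return out

pixel = [1.0]
scores = [2.0, 1.5, 0.3]

nudged_pixel = nudge(pixel, 0)     # the continuous output moves a little
nudged_scores = nudge(scores, 1)   # the scores move, but the argmax does not

print(nudged_pixel[0] - pixel[0])                       # small non-zero change
print(pick_token(scores), pick_token(nudged_scores))    # the same word twice
```

The word only changes once a nudge is large enough to flip the argmax, and at that point the output jumps to a different word rather than moving smoothly.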

When generating text, a GAN models the entire text sequence. For a partially generated sequence, it is very difficult to judge what score it will receive once the full sequence has been generated.

Another potential challenge relates to the nature of RNNs (most text generation uses an RNN model). If we try to generate text from latent codes, the error accumulates exponentially with the length of the sentence. The first few words may be fairly reasonable, but sentence quality keeps getting worse as the sentence grows longer. In addition, the sentence length is itself determined by the random latent representation, so it is difficult to control.

Below I introduce and analyze some of the papers I have read recently on applying GANs to NLP:

1. Generating Text via Adversarial Training

Paper link: http://people.duke.edu/~yz196/pdf/textgan.pdf

This paper from the NIPS 2016 GAN workshop tries to apply GAN theory to the text generation task. The method is simple and can be summed up as follows: a recurrent neural network (LSTM) serves as the GAN generator, and a smooth approximation is used to approximate the output of the LSTM. The structure diagram is as follows:

The objective function of this paper differs from that of the original GAN: it adopts feature matching. The iterative optimization process consists of the following two steps:
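For orientation, the two kinds of objective involved can be written in generic notation. This is only a sketch under assumptions: the symbols (D for the discriminator, G for the generator, f for an intermediate feature layer of D) are mine, the first line is the standard GAN discriminator loss, and the second is the simplest mean-based form of feature matching. The paper's own equations may use a different, richer matching criterion and should be checked against the original.

```latex
% Generic forms only; the notation is assumed, not copied from the paper.
\begin{align}
  \mathcal{L}_D &= -\,\mathbb{E}_{s \sim p_{\text{data}}}\big[\log D(s)\big]
                   - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] \\
  \mathcal{L}_G &= \big\lVert\, \mathbb{E}_{s \sim p_{\text{data}}}\, f(s)
                   - \mathbb{E}_{z \sim p_z}\, f(G(z)) \,\big\rVert_2^2
\end{align}
```

With feature matching, the generator is no longer asked to fool the discriminator's final output directly; it is asked to make the feature statistics of generated sentences look like those of real sentences.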

Formula (6) is the objective function of the standard GAN, and formula (7) is the feature-matching objective. The initialization in this paper is very interesting, especially the pre-training of the discriminator: it is trained to tell an original sentence apart from a new sentence obtained by swapping the positions of two words in it. (During initialization, the discriminator is optimized with a pointwise classification loss.) This is interesting because swapping two words leaves the information in the input essentially unchanged; for example, most convolution computations will end up with exactly the same values. The generator is updated 5 times as often as the discriminator, which is exactly the opposite of the original GAN setting, because the LSTM has more parameters than the CNN and is harder to train. However, the generator (LSTM) still has an exposure bias problem in its decoding phase: during training, the predicted output is used in place of the actual output as the input for the next word.
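The word-swapping trick for pre-training the discriminator can be sketched as follows. The function name, the example sentence, and the labels are hypothetical; the paper's actual data pipeline is not described here.

```python
import random

def swap_two_words(sentence, rng):
    """Return a copy of the sentence with two distinct word positions swapped."""
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to swap
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

rng = random.Random(0)
real = "a cat sat on the mat"
fake = swap_two_words(real, rng)

# Pre-training pairs for the discriminator: label 1 = original, 0 = swapped.
pretrain_data = [(real, 1), (fake, 0)]

# Both sentences contain exactly the same words, only in a different order.
same_words = sorted(real.split()) == sorted(fake.split())
print(fake, same_words)
```

Because the two sentences share the same words, the discriminator cannot rely on which words appear and has to pay attention to word order.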

2. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient

Paper link: https://arxiv.org/pdf/1609.05473.pdf

Paper source code: LantaoYu/SeqGAN

The paper uses the discriminator's error as the reward for reinforcement learning, trains in a feed-forward manner, and updates the G network using the exploration mode of reinforcement learning.

Main content: this paper treats sequence generation as a sequential decision-making process, as shown in the following figure:

(a) The left part of the figure is GAN training step 1: the discriminator D is trained to distinguish real samples from generated (fake) samples. Here D is implemented with a CNN.

(b) The right part of the figure is GAN training step 2: the probability output by the discriminator D is fed back to the generator G, and G is updated by reinforcement learning. Here G is implemented with an LSTM.

(c) Because the G network is updated with reinforcement learning, the four elements of reinforcement learning (state, action, policy, reward) are defined as follows: the state is the tokens generated so far (the output of the LSTM decoder before the current timestep); the action is the next token to be generated (the word being decoded at the current step); the policy is the GAN's generator network G; and the reward is the probability output by the GAN's discriminator network D. The reward is approximated as follows:
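As far as I recall from the SeqGAN paper, the approximated reward (action value) has the form below. Here N is the number of Monte Carlo rollouts, T is the full sequence length, and G_beta denotes the rollout policy; the notation should be checked against the original paper.

```latex
Q_{D_\phi}^{G_\theta}(s = Y_{1:t-1},\, a = y_t) =
\begin{cases}
  \dfrac{1}{N} \displaystyle\sum_{n=1}^{N} D_\phi\big(Y^{n}_{1:T}\big),
    \quad Y^{n}_{1:T} \in \mathrm{MC}^{G_\beta}\big(Y_{1:t};\, N\big)
    & \text{for } t < T \\[8pt]
  D_\phi\big(Y_{1:t}\big) & \text{for } t = T
\end{cases}
```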

The key feature of this process is as follows: when decoding has reached step t, the remaining T - t timesteps are rolled out with Monte Carlo search to sample N paths. Each of these N paths, combined with the part already decoded, forms a complete output, and the average of the D network's rewards over the N complete outputs is used as the reward. When t = T there is no path left to explore, so the D network's reward for the fully decoded result is used directly.
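The rollout procedure described above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `sample_next` and `discriminator` are hypothetical stand-ins for the rollout policy and the D network, and the toy versions at the bottom exist only so the example runs.

```python
import random

def rollout_reward(prefix, total_len, n_rollouts, sample_next, discriminator, rng):
    """Estimate the reward of a partial sequence by Monte Carlo rollouts.

    prefix        : tokens decoded so far (length t)
    total_len     : full sequence length T
    sample_next   : hypothetical stand-in for the rollout policy
    discriminator : hypothetical stand-in for D, returns a score in [0, 1]
    """
    if len(prefix) == total_len:
        # t = T: nothing left to explore, score the finished sequence directly.
        return discriminator(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < total_len:       # complete the remaining T - t steps
            seq.append(sample_next(seq, rng))
        total += discriminator(seq)
    return total / n_rollouts             # average over the N completions

# Toy stand-ins (assumptions for illustration only).
def toy_policy(seq, rng):
    return rng.choice([0, 1])

def toy_discriminator(seq):
    return sum(seq) / len(seq)            # fraction of 1-tokens

rng = random.Random(0)
partial = rollout_reward([1, 1], total_len=4, n_rollouts=16,
                         sample_next=toy_policy,
                         discriminator=toy_discriminator, rng=rng)
full = rollout_reward([1, 1, 0, 1], total_len=4, n_rollouts=16,
                      sample_next=toy_policy,
                      discriminator=toy_discriminator, rng=rng)
print(partial, full)
```

The cost is visible in the loop: every intermediate step needs N full completions and N discriminator calls, which is why this reward estimate is slow.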

(d) For the RL part, this paper adopts the policy gradient method. According to the policy gradient theorem, the objective function of the generator G can be expressed as follows:

Its gradient is (see the appendix of the original paper for the detailed derivation):
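For reference, the generator objective and its gradient, as I recall them from the SeqGAN paper, are sketched below. R_T is the reward of a complete sequence, s_0 is the start state, and Q is the action value estimated by the discriminator; treat the exact notation as an assumption to be verified against the paper.

```latex
\begin{align}
  J(\theta) &= \mathbb{E}\big[R_T \mid s_0, \theta\big]
             = \sum_{y_1 \in \mathcal{Y}} G_\theta(y_1 \mid s_0)\,
               Q_{D_\phi}^{G_\theta}(s_0, y_1) \\
  \nabla_\theta J(\theta) &\simeq \sum_{t=1}^{T}
      \mathbb{E}_{y_t \sim G_\theta(y_t \mid Y_{1:t-1})}
      \Big[ \nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1})\,
            Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t) \Big]
\end{align}
```

The gradient only needs the log-probability of each sampled token and a scalar reward for it, so no gradient has to flow through the discrete sampling step.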

(e) At regular intervals, once more realistic sentences have been generated, the discriminator D is re-trained. The objective function of the discriminator is as follows:
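As I recall, this is the usual binary cross-entropy objective, with real sequences as positive examples and sequences sampled from the current generator as negative examples. The notation below is a sketch and should be checked against the paper.

```latex
\min_{\phi}\;
  -\,\mathbb{E}_{Y \sim p_{\text{data}}}\big[\log D_\phi(Y)\big]
  - \mathbb{E}_{Y \sim G_\theta}\big[\log\big(1 - D_\phi(Y)\big)\big]
```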

The algorithm structure diagram can be represented as follows:

Experiment

The experimental part is divided into synthetic data experiments and real-world data experiments.

(a) Synthetic data experiment: a randomly initialized LSTM is used as generator A to randomly generate the training data, which is then used to train the various generative models.

The evaluation criterion is the negative log-likelihood (cross entropy), NLL. For the detailed experimental settings, refer to the original paper.
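A minimal sketch of how such an NLL score could be computed, assuming we already have the probability that the reference model (generator A) assigns to each token of each generated sequence. The function names and the numbers are made up for illustration.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one sequence, given the probability the
    reference model assigned to each of its tokens (in order)."""
    return -sum(math.log(p) for p in token_probs)

def average_nll(batch_of_token_probs):
    """Mean NLL over a batch of generated sequences (lower is better)."""
    total = sum(sequence_nll(probs) for probs in batch_of_token_probs)
    return total / len(batch_of_token_probs)

# Made-up per-token probabilities for two generated sequences of length 3.
batch = [[0.5, 0.5, 0.5], [0.25, 0.5, 1.0]]
print(average_nll(batch))
```

A model whose samples receive high probability under the reference model gets a low NLL, which is what makes the synthetic setting convenient: the "true" distribution is known exactly.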

(b) Real-world data experiment: this part mainly shows results for Chinese poem generation, Obama speech generation, and music generation. The datasets are a Chinese poetry dataset (16,394), an Obama speech dataset (11,092 paragraphs), and the Nottingham music dataset (695 songs). The evaluation metric is the BLEU score, and the experimental results are as follows:

The paper does not show concrete samples of the generated poems and other outputs.

3. Adversarial Learning for Neural Dialogue Generation

Paper link: https://arxiv.org/pdf/1701.06547.pdf

Paper source code: jiweil/Neural-Dialogue-Generation

This paper was uploaded to arXiv on January 26, 2017, and is among the newest GAN-for-NLP papers. It applies adversarial training to open-domain dialogue generation. The task is cast as a reinforcement learning (RL) problem in which a generator and a discriminator are trained jointly. As in SeqGAN, the output of the discriminator D is used as the RL reward for the generator G, pushing G to generate dialogue that resembles human conversation. Overall, the idea is the same as SeqGAN's, but there are several differences and improvements:

(a) Because this paper targets open-domain dialogue generation, the generator is a seq2seq model (rather than a plain LSTM), and the discriminator is a hierarchical encoder (rather than a CNN).

(b) Two methods are used to compute the reward for fully or partially generated sequences. In addition to Monte Carlo search (as in SeqGAN), the paper proposes a new method for computing rewards on partially generated sequences. Training the discriminator on all fully and partially decoded sequences leads to overfitting, because early partial sequences appear in many training examples; for instance, the first generated token y_1 appears in every partially generated sequence. The paper therefore proposes randomly sampling only one subsequence from the positive sequence y+ and one from the negative sequence y- to train the discriminator D. This method is faster than Monte Carlo search, but it also makes the discriminator weaker and less accurate.
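The sampling idea in (b) can be sketched as follows. The function names and the example responses are hypothetical, and the sketch assumes that "subsequence" means a prefix of the response, which matches how partially decoded sequences arise.

```python
import random

def sample_prefix(tokens, rng):
    """Pick one random prefix (partially decoded sequence) of a full sequence."""
    cut = rng.randint(1, len(tokens))   # prefix length between 1 and len(tokens)
    return tokens[:cut]

def discriminator_batch(human_response, machine_response, rng):
    """One positive and one negative training example for the discriminator,
    each a single randomly chosen prefix rather than every prefix."""
    return [
        (sample_prefix(human_response, rng), 1),    # from the positive sequence y+
        (sample_prefix(machine_response, rng), 0),  # from the negative sequence y-
    ]

rng = random.Random(0)
y_pos = ["i", "am", "fine", "thanks"]
y_neg = ["i", "do", "not", "know"]
batch = discriminator_batch(y_pos, y_neg, rng)
print(batch)
```

Sampling a single prefix per sequence keeps short prefixes such as the first token from dominating the discriminator's training data.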

(c) In SeqGAN, the generator is rewarded or punished for its generated sequences only indirectly, through the reward produced by the discriminator, rather than receiving information directly from the gold-standard sequences. This kind of training is fragile: once the generator deteriorates in some training batch, the discriminator can very easily identify the generated sentences (for example, giving reward 0), and the generator gets lost. The generator only knows that the sentences it generated are bad, but not how to adjust them to make them better. To address this, the paper also feeds human-generated responses to the generator during its update. For these human-generated responses, the discriminator's reward can be set to 1, so the generator can still learn to produce good responses in such cases.

(d) The training process also uses a number of settings (tricks) specific to dialogue systems. For this part, the reader can refer to Jiwei Li's earlier papers on dialogue systems. Some experimental results:

Worth thinking about: this paper only tries using the discriminator's output as the reward. Would combining it with the other reward mechanisms from the author's earlier dialogue-system work (e.g., mutual information) improve the results?

4. GANs for Sequences of Discrete Elements with the Gumbel-softmax Distribution

Paper link: https://arxiv.org/pdf/1611.04051.pdf

Compared with the previous two papers, this paper handles the discrete-data problem in a relatively simple and blunt way. Discrete data (in one-hot representation) can generally be obtained by sampling from a multinomial distribution, for example from the output p = softmax(h) of a softmax function. Sampling y with probabilities p is then equivalent to y = one_hot(argmax_i(h_i + g_i)), where g_i follows a Gumbel distribution (with zero location and unit scale). However, one_hot(argmax(.)) is not differentiable. Unlike the original GAN, the authors propose approximating the expression above with y = softmax((h + g) / τ), where τ is a temperature parameter. This expression is differentiable. The algorithm structure is as follows:

The experimental part of this paper is rough: it only shows results on generating a context-free grammar and does not run experiments on generating other text data.
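The two expressions above can be compared side by side in a small sketch. It uses only the Python standard library; the logits h and the temperature values are made up, and this is an illustration of the sampling trick, not the paper's training code.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse CDF."""
    u = rng.random()
    while u == 0.0:            # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def softmax(xs):
    m = max(xs)                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_max_one_hot(h, g):
    """Exact discrete sample: one_hot(argmax_i(h_i + g_i)); not differentiable."""
    scores = [hi + gi for hi, gi in zip(h, g)]
    k = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == k else 0.0 for i in range(len(scores))]

def gumbel_softmax(h, g, tau):
    """Smooth relaxation: softmax((h + g) / tau); sharper as tau gets smaller."""
    return softmax([(hi + gi) / tau for hi, gi in zip(h, g)])

rng = random.Random(0)
h = [1.0, 2.0, 0.5]                       # made-up logits
g = [sample_gumbel(rng) for _ in h]       # one Gumbel noise draw per entry

hard = gumbel_max_one_hot(h, g)
soft_warm = gumbel_softmax(h, g, tau=1.0)
soft_cold = gumbel_softmax(h, g, tau=0.05)
print(hard, soft_warm, soft_cold)
```

With the same noise g, lowering the temperature pushes the soft vector toward the one-hot sample, while every entry remains a smooth function of h, which is what lets gradients reach the generator.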

In general, the paper itself leaves room for improvement, but its idea is worth borrowing.

5. Connecting Generative Adversarial Networks and Actor-Critic Methods

Paper link: https://arxiv.org/pdf/1610.01945.pdf

Actor-critic methods [2]: many RL methods (e.g., policy gradient) work only with a policy or only with a value function. The actor-critic method combines the policy-only and value-function-only approaches. The critic approximates or estimates the value function, while the actor is the policy structure, mainly used to select actions. Actor-critic is an on-policy learning process, in which the critic's output is used to help improve the actor's policy.
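To make the division of labor concrete, here is a deliberately tiny sketch: a single-state problem with two actions (a bandit), where the actor is a softmax policy over action preferences and the critic is one scalar value estimate. The rewards, learning rates, and function names are assumptions for illustration; real actor-critic methods use state-dependent value functions.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_actor_critic(rewards, steps, actor_lr, critic_lr, rng):
    """Single-state actor-critic on a toy bandit.

    actor  : action preferences, turned into a policy by softmax
    critic : one scalar estimate of the expected reward (the state value)
    """
    prefs = [0.0] * len(rewards)
    value = 0.0
    for _ in range(steps):
        policy = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=policy)[0]
        reward = rewards[action]
        error = reward - value              # critic's evaluation of the action
        value += critic_lr * error          # critic update
        for a in range(len(prefs)):         # actor update (policy gradient step)
            grad_log = (1.0 if a == action else 0.0) - policy[a]
            prefs[a] += actor_lr * error * grad_log
    return softmax(prefs), value

rng = random.Random(0)
policy, value = train_actor_critic(rewards=[1.0, 0.0], steps=2000,
                                   actor_lr=0.1, critic_lr=0.1, rng=rng)
print(policy, value)
```

The actor never sees the reward on its own; it is always updated through the critic's error signal, much as a GAN generator only learns through the discriminator.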

There are many similarities between GANs and actor-critic. The actor in the actor-critic model plays a role similar to the generator in a GAN: both take an action or generate a sample. The critic in the actor-critic model is similar to the discriminator in a GAN: both are mainly used to evaluate the output of the actor or generator. For the specific similarities and differences, interested readers can study the original paper carefully.

The main contribution of this paper is to examine the similarities and differences between GANs and actor-critic models from several perspectives, in order to encourage researchers working on GANs and researchers working on actor-critic models to collaborate on general, stable, and scalable algorithms, or to draw inspiration from each other's work.

Recently, Bahdanau and other leading researchers proposed using the actor-critic model for sequence prediction [3]. Although [3] does not use GANs, it may still offer some inspiration: with similar ideas, GANs may also be able to achieve better results in sequence prediction.

[1] Deep interpretation: GAN models and their progress in 2016

[2] Actor-Critic Algorithms

[3] An Actor-Critic Algorithm for Sequence Prediction

Original address: https://zhuanlan.zhihu.com/p/25168509