NIPS 2016 | Best Paper, Dual Learning, Review Networks, VQA, and Other Selected Papers
Original · 2016-12-12 · Xiao S · Chengxuyuan de Richang
The most watched academic event of the past week was NIPS 2016 in beautiful Barcelona. Every year, NIPS brings very heavyweight tutorials and papers. Today we recommend and share the following:
Value Iteration Networks (NIPS 2016 Best Paper)
Dual Learning for Machine Translation (NIPS 2016)
Review Networks for Caption Generation (NIPS 2016)
Visual Question Answering with Question Representation Update (QRU) (NIPS 2016)
Gated-Attention Readers for Text Comprehension (ICLR 2017 submission)
VIN
Value Iteration Networks (NIPS 2016 Best Paper)
As this year's NIPS 2016 Best Paper winner, it has probably already crossed everyone's screen in various reposts. The award is well deserved, and the idea behind it is ingenious. The main motivation comes from two observations. The first is that planning should be an important part of a policy, so the ability to plan should be part of the policy representation. The second, and the most ingenious, is that the classic value-iteration formula maps almost perfectly onto a convnet.
The benefit of the first observation, in the authors' view, is that adding planning can improve the generalization ability of the RL model. To this end, they insert a planning module, the Value Iteration Network (VIN), into the model between the observation and the reactive policy, as in the following figure:
The second observation is the mapping between the value-iteration (VI) formula and a convnet. The classic VI update is this:
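For reference, the standard value-iteration update, with reward R(s,a), discount factor γ, and transition probabilities P(s'|s,a), is:

$$V_{k+1}(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s') \Big]$$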
Accordingly, the VI module in this paper takes R(s,a) as the input of a convnet; the authors call R(s,a) the "reward image", and it becomes a multi-channel input image for the convnet. Then, as in this figure:
With R(s,a) as the convnet input, the discounted transition probabilities P play the role of the convolution kernel, and the max over actions (max_a Q) corresponds to max pooling over the channel dimension. Finally, stacking the module and feeding its output back K times realizes the K-step recurrence, which is exactly value iteration. A slide from the authors summarizes it:
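As a rough illustration of this mapping (a minimal sketch, not the authors' implementation; the layer sizes and names such as `n_actions` and `k_iterations` are assumptions), the VI module can be written as a convolution over the reward image followed by a max over the action channels, repeated K times:

```python
import torch
import torch.nn as nn

class VIModule(nn.Module):
    """Minimal sketch of a VI module: the convolution plays the role of
    R(s,a) + gamma * sum_s' P(s'|s,a) V(s'), and the max over action
    channels plays the role of V(s) = max_a Q(s,a)."""

    def __init__(self, n_actions=8, k_iterations=10):
        super().__init__()
        self.k_iterations = k_iterations
        # One 3x3 kernel per action; input channels are [reward image, current value map].
        self.q_conv = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)

    def forward(self, reward_image):
        # reward_image: (batch, 1, height, width)
        value = torch.zeros_like(reward_image)
        for _ in range(self.k_iterations):
            q = self.q_conv(torch.cat([reward_image, value], dim=1))  # Q(s, a) maps
            value, _ = torch.max(q, dim=1, keepdim=True)              # V(s) = max_a Q(s, a)
        return value
```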
In the slides, the authors also mention that in many cases we only need part of the policy representation (planning plus observation) to produce an action, so they also introduce attention to improve efficiency:
Dual-NMT
Dual Learning for Machine Translation (NIPS 2016)
A paper that MSRA publicized vigorously. The idea behind it is very straightforward: treat machine translation as two agents, agent A and agent B, teaching each other their languages. The assumption is that agent A only understands its own language A, and agent B only understands its own language B. Agent A says a sentence x_a; after passing through an A->B (weak) MT model, which is really a noisy channel, it becomes x_a'. At this point agent B has x_a' but does not actually know what agent A intended to say (the semantics); it can only use its mastery of language B to judge whether x_a' is a legitimate sentence of language B (the grammar). Agent B then "translates" the sentence back to agent A through another noisy channel, so that agent A can evaluate the reconstruction quality by comparing the returned x_a'' with the original sentence x_a.
To run actual RL under this framework, we have two large monolingual corpora, A and B, which do not need to be aligned. We also have two weak MT models, A->B and B->A, and two good language models, LM_A and LM_B; since training a language model only needs monolingual data, the LMs are easy to obtain. Then, as just described, for x_a -> x_a', agent B can give x_a' a reward, namely LM_B(x_a'). And for x_a' -> x_a'', agent A can give a reward for the reconstruction quality. These two rewards are combined (e.g., linearly), and the whole thing can then be optimized with policy gradient.
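A minimal sketch of this reward combination, assuming a language model `lm_b` and a backward translation model `backward_model` with the illustrative interfaces shown below (not the paper's actual code):

```python
def dual_reward(x_a, x_b, lm_b, backward_model, alpha=0.5):
    """Combined reward for a sampled translation x_b of the source sentence x_a.

    lm_b(x_b)                         -> log-probability of x_b under language model LM_B
    backward_model.log_prob(x_a, x_b) -> log P(x_a | x_b) under the B->A model
    (these interfaces are illustrative assumptions, not the paper's API)
    """
    r_language = lm_b(x_b)                                # fluency of x_b in language B
    r_reconstruction = backward_model.log_prob(x_a, x_b)  # how well x_a can be recovered
    return alpha * r_language + (1.0 - alpha) * r_reconstruction
```

The combined reward is then plugged into a policy-gradient update for the A->B model, and symmetrically for B->A.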
Finally, take a look at some of the experimental results:
According to the authors, such dual tasks are plentiful: "Actually, many AI tasks are naturally in dual form, for example, speech recognition versus text to speech, image caption versus image generation, question answering versus question generation (e.g., Jeopardy!), search (matching queries to documents) versus keyword extraction (extracting keywords/queries for documents), so on and so forth." But I think this point is debatable.
At the same time, according to the authors, this setting is not limited to dual tasks and does not require exactly two agents; the key is to find a closed loop: "Actually, our key idea is to form a closed loop so that we can extract feedback signals by comparing the original input data with the final output data. Therefore, if more than two associated tasks can form a closed loop, we can apply our technology to improve the model in each task from unlabeled data." In other words, the key is to find a transitive process so that the reward can be passed along the loop, rather than getting fixed or blocked at some point.
More broadly, the idea of applying reconstruction to various NLP tasks is common, and using dual learning to model reconstruction is a very clever and beautiful piece of work. In MT alone there are also two recent papers: "Neural Machine Translation with Reconstruction" from Noah's Ark Lab, and "Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation" from Google. In other tasks, such as response retrieval and generation, there is also work that adds a reconstruction loss as an extra objective, combined linearly; the intuition is simple: if you want a machine to learn to speak for itself, first let it be a parrot. Reconstruction losses are likewise not uncommon in summarization and other tasks. There is a lot here we can dig into ourselves.
Review Networks
Review Networks for Caption Generation (NIPS 2016)
This paper comes from Professor Ruslan Salakhutdinov's group, which has long had deep insight into attention and generative models. When soft attention first took off, they already had an attention algorithm combining hard and soft attention with wake-sleep style training. This Review Networks paper at NIPS 2016 is again a work that improves attention, and the improvement fits both textual attention in NLP and visual attention in vision.
Specifically, in the classic attention-based seq2seq model, attention is used by the decoder: we take the encoder outputs and, through attention, obtain some representation. In soft attention this representation is typically a weighted sum, so it is often called a summarization of the encoder input. This work argues that such a weighted sum is rather local, focusing attention on relatively local information, and they want to strengthen the focus on global information.
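Concretely, at decoding step t the classic soft-attention summarization is a weighted sum of the encoder hidden states h_i (standard notation, not taken from the paper):

$$c_t = \sum_i \alpha_{t,i}\, h_i, \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \qquad e_{t,i} = \mathrm{score}(s_{t-1}, h_i)$$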
To this end, the authors add a review module, the "review network" of the title, and under this formulation the classic attention described above becomes a special case of their framework.
The Review Network is shown in the figure above. Contrasting the left and right sides makes the mechanism easy to understand: it is equivalent to replacing the original attention part with an LSTM network to obtain a more compact and global attention; the authors call these attention-encoded representations "facts". The results of the Review Network on image captioning look pretty good:
The full name of "Disc Sup" here is discriminative supervision, which the authors see as another benefit of the review network: the facts can be used to discriminatively predict whether words appear in the caption. Through a multi-task learning framework, this discriminative supervision helps improve training.
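A minimal sketch of the reviewer idea, assuming an LSTM cell that runs a fixed number of review steps, each attending over all encoder states, and collecting its hidden states as the fact vectors (all names and dimensions here are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reviewer(nn.Module):
    """Sketch of a review module: run T review steps of an LSTM cell, each step
    attending over all encoder states, and stack the hidden states as 'facts'."""

    def __init__(self, hidden_size=256, n_review_steps=8):
        super().__init__()
        self.n_review_steps = n_review_steps
        self.cell = nn.LSTMCell(hidden_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, 1)

    def forward(self, encoder_states):
        # encoder_states: (batch, seq_len, hidden)
        batch, seq_len, hidden = encoder_states.size()
        h = encoder_states.new_zeros(batch, hidden)
        c = encoder_states.new_zeros(batch, hidden)
        facts = []
        for _ in range(self.n_review_steps):
            # attention of the current review state over all encoder states
            scores = self.attn(torch.cat(
                [encoder_states, h.unsqueeze(1).expand_as(encoder_states)], dim=-1)).squeeze(-1)
            weights = F.softmax(scores, dim=-1)                          # (batch, seq_len)
            context = (weights.unsqueeze(-1) * encoder_states).sum(dim=1)
            h, c = self.cell(context, (h, c))
            facts.append(h)
        return torch.stack(facts, dim=1)  # (batch, n_review_steps, hidden)
```

The decoder then attends over these fact vectors instead of attending directly over the raw encoder states.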
QRU
Visual Question Answering with Question Representation Update (QRU) (NIPS 2016)
The VQA task is very hot right now, but most approaches improve the image side. This paper instead makes its change on the text side, i.e., the question end. Specifically, it repeatedly updates the question representation according to image proposals, effectively fusing image information into the text (the question). An earlier ECCV 2016 submission, "A Focused Dynamic Attention Model for Visual Question Answering", explicitly framed this as multimodal representation fusion. The following figure says it better than a thousand words:
Similar ideas in fact have many variants; in different fields and different tasks one can find similar figures. For example, in the NLP reading comprehension task, the document representation is repeatedly updated based on attention over the question, so that it becomes more "inclined" toward the question. This update is typically done with element-wise multiplication, and the result is that the document representation is biased toward the question representation, making it easier to locate the answer to the question in the document (i.e., reading comprehension). For a concrete example, see "Gated-Attention Readers for Text Comprehension (ICLR 2017 submission)", also work from Ruslan Salakhutdinov's group; it appeared on arXiv early, the ICLR submission version has improved writing, and the related-work section is also worth reading. The model looks complex at first glance but is easy to understand:
For the reading comprehension task, this work amounts to repeatedly updating the document representation, conditioned on the question embedding.
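A minimal sketch of this multiplicative, gated-attention style update, where each document token representation is gated by a question vector obtained via attention (all names and shapes here are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def gated_attention_update(doc_states, question_states):
    """Update document token representations by element-wise multiplication
    with per-token attended question representations.

    doc_states:      (batch, doc_len, hidden)
    question_states: (batch, q_len, hidden)
    returns:         (batch, doc_len, hidden) updated document representation
    """
    # attention of every document token over the question tokens
    scores = torch.bmm(doc_states, question_states.transpose(1, 2))  # (batch, doc_len, q_len)
    weights = F.softmax(scores, dim=-1)
    q_per_token = torch.bmm(weights, question_states)                # (batch, doc_len, hidden)
    # multiplicative gating: bias each document token toward the question
    return doc_states * q_per_token
```

QRU's question update is the mirror image of this: there the question representation is the one being updated, gated by attended image-proposal features.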
That is all for today's sharing; we welcome everyone to discuss with us, and we'll see you next time. (We'll try not to delay the next issue...)