ICLR 2017 | Attention and Memory Networks

Original post, 2016-11-09, by Small S.


Today I am sharing papers from ICLR 2017 on the theme of Attention and Memory. From 2014 to 2016 these were among the hottest neural-network mechanisms and architectures, and they have lifted performance on many vision and NLP tasks to a large degree. Attention in particular has become part of the new state of the art: a plain neural network without attention can hardly compete with attention-based models.


At the same time, attention and memory are closely related, and each has its own shortcomings. ICLR 2017 contains many discussions and improvements of both, so the papers shared today are the following (still only a part; the rest will come in a later post):

1. Structured Attention Networks

2. Hierarchical Memory Networks

3. Generating Long and Diverse Responses with Neural Conversation Models

4. Sequence to Sequence Transduction with Hard Monotonic Attention

5. Memory-augmented Attention Modelling for Videos



The first paper [1] is today's top recommendation. It unifies the classic attention mechanism and the authors' two new attention layers in a single framework, turning attention from an ordinary soft annotation into a mechanism that can model structural information internally without breaking end-to-end training.


Concretely, in the classic attention mechanism the encoder input is x_1, ..., x_n, which the authors call x, and the decoder side holds the sequence y_1, ..., y_n generated so far, from which comes what the authors call q (the query). The attention mechanism can then be seen as a distribution over attention positions conditioned on x and q, as follows:
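Written out as a hedged reconstruction of the standard formulation (here f(x, z) = x_z is assumed, i.e., attending to a single input position), the distribution and the resulting context are presumably:

```latex
% Hedged reconstruction of classic (soft) attention in this framing:
% z is a latent attention position, x = x_1, ..., x_n the inputs, q the query.
\[
p(z = i \mid x, q) = \operatorname{softmax}\bigl(\theta_i(x, q)\bigr), \qquad
c = \mathbb{E}_{z \sim p(z \mid x, q)}\bigl[f(x, z)\bigr]
  = \sum_{i=1}^{n} p(z = i \mid x, q)\, x_i
\]
```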



With such a framework, we can turn the originally independent z into correlated, dependent variables, so that the distribution over z carries structural information; this is what makes it a structured attention network. To achieve this, the authors model the distribution with a CRF and give two concrete examples.


The first example is the segmentation attention layer, which can be used to select subsequences of the source sentence (rather than single words, as in classic attention). Its design is simple and intuitive: z is treated like a gate and becomes a vector z = [z_1, ..., z_n] with z_i ∈ {0, 1}. The annotation (context) from before then takes the following form:
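A hedged write-out of that expression, assuming the context is the expectation of the gated sum (the usual way the segmentation layer is presented):

```latex
% Each gate z_i \in {0, 1} decides whether position i contributes to the context c.
\[
c = \mathbb{E}_{z \sim p(z \mid x, q)}\Bigl[\sum_{i=1}^{n} z_i\, x_i\Bigr]
  = \sum_{i=1}^{n} p(z_i = 1 \mid x, q)\, x_i
\]
```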


This expression allows several z_i to be switched on at the same time, and the dependency between z_i and z_j is captured by the p(z_i | x, q) terms, which are modeled jointly as follows:
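A hedged sketch of that joint model, assuming a linear-chain CRF with pairwise potentials (the potentials θ_i below are my notation, not necessarily the paper's):

```latex
% Neighboring gates are coupled through pairwise potentials, so p(z | x, q)
% is a linear-chain CRF rather than a product of independent Bernoullis.
\[
p(z_1, \ldots, z_n \mid x, q) \;\propto\;
\exp\Bigl(\sum_{i=1}^{n} \theta_i(z_{i-1}, z_i \,;\, x, q)\Bigr)
\]
% The marginals p(z_i = 1 | x, q) used above can be computed exactly with the
% forward-backward algorithm, which keeps the whole layer differentiable.
```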


The second example goes one step further: the segmentation layer selects subsequences, but those are still contiguous, so the authors propose the syntactic attention layer, whose motivation is to model the syntactic tree structure of the sentence directly. This time z_i becomes z_{ij}, which represents a parent-child pair in the syntax tree. Similarly, we get the following modeling expression:
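One plausible form of that expression (hedged; the paper's annotation may combine soft-parent and soft-child terms), using the edge marginals of a latent tree to build a per-word context:

```latex
% z_{ij} = 1 indicates that word i is the parent (head) of word j in a latent
% syntax tree; the CRF now places probability mass over valid trees.
\[
c_j = \sum_{i=1}^{n} p(z_{ij} = 1 \mid x, q)\, x_i
\]
% The edge marginals can be obtained with the inside-outside algorithm,
% so training remains end-to-end differentiable.
```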


With these two new attention layers, the authors obtain better results than the classic attention layer on a number of tasks.




The second paper [2] turns Memory Networks into a trade-off between soft attention and hard attention. As we know, the attention mechanism essentially gives the decoder a way to access the input: it tells the decoder how and where to access it, which is really a read operation. Often we also want a write operation, so that some intermediate results can be updated while decoding. This is the main difference between attention and memory, and it is also why the two can be placed in a unified framework.
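To make the read/write distinction concrete, here is a toy sketch (not from paper [2]; the erase/add form of the write is borrowed from NTM-style memories and is only illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read(memory, query):
    # Soft-attention read: score each slot against the query, then take the
    # weighted sum of slots (this is exactly the usual attention step).
    scores = memory @ query              # (num_slots,)
    weights = softmax(scores)            # distribution over slots
    return weights @ memory, weights     # context vector, read weights

def write(memory, weights, erase, add):
    # Erase/add-style write, shown only to illustrate what a "write" means;
    # this particular form comes from NTM-like memories, not from paper [2].
    memory = memory * (1.0 - np.outer(weights, erase))
    return memory + np.outer(weights, add)

# Toy usage: 5 memory slots of dimension 8, a query from the decoder.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 8))
q = rng.normal(size=8)
context, w = read(M, q)
M = write(M, w, erase=np.full(8, 0.1), add=0.1 * q)
```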


Soft attention, however, is expensive to compute, while hard attention is not stable enough. Taking the strengths of both, the authors of this second paper [2] propose Hierarchical Memory Networks (HMN). HMN uses a hierarchical structure to reduce the amount of computation each soft-attention read requires. How is this hierarchical structure "found"? They use a method called Maximum Inner Product Search (MIPS): it not only organizes the memory into this hierarchical form, but also finds the memory subset most relevant to the query by computing the largest inner products. Because exact MIPS is itself expensive, they further propose several approximate versions (a toy sketch of the hierarchical read is given below). Finally, they experiment on the SimpleQuestions (SQ) dataset, with the following results:
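The sketch below groups the memory into fixed-size blocks summarized by their mean vectors; this grouping and scoring scheme is an assumption for illustration, not the paper's exact K-MIPS construction:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_read(memory, query, block_size=64, top_blocks=2):
    # Group the memory into fixed-size blocks, each summarized by its mean
    # vector; keep only the blocks whose summary scores highest against the
    # query (the MIPS step), then run soft attention inside that small subset.
    num_slots, dim = memory.shape
    blocks = memory.reshape(-1, block_size, dim)   # (num_blocks, block_size, dim)
    summaries = blocks.mean(axis=1)                # one key per block
    best = np.argsort(summaries @ query)[-top_blocks:]
    candidates = blocks[best].reshape(-1, dim)     # small candidate set
    weights = softmax(candidates @ query)
    return weights @ candidates

# Toy usage: 4096 slots, but soft attention touches only 2 * 64 = 128 of them.
rng = np.random.default_rng(0)
M = rng.normal(size=(4096, 32))
q = rng.normal(size=32)
context = hierarchical_read(M, q)
```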





The third paper [3] is also one I recommend. It comes from Google and the Google Brain team. At first glance the abstract seems to contain no new technique (research point), but the paper's analysis of the problems with current sequence-to-sequence conversation models, and of the reasons behind them, is very clear. Although the final solutions lean toward engineering, that does not keep this from being good work worth trying. Moreover, in this paper [3] the authors also modify the attention mechanism so that attention, which previously depended only on the input (source side), can fuse in information from the generated output (target side); it thus becomes target-side attention.


Since today's focus is attention and memory, I will leave the paper's other points aside and only talk about this target-side attention. The authors point out that in the classic attention mechanism, the context c computed at each step contains only information from the source-side encoder, so the model can attend only over those candidates (for example, the input words). In machine translation this is fine, since the source sentence already contains all the information we want to produce, but in dialogue it becomes a problem: very often the source sentence, i.e. the user's utterance, is short and not informative enough, and then the output the decoder has already produced may help us more. The straightforward solution is to throw the sequence already generated by the decoder into the attention candidate pool, like this:
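A minimal sketch of that idea (the dot-product scoring and the flat candidate pool are assumptions for illustration; the paper's exact parameterization may differ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def target_side_attention(encoder_states, decoder_states, query):
    # Besides the encoder states, the decoder states produced so far are also
    # thrown into the attention candidate pool, so the context can attend to
    # what has already been generated.
    candidates = np.concatenate([encoder_states, decoder_states], axis=0)
    weights = softmax(candidates @ query)   # one distribution over source + target positions
    return weights @ candidates

# Toy usage: a short source utterance plus three already-generated target steps.
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 16))   # encoder states (short, uninformative user utterance)
tgt = rng.normal(size=(3, 16))   # decoder states generated so far
q = rng.normal(size=16)          # current decoder query
context = target_side_attention(src, tgt, q)
```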


What is the benefit? The decoder hidden state no longer needs to "remember" as much of the already-generated information, so it can be devoted more fully to modeling and expressing the overall semantics.



The next paper [4] comes from Yoav Goldberg, whose NLP work is reliably solid, so I read it as soon as I found it. The attention mechanism designed in [4] is similar in spirit to the third paper [3]: for example, the output already generated on the decoder side is also taken into account. What is more distinctive is that it incorporates a hard-attention idea: during decoding, the decoder does not produce an output at every step; instead, a gate-like control decides whether to output or to move the attention position on the encoder side. What advantage does such a mechanism have over classic soft attention? Soft attention relies on the training data to learn alignments automatically, but often a small corpus simply does not contain enough data pairs to produce good alignments; in that setting hard attention can play a better role.


Concretely, their main modification is the control mechanism. Instead of decoder output being the only action, the model adds a second action that drives hard attention on the encoder side:


So, during decoding the model alternates between the two actions, step and write (not in a strict alternation): only when the current action is write does it emit an output; when the action is step, the hard attention head on the encoder side moves forward one position and a new encoded representation is computed:
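A toy sketch of that control loop; choose_action and emit_symbol stand in for the learned action classifier and output generator (hypothetical interfaces), and the scripted policies below are made up only to mimic the legte -> lege example:

```python
STEP, WRITE = 0, 1

def hard_monotonic_decode(encoded, choose_action, emit_symbol, max_steps=100):
    # A hard attention head starts at the first encoder position; at every
    # iteration the model either WRITEs an output symbol or STEPs the head
    # one position to the right (monotonic movement only).
    outputs, head = [], 0
    for _ in range(max_steps):
        action = choose_action(encoded[head], outputs)
        if action == WRITE:
            symbol = emit_symbol(encoded[head], outputs)
            if symbol == "</s>":          # end-of-sequence symbol stops decoding
                break
            outputs.append(symbol)
        else:                             # STEP: advance the hard attention head
            head = min(head + 1, len(encoded) - 1)
    return outputs

# Toy usage with scripted (hypothetical) policies mimicking legte -> lege.
encoded = list("legte")
actions = iter([WRITE, STEP, WRITE, STEP, WRITE, STEP, STEP, WRITE, WRITE])
symbols = iter(["l", "e", "g", "e", "</s>"])
print(hard_monotonic_decode(encoded,
                            lambda h, outs: next(actions),
                            lambda h, outs: next(symbols)))
# -> ['l', 'e', 'g', 'e']
```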


Finally, to compare the behavior of this hard attention mechanism with the classic soft attention mechanism, the authors [4] also draw the following figure:


You can see how legte -> lege and flog -> fliege are learned under the hard attention mechanism (i.e., the model proposed in [4]).



The last paper recommended today [5], as its title suggests, puts attention and memory together. Personally, though, I think its main contribution is similar to paper [3]: it points out that the decoder has to consider attention over the encoder side while also handling its own language-model generation, and these two are hard to balance. Worse still, because they [5] do video caption generation, having multiple frames, each with different attention locations within it, makes the decoder's job even harder.


Therefore, they not only propose a Temporal Modeler (TEM) to select, within each frame, the attention locations, but also add a module that sits between the TEM and the decoder's language modeling, which they call Hierarchical Attention/Memory (HAM):
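A rough sketch of how the two pieces could fit together (the structure below is an assumption for illustration, not the paper's exact equations): the TEM attends over spatial locations inside each frame to build a temporal memory, and the HAM's f_m runs multi-layer attention over that memory, conditioned on the decoder state:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(candidates, query):
    # Plain soft attention: weighted sum of candidates given a query.
    weights = softmax(candidates @ query)
    return weights @ candidates

def tem_ham_sketch(frames, decoder_state, num_layers=2):
    # TEM: one attended vector per frame (attention over spatial locations).
    memory = np.stack([attend(frame, decoder_state) for frame in frames])
    # HAM (f_m): repeated (multi-layer) attention over the temporal memory,
    # producing one balanced context vector for the decoder.
    context = decoder_state
    for _ in range(num_layers):
        context = attend(memory, context)
    return context

# Toy usage: 8 frames, each with 49 spatial locations of dimension 32.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 49, 32))
h = rng.normal(size=32)
context = tem_ham_sketch(list(video), h)
```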


Here f_m can implement multi-layer attention, and its output is then passed on to the decoder as a balanced attention signal. The final results:





Still quite tired... to be continued tomorrow.


