Paper notes: Deep Attention Recurrent Q-Network


  

Deep Attention Recurrent Q-Network

5vision group

Abstract: This paper introduces an attention mechanism into DQN, which makes learning more focused and the agent's behavior more interpretable. (A while ago I planned to do exactly this piece of work myself; who would have thought these folks would get it done so quickly. Embarrassing (⊙o⊙))

Introduction: Recall that DQN feeds a stack of 4 consecutive video frames into a CNN. Although this achieves good results, the agent can only remember information from those 4 frames; anything earlier is forgotten. Researchers therefore proposed the Deep Recurrent Q-Network (DRQN), a combination of LSTM and DQN, in which:

1. The fully connected layer of DQN is replaced with an LSTM layer;

2. Only the last visual frame at each time step is used as the network's input.

The authors point out that although only one frame is fed in at each step, DRQN still captures relevant information across frames. Nonetheless, it shows no systematic improvement over DQN on the Atari games.

Another drawback is the long training time: reportedly 12-14 days on a single GPU. Parallel versions of the algorithm have been proposed to speed up training, but the authors argue that parallelization is not the only, and perhaps not the most efficient, way to address this problem.

  

Recently, visual attention models have achieved impressive results on a range of tasks. The advantage of this mechanism is that the network only needs to select and attend to a smaller region of the image, which reduces the number of parameters and speeds up both training and testing. In contrast to DRQN, the LSTM here stores information used not only for choosing the next action, but also for selecting the next attention region. Besides the computational gains, attention-based models also make deep Q-learning more interpretable, giving the researcher an opportunity to observe where the agent focuses its attention and on what.

  

Deep Attention Recurrent Q-Network:

As shown in the paper's figure, the DARQN architecture consists of three types of networks: convolutional (CNN), attention, and recurrent. At each time step $t$, the CNN receives a representation of the current game state $s_t$ and produces from it a set of $D$ feature maps, each of dimension $M \times M$. The attention network converts these maps into a set of vectors $v_t = \{v_t^1, \dots, v_t^L\}$, $L = M \cdot M$, and outputs their linear combination $z_t$, called a context vector. The recurrent network, here an LSTM, takes the context vector as input, along with the previous hidden state $h_{t-1}$ and memory state $c_{t-1}$, and produces the hidden state $h_t$, which is used by:

1. A linear layer for evaluating the Q-value of each action $a_t$ the agent can take in state $s_t$;

2. The attention network for generating the context vector at the next time step $t+1$ (a minimal code sketch of this pipeline follows the list).
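To make the data flow concrete, here is a minimal PyTorch sketch of the three components described above. The layer sizes, module names, and the `DARQN` class itself are my own illustrative assumptions, not the authors' exact configuration (their released code is in Torch/Lua):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DARQN(nn.Module):
    """Illustrative sketch: CNN -> soft attention -> LSTM -> Q-values."""

    def __init__(self, num_actions, d=256, m=7, hidden=256):
        super().__init__()
        self.d, self.L = d, m * m
        # CNN: one 84x84 grayscale frame -> D feature maps of size M x M (sizes assumed)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, d, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Attention network g: scores each location vector v_t^i given h_{t-1}
        self.att_v = nn.Linear(d, hidden)
        self.att_h = nn.Linear(hidden, hidden)
        self.att_out = nn.Linear(hidden, 1)
        # Recurrent network: LSTM cell over context vectors
        self.lstm = nn.LSTMCell(d, hidden)
        # Linear layer producing Q-values from the hidden state
        self.q = nn.Linear(hidden, num_actions)

    def forward(self, frame, h, c):
        feats = self.cnn(frame)                          # (B, D, M, M)
        B = feats.size(0)
        v = feats.view(B, self.d, -1).transpose(1, 2)    # (B, L, D) location vectors
        # Soft attention weights g_t^i: softmax over the L locations
        scores = self.att_out(torch.tanh(self.att_v(v) + self.att_h(h).unsqueeze(1)))
        g = F.softmax(scores, dim=1)                     # (B, L, 1)
        z = (g * v).sum(dim=1)                           # context vector z_t, (B, D)
        h, c = self.lstm(z, (h, c))
        return self.q(h), h, c                           # Q-values and new LSTM state
```

At play time this would be called once per frame, carrying the LSTM state across steps, e.g. `q_values, h, c = model(frame, h, c)` with `h` and `c` initialized to zeros at the start of an episode.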

Soft Attention:

The "soft" Attention mechanism mentioned in this section assumes that the context vector $z _t$ can be represented as a weighted sum of all vectors $v _t^i$, each corresponding to a CNN feature extracted from different areas of the image. Weights are proportional to the importance of this vector and are measured by Attention network G. The G network consists of two FC layers followed by a softmax layer. Its output can be expressed as:

where $Z$ is a normalizing constant, $W$ is a weight matrix, and $\mathrm{Linear}(x) = Ax + b$ is an affine transformation with weight matrix $A$ and bias $b$. Once the importance of each location vector is defined, we can calculate the context vector as:
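Given the "weighted sum" description above, the context vector is simply:

$$ z_t = \sum_{i=1}^{L} g_t^i \, v_t^i $$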

The remaining networks are described in detail in the third section of the paper. The entire DARQN model is trained by minimizing a sequential loss function:
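The loss is not written out in these notes; in the standard DQN-style form it reads approximately:

$$ J(\theta) = \mathbb{E}\left[ \left( Y_t - Q(s_t, a_t; \theta) \right)^2 \right] $$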

where $Y_t$ is an approximate target value. To optimize this loss function, the standard Q-learning update rule is used:
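In the usual Q-learning notation (with target-network parameters written $\theta^-$ and learning rate $\alpha$), the target and the update are roughly:

$$ Y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-), \qquad \theta \leftarrow \theta + \alpha \left( Y_t - Q(s_t, a_t; \theta) \right) \nabla_\theta Q(s_t, a_t; \theta) $$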

All functions in DARQN are differentiable, so a gradient exists for every parameter and the entire model can be trained end-to-end. The algorithm also borrows the target-network and experience-replay techniques from DQN.

Hard Attention:

The hard attention mechanism requires that only one image patch (attention location) be sampled from the image at each time step.

Assume that $s_t$ is sampled from an environment distribution affected by the attention policy, and that the softmax layer of the attention network $g$ gives the parameters of a categorical distribution over the $L$ locations. Then, following the policy gradient method, the update of the policy parameters can be expressed as:
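The update equation is not reproduced here; in the standard REINFORCE form, with $i_t$ denoting the sampled attention location, it would read roughly:

$$ \Delta \theta_g \propto \sum_t \nabla_{\theta_g} \log \pi\!\left(i_t \mid v_t, h_{t-1}\right) R_t $$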

where $R_t$ is the future discounted return. To estimate this value, another network $G_t = \mathrm{Linear}(h_t)$ is introduced; it is trained by regressing towards the expected value of $Y_t$. The final update of the attention network parameters is done in the following manner:
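Sketching this final update with $G_t$ as the baseline, and using the $G_t - Y_t$ difference named in the next line as the advantage term (the more common REINFORCE-with-baseline convention writes it as $Y_t - G_t$, so the sign here follows the note rather than a verified formula):

$$ \Delta \theta_g \propto \sum_t \nabla_{\theta_g} \log \pi\!\left(i_t \mid v_t, h_{t-1}\right)\left(G_t - Y_t\right) $$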

where $G_t - Y_t$ is an estimate of the advantage function.
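A minimal sketch of how the hard-attention sampling and its REINFORCE-style surrogate loss could be wired up in PyTorch; the function name, the reuse of the `scores` tensor from the earlier architecture sketch, and the generic `advantage` argument are my assumptions, not the authors' implementation:

```python
import torch
from torch.distributions import Categorical

def hard_attention_step(scores, v, advantage):
    """scores: (B, L, 1) unnormalized attention logits; v: (B, L, D) location vectors.

    Samples one location per example, returns its feature vector as the context
    vector z_t plus a policy-gradient term to add to the training loss.
    """
    dist = Categorical(logits=scores.squeeze(-1))      # categorical over the L locations
    idx = dist.sample()                                # (B,) sampled location index i_t
    z = v[torch.arange(v.size(0)), idx]                # (B, D) context = chosen v_t^i
    # REINFORCE surrogate: -log pi(i_t) * advantage (advantage treated as a constant)
    pg_loss = -(dist.log_prob(idx) * advantage.detach()).sum()
    return z, pg_loss
```

During training, `pg_loss` would be added to the Q-learning loss so that the attention parameters receive the policy-gradient signal while the rest of the model is trained end-to-end as in the soft-attention case.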

  

The authors provide the source code: https://github.com/5vision/DARQN

  

Experimental Section:

  

Summary:

  
