An overview of the attention mechanism in neural networks

Attention mechanism

The attention mechanism is a mechanism used in the encoder-decoder structure, and it is now applied in a variety of tasks:

    • Neural machine translation (NMT)
    • Image captioning (translating an image into a sentence)
    • Text summarization (translating a text into a more compact form)

It is also no longer limited to the encoder-decoder structure: many variants of the attention structure have been applied to a wide range of tasks.

In general, attention mechanisms are applied in two ways:

    • Allowing the decoder to focus on the information it needs among the multiple vectors of a sequence; this is the traditional use of the attention mechanism. More information is retained because the encoder's output at every step is used instead of a single long vector.
    • Acting on the encoder to solve representation problems (e.g. re-encoding vectors as input to other models), generally using self-attention.
1. Encoder-decoder attention mechanism

1.1 Encoder-decoder structure

The encoder embeds the input as a vector, and the decoder produces the output conditioned on that vector. In the typical application scenarios (machine translation, etc.) both the input and the output are sequences, so this is also called the sequence-to-sequence (Seq2seq) model.

For training the encoder-decoder structure: because this kind of structure is differentiable everywhere, the model parameters \(\theta\) can be found from the training data by maximum likelihood estimation, maximizing the log-likelihood to obtain the parameters of the optimal model. That is,

\[\arg\max\limits_{\theta}\left\{\sum\limits_{(x,y)\in\text{corpus}}\log p(y|x;\theta)\right\}\]

This is an end-to-end training approach.
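
As a minimal sketch of this objective (assuming the model already supplies a probability for every reference token, which the helper below simply takes as given), training amounts to minimizing the summed negative log-likelihood over the corpus:

    import numpy as np

    def neg_log_likelihood(token_probs_per_pair):
        """Sum of -log p(y|x) over all (x, y) pairs of a toy corpus.

        token_probs_per_pair: list of 1-D arrays, each holding the model's
        probability for every reference token of one target sequence.
        """
        return -sum(np.log(p).sum() for p in token_probs_per_pair)

    # Hypothetical per-token probabilities for a corpus of two (x, y) pairs.
    corpus_probs = [np.array([0.9, 0.7, 0.8]), np.array([0.6, 0.95])]
    print(neg_log_likelihood(corpus_probs))  # the quantity minimized during training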

1.2 Encoder

The original input is encoded as a vector by a network model (CNN, RNN, DNN). Since attention is being studied here, a bidirectional RNN is used as the example model.

For each time step \(t\), the vector \(h_t\) encoded by the bidirectional RNN can be represented as the concatenation of the forward and backward hidden states:

\[h_t=[\overrightarrow{h}_t;\overleftarrow{h}_t]\]
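
A minimal NumPy sketch of this encoder, with toy sizes; purely for brevity the two directions share parameters here, whereas a real bidirectional RNN trains a separate set for each direction:

    import numpy as np

    def rnn_pass(X, W_x, W_h, b):
        """Run a simple tanh RNN over the time axis of X, shape (T, d_in)."""
        h = np.zeros(W_h.shape[0])
        hs = []
        for x_t in X:
            h = np.tanh(W_x @ x_t + W_h @ h + b)
            hs.append(h)
        return np.stack(hs)                      # (T, d_h) hidden state sequence

    rng = np.random.default_rng(0)
    T, d_in, d_h = 6, 8, 4                       # toy sizes (assumptions)
    X = rng.normal(size=(T, d_in))               # embedded input sequence
    W_x, W_h, b = (rng.normal(scale=0.1, size=s)
                   for s in [(d_h, d_in), (d_h, d_h), (d_h,)])

    h_fwd = rnn_pass(X, W_x, W_h, b)             # forward direction
    h_bwd = rnn_pass(X[::-1], W_x, W_h, b)[::-1] # backward direction, re-aligned in time
    H = np.concatenate([h_fwd, h_bwd], axis=1)   # h_t = [h_fwd_t ; h_bwd_t]
    print(H.shape)                               # (6, 8): one concatenated vector per step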

1.3 Decoder

The decoder here is a unidirectional RNN that produces the output sequence one step at a time. In the plain encoder-decoder structure, the decoder uses only the encoder's hidden vector at the last time step, \(h_{t_x}\), where \(t_x\) is the time-step length of the current sample (for NLP problems, all samples are often treated as equal length). This forces the encoder to pack all of the information into the last hidden vector \(h_{t_x}\).

However, because \(h_{t_x}\) is a single fixed-length vector, its representational capacity is limited; the amount of information it can hold is limited, and much will be lost.

The attention mechanism allows the decoder, at each time step \(t\), to consider the encoder's entire hidden state sequence \((h_1, h_2, \cdots, h_{t_x})\), so that the encoder can spread information across all of its hidden state vectors, while the decoder decides which of these vectors to pay more attention to.

Specifically, each output (e.g. a word) \(y_t\) in the target sequence \((y_1, \cdots, y_{t_x})\) produced by the decoder follows the conditional distribution:

\[p(y_t|\{y_1,\cdots,y_{t-1}\},c_t)=\mathrm{softmax}(W_s\tilde{h}_t)\]

where \(\tilde{h}_t\) is the attentional hidden state vector, computed as:

\[\tilde{h}_t=\tanh(W_c[c_t;h_t])\]

\(h_t\) is the hidden state of the top layer of the decoder at the current time step, and \(c_t\) is the context vector, computed from the encoder's hidden vectors for the current time step. There are two main ways to compute it, global and local attention, described below. \(W_c\) and \(W_s\) are trained parameter matrices. To simplify the formulas, bias terms are not shown.
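
A minimal NumPy sketch of this decoder step, assuming the context vector \(c_t\) has already been computed and using random values in place of the trained \(W_c\) and \(W_s\):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())                  # subtract max for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(0)
    d_h, vocab = 4, 10                           # toy sizes (assumptions)
    h_t = rng.normal(size=d_h)                   # decoder hidden state at step t
    c_t = rng.normal(size=d_h)                   # context vector (see 1.4 / 1.5)
    W_c = rng.normal(scale=0.1, size=(d_h, 2 * d_h))
    W_s = rng.normal(scale=0.1, size=(vocab, d_h))

    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # attentional hidden state
    p_y = softmax(W_s @ h_tilde)                 # distribution over the output vocabulary
    print(p_y.sum())                             # 1.0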

1.4 Global attention

With global attention, the context vector \(c_t\) is obtained as a weighted sum over the hidden vectors of the entire sequence. Suppose that for sample \(x\) the sequence length is \(t_x\); then \(c_t\) is computed as:

\[c_t=\sum\limits_{i=1}^{t_x}\alpha_{t,i}h_i\]

Here the alignment vector \(\alpha_t\), of length \(t_x\), is a function of the time step \(t\) and encodes the importance of each vector in the hidden state sequence. Each of its elements \(\alpha_{t,i}\) is computed with a softmax:

\[\alpha_{t,i}=\frac{\exp(\mathrm{score}(h_t,h_i))}{\sum\limits_{j=1}^{t_x}\exp(\mathrm{score}(h_t,h_j))}\]

The size of \(\alpha_{t,i}\) indicates how important the hidden vector at position \(i\) of the sequence is for predicting the output at the current time step \(t\).

The score function can be any function of a pair of vectors; the common choices are:

    • Dot product: \(\mathrm{score}(h_t, h_i)=h_t^\top h_i\)

      This works better with global attention.

    • General form with a parameter matrix:

      \(\mathrm{score}(h_t, h_i)=h_t^\top W_{\alpha}h_i\)

      This is equivalent to inserting a fully connected layer; it works better with local attention.

In summary, global attention works as follows: at each decoding step, score every encoder hidden state against the current decoder hidden state, normalize the scores with a softmax to get the alignment weights, and take the weighted sum of the encoder states as the context vector \(c_t\).
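
A minimal NumPy sketch of global attention with the dot-product score (toy sizes; H stands in for the encoder's hidden state sequence):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def global_attention(H, h_t):
        """Global attention with dot-product score.

        H   : (t_x, d_h) encoder hidden states h_1..h_{t_x}
        h_t : (d_h,)     current decoder hidden state
        """
        scores = H @ h_t                 # score(h_t, h_i) = h_t^T h_i for every i
        alpha_t = softmax(scores)        # alignment vector, sums to 1
        c_t = alpha_t @ H                # weighted sum of encoder states
        return c_t, alpha_t

    rng = np.random.default_rng(0)
    H = rng.normal(size=(6, 4))          # t_x = 6, d_h = 4 (toy sizes)
    h_t = rng.normal(size=4)
    c_t, alpha_t = global_attention(H, h_t)
    print(alpha_t.round(3), c_t.shape)   # weights over 6 positions, (4,) context vector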

1.5 Local attention

Global attention has to be computed over all time steps in the sequence, which is relatively expensive. Instead, a local attention mechanism with a fixed window of size \(2d+1\) can be used, where \(d\) is a hyper-parameter giving the one-sided distance from the window's center to its edge. The context vector \(c_t\) is then computed as:

\[c_t=\sum\limits_{i=p_t-d}^{p_t+d}\alpha_{t,i}h_i\]

The only difference from global attention is the range of time steps considered; everything else is exactly the same. \(p_t\) is the center of the window: it can simply be set equal to the current time step \(t\), or it can be predicted by a trained component, namely:

\[p_t=t_x\,\sigma(v_p^\top\tanh(W_p h_t))\]

where \(\sigma\) is the sigmoid function, and \(v_p\) and \(W_p\) are trained parameters. The computed \(p_t\) is therefore a floating-point number, but this is not a problem, because when the alignment weight vector \(\alpha_t\) is computed, a Gaussian term with mean \(p_t\) and standard deviation \(\frac{d}{2}\) is included as a factor:

\[\alpha_{t,i}=\frac{\exp(\mathrm{score}(h_t,h_i))}{\sum\limits_{j=1}^{t_x}\exp(\mathrm{score}(h_t,h_j))}\exp\left(-\frac{(i-p_t)^2}{2(d/2)^2}\right)\]

Naturally, \(p_t\in\mathbb{R}\cap[0,t_x]\) and \(i\in\mathbb{N}\cap[p_t-d,p_t+d]\). Because of the Gaussian term, the attention mechanism now treats the vectors near the center of the window as more important; compared with global attention, it both adds this Gaussian factor and truncates it to the window, effectively a truncated normal distribution.

In summary, local attention works as follows: predict a window center \(p_t\), score only the encoder hidden states inside the window \([p_t-d, p_t+d]\), reweight the scores with the Gaussian factor, and take the weighted sum as the context vector \(c_t\).
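
A minimal NumPy sketch of local attention with a predicted window center, using the dot-product score inside the window and random values in place of the trained \(v_p\) and \(W_p\):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def local_attention(H, h_t, v_p, W_p, d=2):
        """Local attention: predicted center p_t, Gaussian-reweighted window scores."""
        t_x = H.shape[0]
        p_t = t_x / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))  # p_t = t_x * sigmoid(...)
        lo = max(0, int(np.floor(p_t)) - d)
        hi = min(t_x, int(np.floor(p_t)) + d + 1)
        idx = np.arange(lo, hi)                       # window [p_t - d, p_t + d]
        alpha = softmax(H[idx] @ h_t)                 # dot-product scores inside the window
        alpha = alpha * np.exp(-((idx - p_t) ** 2) / (2 * (d / 2) ** 2))  # Gaussian factor
        c_t = alpha @ H[idx]                          # weighted sum of windowed states
        return c_t, p_t, alpha

    rng = np.random.default_rng(0)
    t_x, d_h = 8, 4                                   # toy sizes (assumptions)
    H, h_t = rng.normal(size=(t_x, d_h)), rng.normal(size=d_h)
    v_p, W_p = rng.normal(size=d_h), rng.normal(size=(d_h, d_h))
    c_t, p_t, alpha = local_attention(H, h_t, v_p, W_p)
    print(round(float(p_t), 2), alpha.round(3))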

2. Self-attention

2.1 Differences from the encoder-decoder attention mechanism

The biggest difference is that the self-attention model has no decoder. This leads to two direct consequences:

    • In the seq2seq model, the context vector \(c_t=\sum\limits_{i=1}^{t_x}\alpha_{t,i}h_i\) is used to form the decoder input \(\tilde{h}_t=\tanh(W_c[c_t;h_t])\); but since the self-attention mechanism has no decoder, this weighted sum is directly the output of the model, denoted \(s_t\).
    • When computing the alignment vector \(\alpha_t\), the seq2seq model compares the hidden vector at each position with the current decoder hidden vector. In the self-attention mechanism, each element of the alignment vector is computed from the hidden vector at each position and an "average optimal" query vector for the current time step \(t\), which is obtained by training.
2.2 Implementing the self-attention mechanism

First, each hidden vector \(h_i\) is passed through a fully connected layer (with weight matrix \(W\)) to obtain \(u_i\):

\[u_i=\tanh(Wh_i)\]

This vector is then used to compute the alignment vector \(\alpha_t\) via softmax normalization:

\[\alpha_{t,i}=\frac{\exp(\mathrm{score}(u_i,u_t))}{\sum\limits_{j=1}^{t_x}\exp(\mathrm{score}(u_j,u_t))}\]

Here \(u_t\) is the "average optimal" query vector corresponding to the current time step \(t\); it differs between time steps and is learned through training.

Finally, the output is computed as:

\[s_t=\sum\limits_{i=1}^{t_x}\alpha_{t,i}h_i\]

In general, for sequence problems only the output at the last time step is of interest; earlier time steps produce no output, i.e. the final output is \(s=s_t\).
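
A minimal NumPy sketch of this self-attention pooling, using a dot-product score between each \(u_i\) and the learned query vector \(u_t\) (toy sizes; random values stand in for the trained \(W\) and \(u_t\)):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def self_attention_pool(H, W, u_t):
        """Pool the hidden state sequence H (t_x, d_h) into a single vector."""
        U = np.tanh(H @ W.T)          # u_i = tanh(W h_i) for every position i
        alpha = softmax(U @ u_t)      # score(u_i, u_t) = u_i^T u_t, then softmax
        return alpha @ H, alpha       # s = sum_i alpha_i h_i

    rng = np.random.default_rng(0)
    t_x, d_h, d_a = 6, 4, 3           # toy sizes (assumptions)
    H = rng.normal(size=(t_x, d_h))   # hidden states from the encoder
    W = rng.normal(scale=0.1, size=(d_a, d_h))
    u_t = rng.normal(size=d_a)        # learned "average optimal" query vector
    s, alpha = self_attention_pool(H, W, u_t)
    print(alpha.round(3), s.shape)    # weights over 6 positions, (4,) pooled vector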

2.3 Hierarchical attention

For example, for an NLP problem, two self-attention mechanisms can be used in the architecture, one at the word level and one at the sentence level. This matches the natural hierarchical structure of a document:

words form sentences, and sentences form the document. Within each sentence, the importance of each word is determined; across the whole document, the importance of the different sentences is determined.
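
A minimal sketch of this hierarchical composition, reusing the same attention pooling: word vectors are pooled into sentence vectors, and sentence vectors into a document vector. A full hierarchical model would also run an encoder (e.g. a bidirectional RNN) at each level before pooling; that step is omitted here, and all names and sizes are toy assumptions:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def attend(H, W, u):
        """Pool a sequence of vectors H (n, d) into one vector with self-attention."""
        alpha = softmax(np.tanh(H @ W.T) @ u)
        return alpha @ H

    rng = np.random.default_rng(0)
    d_w, d_a = 4, 3                                        # toy sizes (assumptions)
    doc = [rng.normal(size=(n, d_w)) for n in (5, 7, 3)]   # 3 sentences of word vectors

    W_word, u_word = rng.normal(size=(d_a, d_w)), rng.normal(size=d_a)
    W_sent, u_sent = rng.normal(size=(d_a, d_w)), rng.normal(size=d_a)

    sent_vecs = np.stack([attend(S, W_word, u_word) for S in doc])  # word-level attention
    doc_vec = attend(sent_vecs, W_sent, u_sent)                     # sentence-level attention
    print(doc_vec.shape)                                   # (4,) document representation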
