1. Preface
In 2017 Google published the paper "Attention Is All You Need", which proposed an architecture based entirely on attention for sequence-modeling problems such as machine translation. Most traditional neural machine translation systems use an RNN or CNN as the basis of the encoder-decoder model, while Google's Transformer, built purely on attention, abandons that convention and uses no CNN or RNN structure at all. The model can run in a highly parallel manner, so training is very fast while translation quality also improves.
2. Transformer model structure
The main structure of the Transformer (as shown in the paper's model diagram):
2.1 Encoder Decoder
The model is divided into two parts: an encoder and a decoder.
- The encoder is a stack of 6 identical layers, each with two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a simple fully connected feed-forward network. A residual connection is added around each of the two sub-layers, followed by layer normalization. The output dimension of all sub-layers and embedding layers in the model is \(d_{model}\).
- The decoder is likewise a stack of 6 identical layers. In addition to the two sub-layers found in the encoder, each decoder layer adds a third sub-layer, which also uses a residual connection and layer normalization (see the sketch after this list). The details are described later.
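As a minimal sketch of the residual-plus-layer-normalization wrapper mentioned above (the function names are my own, and the learnable gain and bias of layer normalization are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    """The wrapper used around every sub-layer: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))
```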
2.2 Input Layer
The inputs to the encoder and decoder are obtained by converting the tokens (usually words or characters) into \(d_{model}\)-dimensional vectors using learned embeddings. On the decoder side, the decoded output is converted into probabilities for predicting the next token using a linear transformation followed by a softmax function.
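A rough sketch of these two steps, assuming a NumPy embedding table and output projection (all names and shapes here are illustrative, not the paper's exact setup):

```python
import numpy as np

def embed(token_ids, embedding_table):
    """Look up a learned d_model-dimensional vector for each token id."""
    return embedding_table[token_ids]                  # (seq_len, d_model)

def output_probabilities(decoder_output, W_out, b_out):
    """Linear transformation + softmax over the vocabulary for next-token prediction."""
    logits = decoder_output @ W_out + b_out            # (seq_len, vocab_size)
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)
```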
2.3 Positional Encoding
Because the model contains no recurrence or convolution, in order to make use of the order of the sequence, the relative and absolute position information of the tokens must be injected into the model. The paper adds a "positional encoding" to the input embeddings. The positional encoding and the embeddings share the same dimension \(d_{model}\), so the two can be added directly. There are many choices of positional encoding, both learned and fixed.
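As an illustration of a fixed encoding, here is a minimal sketch of the sinusoidal scheme used in the original paper (sine on even dimensions, cosine on odd ones); the function name and the use of NumPy are my own choices:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed positional encoding: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    dims = np.arange(d_model)[None, :]                    # (1, d_model)
    # Each pair of dimensions shares one frequency: 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                      # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# The encoding has the same dimension as the embeddings, so it can be added directly:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```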
2.4 Attention Model
2.4.1 Scaled Dot-Product Attention
The attention used in this paper is a scaled version of basic dot-product attention. The input consists of queries and keys of dimension \(d_k\), and values of dimension \(d_v\). The dot products of the query with all keys are computed, and each is divided by \(\sqrt{d_k}\) (this is the "scaling" operation). A softmax function is then applied to obtain the weights on the values.
In practice, the attention function is computed on a set of queries simultaneously: the queries are packed together into a matrix \(Q\), and the keys and values are packed into matrices \(K\) and \(V\). The output matrix of the attention can then be computed according to the following formula:
\[\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
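A minimal NumPy sketch of this formula (the function names are my own; the optional mask argument anticipates the decoder masking discussed later):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)       # (..., len_q, len_k)
    if mask is not None:
        # Masked-out positions get a large negative score, so softmax gives them ~0 weight
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights
```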
2.4.2 Multi-head Attention
The attention used in this architecture is not a single application of dot-product attention. The authors found that it works notably better to linearly project the queries, keys and values \(h\) times, using different learned linear projections to \(d_k\), \(d_k\) and \(d_v\) dimensions, respectively. The attention function is then applied in parallel to each set of projected queries, keys and values, producing \(d_v\)-dimensional outputs, which are concatenated and projected once more. The concrete structure and formulas are as follows.
\[\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(head_1, \dots, head_h)W^O\]
\[\text{where } head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]
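A rough sketch of how the projections and the concatenation fit together, reusing the scaled_dot_product_attention function above (the weight arguments and their shapes are illustrative choices of mine, not the paper's training setup):

```python
import numpy as np

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, h):
    """Concat(head_1, ..., head_h) W^O with head_i = Attention(Q Wq[i], K Wk[i], V Wv[i]).

    Wq, Wk, Wv: lists of h projection matrices (d_model x d_k or d_model x d_v)
    Wo:         output projection matrix (h * d_v x d_model)
    """
    heads = []
    for i in range(h):
        head, _ = scaled_dot_product_attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i])
        heads.append(head)
    return np.concatenate(heads, axis=-1) @ Wo
```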
2.4.3 Attention in the Model
The Transformer uses multi-head attention in three different ways.
- In the encoder-decoder attention layers, the queries come from the previous decoder layer, and the keys and values come from the encoder output. This is similar to the attention mechanisms used by many previously proposed seq2seq models.
- The encoder contains self-attention layers. In a self-attention layer, all keys, values and queries come from the same place, in this case the output of the previous encoder layer.
- The decoder likewise contains self-attention layers. The difference is that a mask is added inside the scaled dot-product attention, which ensures that the softmax does not assign attention weight to illegal (future) positions, as sketched after this list.
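A minimal sketch of such a mask, assuming the scaled_dot_product_attention function defined above: each position may attend only to itself and to earlier positions.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Example: in decoder self-attention the same sequence X supplies Q, K and V,
# and the mask blocks attention to future (illegal) positions.
# out, _ = scaled_dot_product_attention(X, X, X, mask=causal_mask(X.shape[-2]))
```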
As noted earlier, the model is built from stacks of identical layers, each containing two sub-layers: the attention sub-layer discussed above, and a feed-forward network that follows it.
2.4.4 Feed Forward
The second sub-layer in each layer, applied after the attention sub-layer, is a position-wise feed-forward network. It is described by the following formula.
\[\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]
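A direct sketch of this formula, applied identically at each position (the parameter names here simply mirror the symbols in the equation):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```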
3. Summary
This covers the overall framework of the model. Its most important innovation is the architecture built on self-attention and multi-head attention. Even after abandoning traditional CNNs and RNNs, it improves performance while reducing training time. Applied to machine translation, the Transformer performs excellently, parallelizes well, and greatly shortens training. It also broadens the range of structures worth considering when tackling such problems.
(Reposting is welcome; please indicate the source. Feel free to get in touch: [email protected])