Notes on the paper: A Structured Self-Attentive Sentence Embedding

Introduction
This article presents a model that uses a self-attention mechanism to generate interpretable sentence embeddings. We usually use vectors to represent words, phrases, or sentences; in this paper, the authors instead propose representing a sentence as a two-dimensional matrix, where each row of the matrix attends to a different part of the sentence. The authors evaluated the model on three tasks over three datasets, namely author profiling, sentiment classification, and textual entailment, and it achieved good results on all of them.

Model
The model proposed by the authors consists of two parts. The first part is a bidirectional LSTM, and the second part is a self-attention mechanism, which provides a set of weights for computing a weighted sum of the LSTM hidden states.
The model structure diagram is as follows (part (a) shows the overall model, and part (b) shows the attention computation):
Model input: $S = (w_1, w_2, \dots, w_n) \in \mathbb{R}^{n \times d}$, a sequence of $n$ tokens, where $w_i$ denotes the word embedding of the $i$-th token in the sequence.
The input $S$ enters a bidirectional LSTM, and the forward and backward hidden states for the $t$-th token are computed as follows:

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(w_t, \overrightarrow{h_{t-1}}), \qquad \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(w_t, \overleftarrow{h_{t+1}})$$
The forward and backward hidden states are then concatenated into $h_t$ for subsequent computation, with $h_t \in \mathbb{R}^{2u}$ if the LSTM has $u$ hidden units. The paper uses $H \in \mathbb{R}^{n \times 2u}$ to denote the collection of all $n$ hidden states:

$$H = (h_1, h_2, \dots, h_n)$$
At this point, the BiLSTM's job is done.
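As a concrete illustration, here is a minimal PyTorch sketch of this encoding step. The dimension names ($n$, $d$, $u$) follow the notation above, but the concrete values and the random input are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, u = 10, 100, 150    # sequence length, embedding dim, LSTM hidden units (assumed values)

S = torch.randn(1, n, d)  # one sentence of n token embeddings: (batch=1, n, d)
bilstm = nn.LSTM(input_size=d, hidden_size=u,
                 bidirectional=True, batch_first=True)

# Each output position concatenates the forward and backward hidden states h_t,
# so the output stacks all h_t into H.
H, _ = bilstm(S)
print(H.shape)            # torch.Size([1, 10, 300]), i.e. (1, n, 2u)
```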
Part (b) of the figure above is a simple feedforward neural network used to generate the self-attention weights:
$$a = \mathrm{softmax}(w_{s2} \tanh(W_{s1} H^T))$$
Here $W_{s1} \in \mathbb{R}^{d_a \times 2u}$, $H^T \in \mathbb{R}^{2u \times n}$, and $w_{s2} \in \mathbb{R}^{1 \times d_a}$, so the final shape of $a$ is $\mathbb{R}^{1 \times n}$. And because the softmax function is used for normalization, each dimension of $a$ can be interpreted as the attention weight of the word at the corresponding position.
At this point, the self-attention mechanism is complete.
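Continuing the sketch above, the attention computation takes only a few lines. Here $d_a$ is a hyperparameter of the attention MLP whose value is assumed, and the parameters are sampled randomly for illustration (in the real model, $W_{s1}$ and $w_{s2}$ are learned):

```python
d_a = 350                          # attention MLP size (assumed value)

W_s1 = torch.randn(d_a, 2 * u)     # (d_a, 2u)
w_s2 = torch.randn(1, d_a)         # (1, d_a)

Ht = H.squeeze(0).t()              # H^T: (2u, n)
a = torch.softmax(w_s2 @ torch.tanh(W_s1 @ Ht), dim=1)
print(a.shape)                     # torch.Size([1, 10]), i.e. (1, n)
print(a.sum())                     # ~1.0: the weights form a distribution over the n tokens
```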
We can then obtain the sentence representation $m = a \cdot H \in \mathbb{R}^{1 \times 2u}$.
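In code, this is a single matrix product, continuing the sketch above:

```python
# m = a · H: a weighted sum of the BiLSTM hidden states.
m = a @ H.squeeze(0)   # (1, n) @ (n, 2u) -> (1, 2u)
print(m.shape)         # torch.Size([1, 300])
```

To obtain the two-dimensional matrix representation mentioned in the introduction, the paper generalizes the vector $w_{s2}$ into a matrix $W_{s2} \in \mathbb{R}^{r \times d_a}$, so the attention becomes a matrix $A \in \mathbb{R}^{r \times n}$ (with softmax applied along each row) and the sentence embedding becomes $M = AH \in \mathbb{R}^{r \times 2u}$, where each of the $r$ rows of $M$ attends to a different part of the sentence.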