"A Structured Self-Attentive Sentence Embedding" (attention mechanism)


Background and motivation:

The standard first step in working with text is word embedding. There are also embedding methods that operate at the phrase and sentence level. These can be broadly divided into two types: general-purpose sentence embeddings and task-specific ones. Conventional practice is to take the last hidden state of an RNN, apply max or average pooling over the RNN hidden states, or use convolved n-grams. Some work also takes parse and dependency trees into account.

In some work, people began to introduce additional information, using attention to assist sentence embedding. However, for tasks such as sentiment classification, this cannot be applied directly because there is no such additional information: the model is given only a single sentence as input. In that case, the most common practice is max pooling or averaging over the RNN hidden states of all time steps, or using only the state at the last time step as the final embedding.

In this paper, a self-attention mechanism is proposed to replace the commonly used max pooling or averaging step, because the authors believe that carrying the semantics along all time steps of a recurrent model is relatively hard and not necessary. Unlike previous methods, this self-attention mechanism allows extracting different aspects of the sentence into multiple vector representations. In the sentence embedding model, it is performed on top of the LSTM. This lets the attention model be applied to tasks with no additional input, and relieves some of the long-term memorization burden of the LSTM. Another benefit is that visualizing the extracted embedding is simple and intuitive.

Approach details:

1. Model

The proposed sentence embedding model contains two parts: (1) a bidirectional LSTM; (2) the self-attention mechanism.

Given a sentence, we first pass it through word embedding to get S = (w1, w2, ..., wn), and then stack these vectors into a 2-D matrix of dimension n × d.

Then, to model the relationships between different words, we use a bidirectional LSTM and concatenate the hidden states from the two directions, giving an n × 2u matrix, written H.

To encode the variable-length sentence as a fixed-size embedding, we take a linear combination of the n LSTM hidden states. To compute such a linear combination, the self-attention mechanism is used, which takes all the LSTM hidden states H as input and outputs a vector of weights a:

  a = softmax(w_s2 tanh(W_s1 H^T))

Here W_s1 is a weight matrix of size d_a × 2u, and w_s2 is a parameter vector of size d_a, where d_a is a hyperparameter we can set ourselves. Since H has size n × 2u, the annotation vector a has size n, and the softmax() ensures the computed weights sum to 1. We then weight the LSTM hidden states H by the attention weights a to get the attended vector m.
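A minimal numpy sketch of this single-hop attention step may help fix the shapes; the concrete sizes and random values below are purely illustrative, not from the paper:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: n tokens, 2u-dim BiLSTM states, d_a attention units.
n, two_u, d_a = 5, 8, 4
rng = np.random.default_rng(0)

H = rng.standard_normal((n, two_u))       # BiLSTM hidden states, n x 2u
W_s1 = rng.standard_normal((d_a, two_u))  # weight matrix, d_a x 2u
w_s2 = rng.standard_normal(d_a)           # parameter vector, size d_a

# a = softmax(w_s2 tanh(W_s1 H^T)): a length-n weight vector summing to 1.
a = softmax(w_s2 @ np.tanh(W_s1 @ H.T))

# m = a H: the attended sentence vector, of size 2u.
m = a @ H
```

Note that the whole computation is just two matrix products and a softmax, so it adds very little cost on top of the BiLSTM.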

Such a vector representation usually focuses on one particular component of the sentence, such as a specific word or a related set of words, so it reflects only one aspect of the sentence's semantics. However, there may be several different components in a sentence, especially a long one. So, to represent the overall semantics of the sentence, we need multiple m's that focus on different parts, i.e., multiple hops of attention. To extract r different parts from the sentence, we expand w_s2 into an r × d_a matrix, written W_s2, and the resulting annotation vector a becomes the annotation matrix A. Formally:

  A = softmax(W_s2 tanh(W_s1 H^T))

Here, softmax() is executed along the second dimension of its input, so each row of A sums to 1. We can view formula (6) as a 2-layer MLP without bias.

The attended vector m then becomes an r × 2u embedding matrix M. We multiply the annotation matrix A by the LSTM hidden states H to get r weighted sums, and the resulting matrix is the sentence embedding:

M = AH
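The multi-hop version only changes the shape of the second weight and the axis of the softmax; a numpy sketch with illustrative sizes (again not from the paper's code):

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, two_u, d_a, r = 5, 8, 4, 3  # hypothetical sizes; r = attention hops
rng = np.random.default_rng(0)

H = rng.standard_normal((n, two_u))      # BiLSTM hidden states, n x 2u
W_s1 = rng.standard_normal((d_a, two_u))
W_s2 = rng.standard_normal((r, d_a))     # w_s2 expanded to r x d_a

# A = softmax(W_s2 tanh(W_s1 H^T)); the softmax runs along the second
# dimension, so each of the r rows of A sums to 1 over the n tokens.
A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=1)  # r x n

# M = A H: r weighted sums of hidden states -> r x 2u sentence embedding.
M = A @ H
```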

Summary: the first form of attention effectively adds a 2-layer fully connected network on top of the hidden states H whose last layer has output dimension 1; in the second form, the last layer has output dimension r instead of 1.

2. Penalization term

If the attention mechanism always produces similar summation weights for all r hops, the embedding matrix M suffers from redundancy. We therefore need a penalty that encourages diversity among the summation weight vectors.

An obvious choice for measuring the difference between two summation weight vectors is the KL divergence (Kullback–Leibler divergence); however, the authors find it is not appropriate here. They suspect this is because maximizing a set of KL divergences pushes the annotation matrix A to have many sufficiently small or even zero values at different softmax output units, and this vast amount of zeros makes training unstable. KL divergence also lacks a property we want: we want each individual row to focus on a single aspect of the semantics, so we want the probability mass in each annotation softmax output to be more focused.

Instead, we multiply A by its transpose and subtract the identity matrix, using the squared Frobenius norm of the result as a measure of redundancy:

  P = ||A A^T − I||_F^2
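A tiny numpy sketch of this penalty, with hand-picked toy annotation matrices (not from the paper) to show how it rewards diverse rows and punishes identical ones:

```python
import numpy as np

def penalty(A):
    # P = || A A^T - I ||_F^2 for an r x n annotation matrix A.
    r = A.shape[0]
    G = A @ A.T - np.eye(r)
    return float(np.sum(G ** 2))  # squared Frobenius norm

# If the r hops put their probability mass on disjoint tokens (one-hot
# rows), A A^T = I and the penalty vanishes; identical rows are penalized.
A_diverse = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
A_redundant = np.array([[0.5, 0.5, 0.0],
                        [0.5, 0.5, 0.0]])
```

This penalty is added to the task loss during training, so minimizing it pushes the rows of A toward focused, non-overlapping attention distributions.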

Experiments:

