Visual Question Answering with Memory-Augmented Networks


Motivation:

Although VQA has made great progress, existing methods still perform poorly on fully general, free-form VQA. The authors attribute this to two points:

1. Deep models trained with gradient-based methods learn to respond to the majority of the training data rather than to specific, scarce exemplars;

In other words, a deep model trained by gradient descent fits the bulk of the training data well, but not specific, rarely seen samples.

2. Existing VQA systems learn about the properties of objects from question-answer pairs, sometimes independently of the image.

Selective attention to relevant regions of the image is therefore an important strategy.

Inspired by recent memory-augmented neural networks and the co-attention mechanism, this paper uses memory networks to remember rare events and applies a memory-augmented network with attention to rare answers for VQA.

The proposed algorithm:

The algorithm proceeds as follows: first, the question and the image are embedded to extract their features; co-attention is then learned over the two modalities; the two attention-weighted features are combined and fed into the memory network; finally, the answer is selected.

Image embedding: visual features are extracted with a pre-trained CNN;
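As a rough sketch of this step (the specific backbone, layer, and input size below are assumptions, not taken from the paper), spatial image features can be extracted like this:

```python
import torch
import torchvision.models as models

# Sketch: extract spatial image features with a pre-trained CNN.
# The backbone (ResNet-152) and the 448x448 input size are illustrative choices.
resnet = models.resnet152(pretrained=True)
# Drop the average-pool and fully connected layers to keep a spatial grid.
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])
feature_extractor.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 448, 448)     # a preprocessed image batch
    fmap = feature_extractor(image)         # (1, 2048, 14, 14)
    # Flatten the spatial grid into N = 14*14 region vectors v_1..v_N.
    v = fmap.flatten(2).transpose(1, 2)     # (1, 196, 2048)
```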

Question embedding: language features are learned with a bidirectional LSTM.
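A minimal sketch of the question encoder, assuming illustrative vocabulary and layer sizes (the paper's exact dimensions are not reproduced here):

```python
import torch
import torch.nn as nn

# Sketch: encode a tokenized question with a bidirectional LSTM.
vocab_size, embed_dim, hidden_dim = 10000, 300, 512

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

question = torch.randint(1, vocab_size, (1, 12))   # one question of 12 tokens
q_tokens = embedding(question)                     # (1, 12, 300)
q_seq, _ = bilstm(q_tokens)                        # (1, 12, 1024): per-word features q_1..q_T
```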

Sequential co-attention:

The sequential co-attention mechanism takes the joint characteristics of the image and the question into account, letting the two modalities attend to each other. A base vector m0 is first computed from the averages of the visual features and of the question features.

A two-layer neural network computes the soft attention. For visual attention, the attention weights are computed from the visual features and m0, and the weighted visual feature vector is their attention-weighted sum,

where W_v, W_m, and W_h are learned weight matrices. The weighted question feature vector is computed analogously.

The weighted visual vector and the weighted question vector are combined to represent the input image-question pair; Figure 4 of the paper shows the entire co-attention process.
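Below is a hedged sketch of the co-attention step just described: a base vector m0 from the averaged features, a two-layer soft attention with weights W_v, W_m, W_h, and the weighted visual and question vectors. The element-wise product used for m0, the exact attention parameterisation, and the final concatenation are assumptions consistent with the description, not necessarily the paper's precise formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Two-layer soft attention over a set of feature vectors, guided by m0."""
    def __init__(self, feat_dim, mem_dim, hidden_dim):
        super().__init__()
        self.W_v = nn.Linear(feat_dim, hidden_dim, bias=False)
        self.W_m = nn.Linear(mem_dim, hidden_dim, bias=False)
        self.W_h = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, feats, m0):
        # feats: (B, N, feat_dim), m0: (B, mem_dim)
        h = torch.tanh(self.W_v(feats)) * torch.tanh(self.W_m(m0)).unsqueeze(1)
        alpha = F.softmax(self.W_h(h).squeeze(-1), dim=1)      # (B, N) attention weights
        attended = (alpha.unsqueeze(-1) * feats).sum(dim=1)    # weighted feature vector
        return attended, alpha

# Base vector m0 from the averaged visual and question features
# (element-wise product of the two averages is an assumption).
B, N, T, D = 2, 196, 12, 512
v, q = torch.randn(B, N, D), torch.randn(B, T, D)
m0 = v.mean(dim=1) * q.mean(dim=1)

visual_att = SoftAttention(D, D, 256)
question_att = SoftAttention(D, D, 256)
v_star, _ = visual_att(v, m0)              # weighted visual feature
q_star, _ = question_att(q, m0)            # weighted question feature
x_t = torch.cat([v_star, q_star], dim=1)   # joint representation fed to the memory network
```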

Memory-augmented network:

RNNs lack an external memory in which a long-term record of scarce training data can be maintained, so this paper uses a memory-augmented neural network for VQA.

In particular, a standard LSTM serves as the controller: it receives the input data and interacts with the external memory module. The external memory M_t is a set of row vectors that act as memory slots.

Here x_t denotes the combination of the visual and textual features, and y_t is the one-hot encoded answer vector for the question. x_t is fed into the LSTM controller, as sketched below.
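A minimal sketch of the controller step, assuming an LSTM cell with illustrative dimensions:

```python
import torch
import torch.nn as nn

# Sketch: the controller is a standard LSTM that consumes the fused
# feature x_t and produces the hidden state h_t used to query memory.
input_dim, hidden_dim = 1024, 512
controller = nn.LSTMCell(input_dim, hidden_dim)

x_t = torch.randn(1, input_dim)                 # fused visual + question feature
h_prev = torch.zeros(1, hidden_dim)
c_prev = torch.zeros(1, hidden_dim)
h_t, c_t = controller(x_t, (h_prev, c_prev))    # h_t later serves as the read query
```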

To read from the external memory, the hidden state h_t is used as a query against M_t. First, the cosine distance between the query vector h_t and each row of the memory is computed.

The cosine distances are then normalised with a softmax to obtain a read-weight vector w^r.

With these read weights, the memory retrieval r_t is obtained as the weighted sum of the memory rows.

Finally, the retrieved memory vector r_t is combined with the controller hidden state h_t to produce the output vector o_t for the learned classifier, as in the sketch below.
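The read operation just described (cosine similarity, softmax read weights, weighted retrieval, combined output) can be sketched as follows; shapes and sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def read_memory(h_t, M_t):
    """Content-based read: cosine similarity between the query h_t and each
    memory slot, softmax read weights, and the retrieved vector r_t.
    Shapes: h_t (B, D), M_t (num_slots, D)."""
    sim = F.cosine_similarity(h_t.unsqueeze(1), M_t.unsqueeze(0), dim=-1)  # (B, num_slots)
    w_r = F.softmax(sim, dim=-1)     # read weights
    r_t = w_r @ M_t                  # (B, D) retrieved memory
    return r_t, w_r

# Example usage with illustrative sizes.
B, D, num_slots = 1, 512, 128
h_t = torch.randn(B, D)
M_t = torch.randn(num_slots, D)
r_t, w_r = read_memory(h_t, M_t)
o_t = torch.cat([h_t, r_t], dim=-1)  # fed to the answer classifier
```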

The usage weights w^u control writing to memory and are updated by attenuating the previous usage state.

To compute the write weights, a truncation mechanism updates the least-used memory locations. Here m(v, n) denotes the n-th smallest element of a vector v. A learnable sigmoid gate parameter forms a convex combination of the previous read weights and the least-used (usage-based) weights.

A larger n maintains a longer-term memory of scarce training data. Compared with the internal memory cell of an LSTM, these two parameters allow the rate of writing to the external memory to be adjusted, which gives more freedom in how the model is updated. The hidden state h_t output in equation (12) is then written to memory according to the write weights.
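A hedged sketch of the write mechanism just described: decayed usage weights, an indicator of the n least-used slots obtained from the n-th smallest usage value, a learnable sigmoid gate blending the previous read weights with that indicator, and an additive write of h_t. The decay rate, the additive write, and the exact usage update below are assumptions consistent with the description:

```python
import torch

def write_memory(h_t, M_prev, w_r_prev, w_u_prev, gate_param, gamma=0.95, n=1):
    """Least-used write. Shapes: h_t (D,), M_prev (S, D), w_r_prev / w_u_prev (S,)."""
    # Indicator of the n least-used slots (m(w_u, n) = n-th smallest usage value).
    nth_smallest = torch.kthvalue(w_u_prev, n).values
    w_lu = (w_u_prev <= nth_smallest).float()

    # Convex combination controlled by a learnable scalar gate parameter.
    g = torch.sigmoid(gate_param)
    w_w = g * w_r_prev + (1.0 - g) * w_lu                 # write weights

    # Additive write of the controller state to memory.
    M_t = M_prev + w_w.unsqueeze(-1) * h_t.unsqueeze(0)

    # Attenuate the previous usage and accumulate current read/write activity.
    w_u = gamma * w_u_prev + w_r_prev + w_w
    return M_t, w_w, w_u

# Example usage with illustrative sizes.
S, D = 128, 512
M_prev = torch.randn(S, D)
h_t = torch.randn(D)
w_r_prev = torch.softmax(torch.randn(S), dim=0)
w_u_prev = torch.rand(S)
gate_param = torch.zeros(1, requires_grad=True)           # learnable gate
M_t, w_w, w_u = write_memory(h_t, M_prev, w_r_prev, w_u_prev, gate_param)
```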

Answer Reasoning:

The hidden state h_t and the memory reading r_t from the external memory unit are combined as the representation of the current question and image, fed into the classification network, and the network outputs a distribution over the answers.
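A minimal sketch of the answer classifier, assuming a single hidden layer and an illustrative answer vocabulary size (both are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

# Sketch: the classifier takes the concatenation of the controller state h_t
# and the retrieved memory r_t and produces a distribution over candidate answers.
hidden_dim, num_answers = 512, 3000

classifier = nn.Sequential(
    nn.Linear(2 * hidden_dim, hidden_dim),
    nn.Tanh(),
    nn.Linear(hidden_dim, num_answers),
)

h_t = torch.randn(1, hidden_dim)
r_t = torch.randn(1, hidden_dim)
logits = classifier(torch.cat([h_t, r_t], dim=-1))
answer_probs = torch.softmax(logits, dim=-1)   # distribution over answers
```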

