"Image understanding" show, attend and tell algorithm detailed _ depth learning

Xu, Kelvin, et al. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv preprint arXiv:1502.03044 (2015).

The attention mechanism is one of the hot topics in current deep learning. It lets a model focus on different parts of the input in turn and produce a sequence of interpretations. This paper is a representative work on attention, and it tackles image captioning ("look and tell"), one of the hard problems in image understanding.

The authors provide source code based on Theano (here); in addition, enthusiastic contributors have released a TensorFlow implementation (poke here).

This article walks through the paper's algorithm in detail, cross-referencing the TensorFlow version of the source code.

Data

From input to output, the model goes through two stages: encoding and decoding.

Analogy: in machine translation, the encoder turns the source language into basic semantic features, and the decoder turns those semantic features into the target language.

Input: image I
Features (annotations): {a1 ... ai ... aL}
Contexts: {z1 ... zt ... zC}
Output (caption): {y1 ... yt ... yC}

I is the input color image.
The sequence of words yt forms a sentence, the "caption". The sentence length C is not fixed. Each word yt is a K-dimensional probability vector, where K is the size of the dictionary.

Each ai is a D-dimensional feature; there are L of them in total, and each describes a different region of the image.
Each zt is also a D-dimensional feature; there are C of them, one representing the context of each word.

All of the ai are generated at once, but the zt are generated one word at a time, so the subscript t is used to emphasize that they are estimated step by step.

Network structure

Encoding (I→a)

The input image I is normalized to 224x224. The feature a directly uses the 14x14x512 feature map of the conv5_3 layer in an off-the-shelf VGG network [1]. Number of regions L = 14x14 = 196, dimension D = 512.
Relatively low-level features are used so that local content is described better.
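As a minimal sketch of this encoding step (assuming the VGG16 bundled with tf.keras, whose layer "block5_conv3" corresponds to conv5_3; the paper itself uses the Oxford VGGnet), the annotation a can be extracted like this:

```python
import tensorflow as tf

# Off-the-shelf VGG used as a fixed feature extractor (not fine-tuned here).
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
encoder = tf.keras.Model(inputs=vgg.input,
                         outputs=vgg.get_layer("block5_conv3").output)

image = tf.random.uniform((1, 224, 224, 3))  # stand-in for a preprocessed 224x224 image
a = encoder(image)                           # (1, 14, 14, 512) feature map
a = tf.reshape(a, (-1, 196, 512))            # L = 196 annotation vectors, D = 512
```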

Encoding is performed only once; decoding proceeds word by word, so all of the network variables below carry the step subscript t.

Context generation (a→z)

The context zt of the current step is a weighted sum of the annotation a, with weights αt [2]. Like ai, zt is also a D-dimensional vector:

zt = αtᵀ ⋅ a

The dimension of αt is L = 196; it records how much attention each spatial position of the annotation a receives.
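As a sketch (array shapes follow the definitions above), the soft-attention context is just a weighted average of the annotation vectors:

```python
import numpy as np

def soft_context(a, alpha_t):
    """Soft-attention context: zt = sum_i alpha_t[i] * a[i].

    a       : (L, D) annotation vectors, L = 196, D = 512
    alpha_t : (L,)   attention weights for step t, non-negative and summing to 1
    """
    return alpha_t @ a  # (D,)

# Toy usage with the paper's sizes: uniform attention gives the mean annotation.
L, D = 196, 512
a = np.random.rand(L, D)
alpha_t = np.full(L, 1.0 / L)
z_t = soft_context(a, alpha_t)  # shape (512,)
```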

The weight αt is obtained from the previous step's hidden state ht−1 through several fully connected layers; the intermediate code et carries the information from the previous step. (In the figure, gray indicates modules that contain parameters to be optimized.)

"See where" is not only related to the actual image, but also by the impact of seeing things before. For example, et−1 see the rider, the next should look for horses.

The very first step is determined entirely by the image feature a, since there is no previous hidden state yet; in the paper, the initial hidden state and memory are computed from the mean of the annotation vectors.
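A minimal sketch of such an attention network (the layer sizes and the exact form of the scoring MLP are assumptions; the idea is to score every region from ai and ht−1, then normalize with a softmax):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(a, h_prev, W_a, W_h, w):
    """Compute attention weights alpha_t over the L regions.

    a      : (L, D) annotation vectors
    h_prev : (m,)   previous hidden state
    W_a    : (D, k) projection of each annotation (k is an assumed hidden size)
    W_h    : (m, k) projection of the previous hidden state
    w      : (k,)   scoring vector
    """
    e_t = np.tanh(a @ W_a + h_prev @ W_h) @ w  # (L,) unnormalized scores
    return softmax(e_t)                        # (L,) weights, sum to 1
```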

This part assigns weights over the feature map of the whole image and is called the attention network. The system's hidden state is an m = 256-dimensional feature; it is produced in the next part.

Hidden state generation (z→h)

This part uses the currently popular LSTM structure [3] to model the memory relationship between steps. Besides the hidden state ht mentioned above, it contains an input gate it, a forget gate ft, a memory cell ct, an output gate ot, and a candidate gt, six states in total. They are all m-dimensional variables.

The input gate i, output gate o, and forget gate f are three "gate variables" that control the strength of the other states. Each is computed (through a sigmoid) from the previous step's hidden state h and the current context z [4].
The candidate g describes the information that may enter the memory and is generated in the same way, except with a tanh activation.

The memory cell c is the core of the LSTM: it is a weighted combination of the previous step's memory and the current candidate g. The forget gate f controls how much of the previous memory is kept, and the input gate i controls how much of the current candidate is admitted:
ct = ft ⊙ ct−1 + it ⊙ gt

The hidden state h is derived from the memory, with its strength controlled by the output gate o:
ht = ot ⊙ tanh(ct)

The entire LSTM is assembled as shown below; the h and c from the previous step are fed into the current step.
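A minimal sketch of one such LSTM step (parameter names and the concatenated-input form are illustrative; as described above, the gates depend on ht−1 and zt):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, z_t, p):
    """One decoding step of the LSTM sketched above.

    h_prev, c_prev : (m,)  previous hidden state and memory
    z_t            : (D,)  current context vector
    p              : dict of weights; each W_* is (m, m+D), each b_* is (m,)
    """
    x = np.concatenate([h_prev, z_t])       # gates depend on h_{t-1} and z_t
    i_t = sigmoid(p["W_i"] @ x + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] @ x + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_o"] @ x + p["b_o"])  # output gate
    g_t = np.tanh(p["W_g"] @ x + p["b_g"])  # candidate memory
    c_t = f_t * c_prev + i_t * g_t          # ct = ft ⊙ ct−1 + it ⊙ gt
    h_t = o_t * np.tanh(c_t)                # ht = ot ⊙ tanh(ct)
    return h_t, c_t
```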
Sentence generation (h→y)

The current hidden state ht generates the current word yt through a fully connected network followed by a softmax over the K dictionary entries.
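A sketch of this output layer (a single fully connected projection is assumed here; the actual implementation may stack more layers):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_distribution(h_t, W_out, b_out):
    """Map the hidden state to a probability over the dictionary.

    h_t   : (m,)    current hidden state
    W_out : (K, m)  output projection, K = dictionary size (10000 here)
    b_out : (K,)    output bias
    """
    return softmax(W_out @ h_t + b_out)

# At generation time the word can be taken as the most probable entry:
# y_t = np.argmax(word_distribution(h_t, W_out, b_out))
```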
Review

To summarize the complete model:
The image features are produced by a convolutional network; based on the previous state of the system, the model decides where to look now; the attention weights are applied to the features to obtain the current context; from the previous system state and the current context, the hidden state is updated; and the hidden state directly yields the current word.
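Putting the pieces together, one decoding step can be sketched as follows (reusing the attention, lstm_step, and word_distribution sketches above; all parameter names are illustrative):

```python
def decode_step(a, h_prev, c_prev, p):
    """Attend over the image, build the context, update the LSTM, emit a word.

    a : (L, D) annotation matrix from the encoder.
    """
    alpha_t = attention(a, h_prev, p["W_a"], p["W_h"], p["w"])  # where to look now
    z_t = alpha_t @ a                                           # current context, shape (D,)
    h_t, c_t = lstm_step(h_prev, c_prev, z_t, p)                # update hidden state and memory
    p_t = word_distribution(h_t, p["W_out"], p["b_out"])        # distribution over the dictionary
    return p_t, h_t, c_t, alpha_t
```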

Training data

This paper uses three databases: Flickr8k, Flickr30k, and MS COCO. Each sample contains a picture together with several human-annotated sentences. The dictionary size is K = 10000.

Optimization

For efficiency, each mini-batch is built from samples whose sentences have the same length; the mini-batch size is 64.
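A sketch of this same-length bucketing (the sample field names and the shuffling policy are assumptions):

```python
import random
from collections import defaultdict

def same_length_batches(samples, batch_size=64):
    """Group (image, caption) samples by caption length and yield
    mini-batches in which every sentence has the same length."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[len(sample["caption"])].append(sample)
    for bucket in buckets.values():
        random.shuffle(bucket)
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]
```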

In the final loss module, the output word of each step is compared with the corresponding word of the annotated sentence using cross entropy, and the RMSProp method is used to update the model parameters.
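A sketch of this per-step cross-entropy loss (shapes follow the definitions above; in a real TensorFlow implementation this value would feed an RMSProp optimizer):

```python
import numpy as np

def caption_loss(word_probs, target_ids):
    """Sum of per-step cross entropies for one caption.

    word_probs : (C, K) predicted word distributions, one row per step
    target_ids : (C,)   dictionary indices of the ground-truth words
    """
    eps = 1e-8  # avoid log(0)
    picked = word_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked + eps).sum())
```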

In addition, the BLEU metric commonly used in machine translation is computed on the validation set and monitored as an early-stopping criterion.
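A sketch of such BLEU-based early stopping (the patience value is an assumption; the article only states that validation BLEU is monitored):

```python
def should_stop(bleu_history, patience=5):
    """Stop when validation BLEU has not improved for `patience` evaluations."""
    if len(bleu_history) <= patience:
        return False
    return max(bleu_history[-patience:]) <= max(bleu_history[:-patience])
```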

On the largest database, MS COCO, training takes 3 days on an NVIDIA Titan Black.

Results

The BLEU and METEOR scores are improved compared with other algorithms.

What is particularly valuable is that this paper uses a single model and, without any detection module, shows where the attention falls for each word (obtained by Gaussian sampling of α).

1. K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs], Sept. 2014. ↩
2. This article and the source code cover only the soft attention method introduced in the paper. The hard attention method is more cumbersome to derive; refer to the earlier DRAM algorithm. ↩
3. An LSTM introduction suitable for getting started: http://www.open-open.com/lib/view/open1440843534638.html ↩
4. In the paper, the previous step's output yt−1 also takes part in the current step's computation; this article follows the TensorFlow source code. ↩
