Hierarchical Question-Image Co-Attention for Visual Question Answering
Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
(Submitted (v1), last revised Jan (this version, v5))
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that, in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural network (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.
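As a rough illustration of the co-attention idea summarized in the abstract, the sketch below shows one way to jointly attend over image regions and question words through a shared affinity matrix. It is a minimal, hypothetical PyTorch example under assumed shapes and parameter names (ParallelCoAttention, W_b, W_v, W_q, w_hv, w_hq are illustrative choices, not the authors' released code), and it omits the hierarchical word/phrase/question levels built with the 1-D CNN.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    # Jointly attend over image regions and question words via an affinity matrix.
    def __init__(self, d, k):
        super().__init__()
        self.W_b = nn.Parameter(0.01 * torch.randn(d, d))    # affinity transform
        self.W_v = nn.Parameter(0.01 * torch.randn(k, d))    # image projection
        self.W_q = nn.Parameter(0.01 * torch.randn(k, d))    # question projection
        self.w_hv = nn.Parameter(0.01 * torch.randn(k, 1))   # image attention scorer
        self.w_hq = nn.Parameter(0.01 * torch.randn(k, 1))   # question attention scorer

    def forward(self, V, Q):
        # V: (batch, d, N) image region features; Q: (batch, d, T) word features
        C = torch.tanh(Q.transpose(1, 2) @ self.W_b @ V)                     # (batch, T, N) word-region affinity
        H_v = torch.tanh(self.W_v @ V + (self.W_q @ Q) @ C)                  # (batch, k, N)
        H_q = torch.tanh(self.W_q @ Q + (self.W_v @ V) @ C.transpose(1, 2))  # (batch, k, T)
        a_v = F.softmax(self.w_hv.t() @ H_v, dim=-1)                         # attention over image regions
        a_q = F.softmax(self.w_hq.t() @ H_q, dim=-1)                         # attention over question words
        v_hat = (a_v * V).sum(dim=-1)                                        # (batch, d) attended image vector
        q_hat = (a_q * Q).sum(dim=-1)                                        # (batch, d) attended question vector
        return v_hat, q_hat

# Example with illustrative shapes: 49 regions from a 7x7 feature map, a 20-word question
layer = ParallelCoAttention(d=512, k=256)
v_hat, q_hat = layer(torch.randn(2, 512, 49), torch.randn(2, 512, 20))

The attended image and question vectors would then be combined (e.g., fused and fed to a classifier over answers); the exact fusion and the hierarchical levels follow the paper rather than this sketch.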
Comments: Pages, 7 figures, 3 tables; in Conference on Neural Information Processing Systems (NIPS)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:1606.00061 [cs.CV] (or arXiv:1606.00061v5 [cs.CV] for this version)
Submission history
From: Jiasen Lu
[v1] Tue, 22:02:01 GMT (4284kb,d)
[v2] Thu, 2 June 01:51:13 GMT (3549kb,d)
[v3] Wed, Oct 02:15:57 GMT (3669kb,d)
[v4] Fri, 16:18:03 GMT (3669kb,d)
[v5] Thu, 05:03:33 GMT (3669kb,d)