Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering

Duy Kien Nguyen, Takayuki Okatani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

219 Citations (Scopus)

Abstract

The key to visual question answering (VQA) lies in how to fuse the visual and language features extracted from an input image and question. We show that an attention mechanism enabling dense, bi-directional interactions between the two modalities helps boost the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation demonstrating how the proposed attention mechanism generates reasonable attention maps on images and questions, which leads to correct answer prediction.
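
As a rough illustration of the bi-directional attention the abstract describes, the following is a minimal sketch in plain Python. It is not the authors' implementation: the dot-product affinity, the function names (`symmetric_co_attention`, `weighted_sum`) and the toy feature vectors are all assumptions made for illustration, and learned projections and the fusion step are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Return sum_k weights[k] * vectors[k] as a new vector."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def symmetric_co_attention(words, regions):
    """Toy bi-directional attention between word and region features.

    words:   list of N word vectors, each of length d
    regions: list of T region vectors, each of length d
    Returns (attended_regions_per_word, attended_words_per_region).
    """
    # Affinity for every (word, region) pair.
    # Assumption: a plain dot product stands in for the real scoring function.
    affinity = [[dot(w, r) for r in regions] for w in words]

    # Each question word attends over all image regions (softmax along rows).
    word_to_region = [softmax(row) for row in affinity]

    # Each image region attends over all question words (softmax along columns).
    region_to_word = [
        softmax([affinity[i][j] for i in range(len(words))])
        for j in range(len(regions))
    ]

    attended_regions = [weighted_sum(wts, regions) for wts in word_to_region]
    attended_words = [weighted_sum(wts, words) for wts in region_to_word]
    return attended_regions, attended_words


# Hypothetical toy input: 3 question words and 2 image regions, d = 2.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
regions = [[2.0, 0.0], [0.0, 2.0]]
per_word, per_region = symmetric_co_attention(words, regions)
```

Both directions are computed from the same affinity matrix, which is what makes the sketch symmetric between the two modalities. A stacked variant would fuse each layer's attended features with its inputs and pass the result to the next layer; that fusion step is not shown here.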

Original language: English
Title of host publication: Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Publisher: IEEE Computer Society
Pages: 6087-6096
Number of pages: 10
ISBN (Electronic): 9781538664209
Publication status: Published - 2018 Dec 14
Event: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 - Salt Lake City, United States
Duration: 2018 Jun 18 → 2018 Jun 22

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Country/Territory: United States
City: Salt Lake City
Period: 18/6/18 → 18/6/22

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
