TY - GEN
T1 - Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering
AU - Nguyen, Duy Kien
AU - Okatani, Takayuki
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/12/14
Y1 - 2018/12/14
N2 - A key to visual question answering (VQA) lies in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism enabling dense, bi-directional interactions between the two modalities boosts the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. Experiments show that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation demonstrating how the proposed attention mechanism generates reasonable attention maps on images and questions, leading to correct answer prediction.
AB - A key to visual question answering (VQA) lies in how to fuse visual and language features extracted from an input image and question. We show that an attention mechanism enabling dense, bi-directional interactions between the two modalities boosts the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. Experiments show that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation demonstrating how the proposed attention mechanism generates reasonable attention maps on images and questions, leading to correct answer prediction.
UR - http://www.scopus.com/inward/record.url?scp=85061705988&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061705988&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2018.00637
DO - 10.1109/CVPR.2018.00637
M3 - Conference contribution
AN - SCOPUS:85061705988
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 6087
EP - 6096
BT - Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
PB - IEEE Computer Society
T2 - 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Y2 - 18 June 2018 through 22 June 2018
ER -
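
The abstract above describes a dense, fully symmetric co-attention in which every question word attends to image regions and every image region attends to question words, with layers stackable into a hierarchy. The following is a minimal, hypothetical PyTorch sketch of that idea only; the layer sizes, projections, and residual fusion step are assumptions made here for illustration and are not the authors' implementation.

# Minimal sketch of dense symmetric co-attention (assumed details, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSymmetricCoAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Bilinear affinity between the two modalities (assumed form).
        self.affinity = nn.Linear(dim, dim, bias=False)
        self.visual_proj = nn.Linear(dim, dim)
        self.question_proj = nn.Linear(dim, dim)

    def forward(self, visual, question):
        # visual:   (batch, num_regions, dim) image region features
        # question: (batch, num_words,   dim) question word features
        # Dense affinity between every image region and every question word.
        affinity = torch.bmm(self.affinity(visual), question.transpose(1, 2))

        # Each image region attends over question words (softmax over words) ...
        attn_regions_over_words = F.softmax(affinity, dim=2)
        # ... and each question word attends over image regions (softmax over regions).
        attn_words_over_regions = F.softmax(affinity, dim=1)

        # Attended summaries of the opposite modality.
        question_for_regions = torch.bmm(attn_regions_over_words, question)
        visual_for_words = torch.bmm(attn_words_over_regions.transpose(1, 2), visual)

        # Simple residual fusion (an assumption) so layers can be stacked
        # into the multi-step hierarchy the abstract describes.
        visual_out = visual + self.visual_proj(question_for_regions)
        question_out = question + self.question_proj(visual_for_words)
        return visual_out, question_out

# Usage: stack a few layers for multi-step image-question interaction.
if __name__ == "__main__":
    v = torch.randn(2, 36, 512)  # 36 region features per image (assumed)
    q = torch.randn(2, 14, 512)  # 14 word features per question (assumed)
    layers = nn.ModuleList([DenseSymmetricCoAttention(512) for _ in range(3)])
    for layer in layers:
        v, q = layer(v, q)
    print(v.shape, q.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 14, 512])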