TY - GEN
T1 - Multi-task learning of hierarchical vision-language representation
AU - Nguyen, Duy Kien
AU - Okatani, Takayuki
N1 - Funding Information:
This work was partly supported by JSPS KAKENHI Grant Number JP15H05919 and JST CREST Grant Number JPMJCR14D1.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/6
Y1 - 2019/6
N2 - It is still challenging to build an AI system that can perform tasks involving vision and language at a human level. So far, researchers have tackled individual tasks separately, designing a network for each task and training it on that task's dedicated datasets. Although this approach has seen a certain degree of success, it makes it difficult to understand the relations among different tasks and to transfer the knowledge learned for one task to others. We propose a multi-task learning approach that learns a vision-language representation shared by many tasks from their diverse datasets. The representation is hierarchical, and the prediction for each task is computed from the representation at its corresponding level of the hierarchy. We show through experiments that our method consistently outperforms previous single-task-learning methods on image caption retrieval, visual question answering, and visual grounding. We also analyze the learned hierarchical representation by visualizing attention maps generated in our network.
AB - It is still challenging to build an AI system that can perform tasks involving vision and language at a human level. So far, researchers have tackled individual tasks separately, designing a network for each task and training it on that task's dedicated datasets. Although this approach has seen a certain degree of success, it makes it difficult to understand the relations among different tasks and to transfer the knowledge learned for one task to others. We propose a multi-task learning approach that learns a vision-language representation shared by many tasks from their diverse datasets. The representation is hierarchical, and the prediction for each task is computed from the representation at its corresponding level of the hierarchy. We show through experiments that our method consistently outperforms previous single-task-learning methods on image caption retrieval, visual question answering, and visual grounding. We also analyze the learned hierarchical representation by visualizing attention maps generated in our network.
KW - Categorization
KW - Deep Learning
KW - Recognition: Detection
KW - Representation Learning
KW - Retrieval
KW - Vision + Language
UR - http://www.scopus.com/inward/record.url?scp=85078775859&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078775859&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2019.01074
DO - 10.1109/CVPR.2019.01074
M3 - Conference contribution
AN - SCOPUS:85078775859
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 10484
EP - 10493
BT - Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
PB - IEEE Computer Society
T2 - 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
Y2 - 16 June 2019 through 20 June 2019
ER -