TY - GEN
T1 - Symbolizing Visual Features for Pre-training with Unlabeled Images
AU - Kamata, Yuichi
AU - Yamada, Moyuru
AU - Kato, Keizo
AU - Nakagawa, Akira
AU - Okatani, Takayuki
N1 - Publisher Copyright:
© 2022, Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Multi-layer Transformers, which have shown good performance in natural language processing (NLP), have recently started to be used in multi-modal learning tasks that involve both text and images. In the NLP part of multi-modal learning, pre-training the parameters of Transformers on large unlabeled text data has been shown to improve accuracy. For the image part of the Transformer, however, no reports have demonstrated the validity of pre-training, even though, intuitively, the prospect of leveraging knowledge obtained from large amounts of unlabeled image data is appealing. This paper aims to construct a single-modal pre-training model based on a Transformer in the image domain for multi-modal learning of text and images. We have found that, unlike the discrete values representing word embeddings, current Transformers have trouble handling continuous values such as image features. To overcome this limitation, we propose a Transformer with a feature list, named SymboList, which converts the continuous image features of detected objects into discrete ones by referring to a discrete key list. We demonstrate that our proposed method leads to effective image pre-training and benefits the multi-modal downstream task.
AB - Multi-layer Transformers, which have shown good performance in natural language processing (NLP), have recently started to be used in multi-modal learning tasks that involve both text and images. In the NLP part of multi-modal learning, pre-training the parameters of Transformers on large unlabeled text data has been shown to improve accuracy. For the image part of the Transformer, however, no reports have demonstrated the validity of pre-training, even though, intuitively, the prospect of leveraging knowledge obtained from large amounts of unlabeled image data is appealing. This paper aims to construct a single-modal pre-training model based on a Transformer in the image domain for multi-modal learning of text and images. We have found that, unlike the discrete values representing word embeddings, current Transformers have trouble handling continuous values such as image features. To overcome this limitation, we propose a Transformer with a feature list, named SymboList, which converts the continuous image features of detected objects into discrete ones by referring to a discrete key list. We demonstrate that our proposed method leads to effective image pre-training and benefits the multi-modal downstream task.
KW - Image pre-training
KW - Multi-modal transformer
KW - Visual Question Answering
UR - http://www.scopus.com/inward/record.url?scp=85130273096&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85130273096&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-02444-3_37
DO - 10.1007/978-3-031-02444-3_37
M3 - Conference contribution
AN - SCOPUS:85130273096
SN - 9783031024436
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 490
EP - 503
BT - Pattern Recognition - 6th Asian Conference, ACPR 2021, Revised Selected Papers
A2 - Wallraven, Christian
A2 - Liu, Qingshan
A2 - Nagahara, Hajime
PB - Springer Science and Business Media Deutschland GmbH
T2 - 6th Asian Conference on Pattern Recognition, ACPR 2021
Y2 - 9 November 2021 through 12 November 2021
ER -