Symbolizing Visual Features for Pre-training with Unlabeled Images

Yuichi Kamata, Moyuru Yamada, Keizo Kato, Akira Nakagawa, Takayuki Okatani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Multi-layer Transformers, which have shown strong performance in natural language processing (NLP), have recently been applied to multi-modal learning tasks involving both text and images. For the NLP part of multi-modal learning, pre-training the Transformer parameters on large amounts of unlabeled text data has been shown to improve accuracy. For the image part, however, no studies have demonstrated the validity of pre-training, even though the prospect of leveraging knowledge obtained from large amounts of unlabeled image data is intuitively appealing. This paper aims to construct a single-modal pre-training model based on a Transformer in the image domain for multi-modal learning of text and images. We find that, unlike the discrete values representing word embeddings, continuous values such as image features are difficult for current Transformers to handle. To overcome this limitation, we propose a Transformer with a list of features, named SymboList, which converts the continuous image features of detected objects into discrete ones by referring to a discrete key list. We demonstrate that our proposed method leads to effective image pre-training and benefits the multi-modal downstream task.
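The core idea of converting continuous detected-object features into discrete symbols via a key list can be illustrated with a nearest-key lookup, in the spirit of vector quantization. This is a minimal sketch only: the abstract does not specify the exact matching rule or how the key list is learned, so the Euclidean nearest-neighbor assignment, the key-list size, and all names below are assumptions for illustration.

```python
import numpy as np

def symbolize(features, key_list):
    """Map each continuous feature vector to its nearest key in a discrete
    key list, yielding a discrete symbol id per detected object.

    features: (n_objects, d) continuous image features from an object detector
    key_list: (n_keys, d) discrete key list (assumed learned elsewhere)
    Returns:  (n_objects,) array of key indices (the discrete symbols)
    """
    # Squared Euclidean distance between every feature and every key
    dists = ((features[:, None, :] - key_list[None, :, :]) ** 2).sum(axis=-1)
    # Each object is symbolized as the index of its closest key
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 8))    # hypothetical key list of 16 symbols
feats = rng.normal(size=(4, 8))    # features of 4 detected objects
symbols = symbolize(feats, keys)   # 4 discrete symbol ids in [0, 16)
```

Once features are discretized this way, they can be treated like word tokens, which is what makes masked pre-training objectives from NLP applicable to the image side.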

Original language: English
Title of host publication: Pattern Recognition - 6th Asian Conference, ACPR 2021, Revised Selected Papers
Editors: Christian Wallraven, Qingshan Liu, Hajime Nagahara
Publisher: Springer Science and Business Media Deutschland GmbH
Number of pages: 14
ISBN (Print): 9783031024436
Publication status: Published - 2022
Event: 6th Asian Conference on Pattern Recognition, ACPR 2021 - Virtual, Online
Duration: 2021 Nov 9 - 2021 Nov 12

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13189 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 6th Asian Conference on Pattern Recognition, ACPR 2021
City: Virtual, Online


Keywords

  • Image pre-training
  • Multi-modal transformer
  • Visual Question Answering

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)


