Extracting representative subset from extensive text data for training pre-trained language models

Jun Suzuki, Heiga Zen, Hideto Kazawa

Research output: Article › peer-review

Abstract

This paper investigates whether a representative subset extracted from a large original dataset can achieve the same performance level as the entire dataset in the context of training neural language models. We employ a likelihood-based scoring method built on two distinct types of pre-trained language models to select a representative subset. We conduct our experiments on 17 widely used natural language processing datasets with 24 evaluation metrics. The experimental results show that the representative subset obtained using the likelihood difference score can achieve a 90% performance level even when the dataset is reduced to approximately two to three orders of magnitude smaller than the original. We also compare the performance with models trained on randomly selected subsets of the same size to demonstrate the effectiveness of the representative subset.
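The record does not spell out the scoring procedure, so the following is a minimal sketch of one plausible likelihood-difference pipeline consistent with the abstract: score each example by the difference of its average log-likelihood under two pre-trained language models, then keep the highest-scoring examples. The model names (gpt2, distilgpt2), the per-token averaging, and the top-k selection rule are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: two distinct pre-trained LMs (placeholder choices,
# not the models used in the paper).
device = "cuda" if torch.cuda.is_available() else "cpu"
tok_a = AutoTokenizer.from_pretrained("gpt2")
lm_a = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
tok_b = AutoTokenizer.from_pretrained("distilgpt2")
lm_b = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device).eval()

@torch.no_grad()
def avg_log_likelihood(model, tokenizer, text):
    """Per-token average log-likelihood of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # With labels=input_ids, the model returns the mean cross-entropy,
    # i.e. the negative average log-likelihood per token.
    loss = model(ids, labels=ids).loss
    return -loss.item()

def likelihood_difference(text):
    """Score an example by the gap between the two models' likelihoods."""
    return avg_log_likelihood(lm_a, tok_a, text) - avg_log_likelihood(lm_b, tok_b, text)

def select_subset(corpus, k):
    """Keep the k highest-scoring examples as the representative subset."""
    return sorted(corpus, key=likelihood_difference, reverse=True)[:k]
```

Ranking by the score and taking the top k is only one way to turn per-example scores into a subset; thresholding or stratified sampling over the score distribution would be equally compatible with the abstract's description.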

Original language: English
Article number: 103249
Journal: Information Processing and Management
Volume: 60
Issue number: 3
DOI
Publication status: Published - May 2023

ASJC Scopus subject areas

  • Information Systems
  • Media Technology
  • Computer Science Applications
  • Management Science and Operations Research
  • Library and Information Sciences

