JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus

Makoto Morishita, Katsuki Chousa, Jun Suzuki, Masaaki Nagata

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Most current machine translation models are mainly trained with parallel corpora, and their translation accuracy largely depends on the quality and quantity of the corpora. Although there are billions of parallel sentences for a few language pairs, effectively dealing with most language pairs is difficult due to a lack of publicly available parallel corpora. This paper creates a large parallel corpus for English-Japanese, a language pair for which only limited resources are available, compared to such resource-rich languages as English-German. It introduces a new web-based English-Japanese parallel corpus named JParaCrawl v3.0. Our new corpus contains more than 21 million unique parallel sentence pairs, which is more than twice as many as the previous JParaCrawl v2.0 corpus. Through experiments, we empirically show how our new corpus boosts the accuracy of machine translation models on various domains. The JParaCrawl v3.0 corpus will eventually be publicly available online for research purposes.

Original languageEnglish
Title of host publication2022 Language Resources and Evaluation Conference, LREC 2022
EditorsNicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
PublisherEuropean Language Resources Association (ELRA)
Pages6704-6710
Number of pages7
ISBN (Electronic)9791095546726
Publication statusPublished - 2022
Externally publishedYes
Event13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: 2022 Jun 202022 Jun 25

Publication series

Name2022 Language Resources and Evaluation Conference, LREC 2022

Conference

Conference13th International Conference on Language Resources and Evaluation Conference, LREC 2022
Country/TerritoryFrance
CityMarseille
Period22/6/2022/6/25

Keywords

  • English
  • Japanese
  • machine translation
  • parallel corpus

ASJC Scopus subject areas

  • Language and Linguistics
  • Library and Information Sciences
  • Linguistics and Language
  • Education

Fingerprint

Dive into the research topics of 'JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus'. Together they form a unique fingerprint.

Cite this