Massive Exploration of Pseudo Data for Grammatical Error Correction

Shun Kiyono, Jun Suzuki, Tomoya Mizumoto, Kentaro Inui

Research output: Contribution to journal › Article › peer-review

11 Citations (Scopus)


Collecting a large amount of training data for grammatical error correction (GEC) models has been an ongoing challenge in the field of GEC. Recently, it has become common to use data-demanding deep neural models, such as encoder-decoder models, for GEC; thus, tackling the problem of data collection has become increasingly important. The incorporation of pseudo data in the training of GEC models is one of the main approaches for mitigating the problem of data scarcity. However, a consensus is lacking on experimental configurations, namely, (i) the methods for generating pseudo data, (ii) the seed corpora used as the source of the pseudo data, and (iii) the means of optimizing the model. In this study, these configurations are thoroughly explored through a massive number of experiments, with the aim of providing an improved understanding of pseudo data. Our main experimental finding is that pretraining a model with pseudo data generated by a back-translation-based method is the most effective approach. Our findings are supported by the achievement of state-of-the-art performance on multiple benchmark test sets (the CoNLL-2014 test set and the official test set of the BEA-2019 shared task) without requiring any modifications to the model architecture. We also perform an in-depth analysis of our model with respect to the grammatical error type and proficiency level of the text. Finally, we suggest future directions for further improving model performance.
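The back-translation-based approach the abstract highlights trains a reverse (clean-to-erroneous) model and uses it to pair clean target sentences with generated noisy sources. As a minimal sketch of that pipeline, the example below substitutes a hypothetical rule-based noiser for the learned reverse model; the function names and noise probabilities are illustrative assumptions, not the paper's actual implementation.

```python
import random


def noisy_source(sentence, rng):
    """Toy stand-in for a learned clean->erroneous reverse model:
    randomly drops or duplicates tokens to inject grammatical noise.
    (Hypothetical noiser; the paper uses a trained back-translation model.)"""
    tokens = sentence.split()
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < 0.1:
            continue             # simulate a word-deletion error
        noisy.append(tok)
        if r > 0.9:
            noisy.append(tok)    # simulate a word-duplication error
    # Fall back to the original sentence if everything was deleted.
    return " ".join(noisy) if noisy else sentence


def build_pseudo_corpus(clean_sentences, seed=0):
    """Pair each clean (target) sentence with a generated noisy source,
    yielding (source, target) pairs for pretraining a GEC model."""
    rng = random.Random(seed)
    return [(noisy_source(s, rng), s) for s in clean_sentences]


clean = ["she goes to school every day .", "the cat sat on the mat ."]
pairs = build_pseudo_corpus(clean)
```

In the study's setting, the pseudo pairs produced this way are used only for pretraining; the model is then fine-tuned on genuine learner data.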

Original language: English
Article number: 9134890
Pages (from-to): 2134-2145
Number of pages: 12
Journal: IEEE/ACM Transactions on Audio, Speech and Language Processing
Publication status: Published - 2020


  • Grammars and other rewriting systems
  • Language generation
  • Machine translation
  • Natural language processing


