TY - GEN
T1 - Cross-corpora evaluation and analysis of grammatical error correction models - Is single-corpus evaluation enough?
AU - Mita, Masato
AU - Mizumoto, Tomoya
AU - Kaneko, Masahiro
AU - Nagata, Ryo
AU - Inui, Kentaro
N1 - Funding Information:
We are grateful to the members of the Tohoku University Natural Language Processing Laboratory as well as the anonymous reviewers for their insightful comments and suggestions.
Publisher Copyright:
© 2019 Association for Computational Linguistics
PY - 2019
Y1 - 2019
AB - This study explores the necessity of performing cross-corpora evaluation for grammatical error correction (GEC) models. GEC models have been previously evaluated based on a single commonly applied corpus: the CoNLL-2014 benchmark. However, the evaluation remains incomplete because the task difficulty varies depending on the test corpus and conditions such as the proficiency levels of the writers and essay topics. To overcome this limitation, we evaluate the performance of several GEC models, including NMT-based (LSTM, CNN, and transformer) and an SMT-based model, against various learner corpora (CoNLL-2013, CoNLL-2014, FCE, JFLEG, ICNALE, and KJ). Evaluation results reveal that the models' rankings considerably vary depending on the corpus, indicating that single-corpus evaluation is insufficient for GEC models.
UR - http://www.scopus.com/inward/record.url?scp=85085553795&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85085553795&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85085553795
T3 - NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
SP - 1309
EP - 1314
BT - Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019
Y2 - 2 June 2019 through 7 June 2019
ER -