TY - JOUR
T1 - Corruption Is Not All Bad
T2 - Incorporating Discourse Structure into Pre-Training via Corruption for Essay Scoring
AU - Mim, Farjana Sultana
AU - Inoue, Naoya
AU - Reisert, Paul
AU - Ouchi, Hiroki
AU - Inui, Kentaro
N1 - Funding Information:
Manuscript received October 21, 2020; revised May 3, 2021; accepted May 31, 2021. Date of publication June 10, 2021; date of current version July 2, 2021. This work was supported in part by JSPS KAKENHI under Grant 19K20332 and in part by JST CREST under Grant JPMJCR20D2. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Zoraida Callejas. (Corresponding author: Farjana Sultana Mim.) Farjana Sultana Mim is with the Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 980-8579, Japan (e-mail: farjana.mim59@gmail.com).
Publisher Copyright:
© 2014 IEEE.
PY - 2021
Y1 - 2021
N2 - Existing approaches for automated essay scoring and document representation learning typically rely on discourse parsers to incorporate discourse structure into text representation. However, the performance of parsers is not always adequate, especially when they are used on noisy texts, such as student essays. In this paper, we propose an unsupervised pre-training approach to capture the discourse structure of essays in terms of coherence and cohesion that does not require any discourse parser or annotation. We introduce several types of token-, sentence-, and paragraph-level corruption techniques for our proposed pre-training approach and augment masked language modeling pre-training with our pre-training method to leverage both contextualized and discourse information. Our proposed unsupervised approach achieves a new state-of-the-art result on the task of essay Organization scoring.
KW - Automated Essay Scoring
KW - Coherence
KW - Cohesion
KW - Corruption
KW - Discourse
KW - Natural Language Processing
KW - Pre-training
KW - Unsupervised Learning
UR - http://www.scopus.com/inward/record.url?scp=85112160103&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112160103&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2021.3088223
DO - 10.1109/TASLP.2021.3088223
M3 - Article
AN - SCOPUS:85112160103
SN - 2329-9290
VL - 29
SP - 2202
EP - 2215
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1 - 9451631
ER -