TY - GEN
T1 - Preventing critical scoring errors in short answer scoring with confidence estimation
AU - Funayama, Hiroaki
AU - Sasaki, Shota
AU - Matsubayashi, Yuichiroh
AU - Mizumoto, Tomoya
AU - Suzuki, Jun
AU - Mita, Masato
AU - Inui, Kentaro
N1 - Funding Information:
This work was supported by JSPS KAKENHI Grant Numbers JP 19H04162 and 19K12112. This work was also partially supported by the Bilateral Joint Research Program between the RIKEN AIP Center and Tohoku University. We would like to thank the anonymous reviewers for their insightful comments. We also appreciate Takamiya Gakuen Yoyogi Seminar for providing the data.
Publisher Copyright:
© 2020 Association for Computational Linguistics.
PY - 2020
Y1 - 2020
N2 - Many recent Short Answer Scoring (SAS) systems have employed Quadratic Weighted Kappa (QWK) as the evaluation measure of their systems. However, we hypothesize that QWK is unsatisfactory for the evaluation of the SAS systems when we consider measuring their effectiveness in actual usage. We introduce a new task formulation of SAS that matches the actual usage. In our formulation, the SAS systems should extract as many scoring predictions as possible that are not critical scoring errors (CSEs). We conduct the experiments in our new task formulation and demonstrate that a typical SAS system can predict scores with zero CSE for approximately 50% of test data at maximum by filtering out low-reliability predictions on the basis of a certain confidence estimation. This result directly indicates the possibility of reducing the scoring cost of human raters by half, which is more preferable for the evaluation of SAS systems.
AB - Many recent Short Answer Scoring (SAS) systems have employed Quadratic Weighted Kappa (QWK) as the evaluation measure of their systems. However, we hypothesize that QWK is unsatisfactory for the evaluation of the SAS systems when we consider measuring their effectiveness in actual usage. We introduce a new task formulation of SAS that matches the actual usage. In our formulation, the SAS systems should extract as many scoring predictions as possible that are not critical scoring errors (CSEs). We conduct the experiments in our new task formulation and demonstrate that a typical SAS system can predict scores with zero CSE for approximately 50% of test data at maximum by filtering out low-reliability predictions on the basis of a certain confidence estimation. This result directly indicates the possibility of reducing the scoring cost of human raters by half, which is more preferable for the evaluation of SAS systems.
UR - http://www.scopus.com/inward/record.url?scp=85117887556&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117887556&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85117887556
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 237
EP - 243
BT - ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
PB - Association for Computational Linguistics (ACL)
T2 - 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Student Research Workshop, SRW 2020
Y2 - 5 July 2020 through 10 July 2020
ER -