
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems

Published: 08 January 2022

Abstract

Automatic scoring engines have scored approximately fifteen million test-takers in the last three years alone, and this number continues to grow with COVID-19 and the associated automation of education and testing. Yet the literature on testing these ‘intelligent’ models is sparse: most papers propose new models and demonstrate efficacy solely through quadratic weighted kappa (QWK) agreement with human raters. This effectively ignores the multi-feature nature of essay scoring, which depends on attributes such as coherence, grammar, and relevance; to date, no study has tested Automated Essay Scoring (AES) systems holistically across all these features. With this motivation, we propose a model-agnostic adversarial evaluation scheme, with associated metrics, for testing the natural language understanding capabilities and overall robustness of AES systems. We apply the scheme to five recent state-of-the-art models, ranging from feature-engineering-based approaches to the latest deep learning algorithms. We find that AES models are highly overstable: even heavy modifications (as much as 25% of a response) with content unrelated to the question topic do not decrease the scores the models produce. On the contrary, unrelated content on average increases the scores, suggesting that the models’ evaluation strategies and rubrics should be reconsidered. Finally, we conduct a survey with 200 human raters and find that they easily detect the differences between original and perturbed responses, and generally disagree with the scores assigned by the automatic scorers.
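For concreteness, here is a minimal sketch of the evaluation idea the abstract describes. It is an illustration under stated assumptions, not the released toolkit: `score_essay` is a hypothetical stand-in for whichever AES model is under test (which is what makes the scheme model-agnostic), QWK is computed with scikit-learn's `cohen_kappa_score`, and the perturbation simply swaps roughly 25% of a response's sentences for off-topic filler, mirroring the overstability probe described above.

```python
# Sketch of the adversarial evaluation idea; not the authors' released toolkit.
# `score_essay` is a hypothetical callable standing in for any AES model.
import random
from sklearn.metrics import cohen_kappa_score

def quadratic_weighted_kappa(human_scores, model_scores):
    """QWK: the human-agreement metric most AES papers report."""
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")

# Filler sentences unrelated to any essay prompt (illustrative only).
OFF_TOPIC = [
    "The mitochondria is the powerhouse of the cell",
    "Stock prices fluctuated sharply last quarter",
    "The recipe calls for two cups of flour",
]

def perturb(essay: str, fraction: float = 0.25) -> str:
    """Replace roughly `fraction` of an essay's sentences with off-topic text."""
    sentences = essay.split(". ")
    k = max(1, int(len(sentences) * fraction))
    for i in random.sample(range(len(sentences)), k):
        sentences[i] = random.choice(OFF_TOPIC)
    return ". ".join(sentences)

def overstability_gap(essays, score_essay, fraction: float = 0.25) -> float:
    """Mean score change caused by the perturbation. A robust scorer should
    yield a clearly negative gap; the paper reports gaps near zero or positive."""
    deltas = [score_essay(perturb(e, fraction)) - score_essay(e) for e in essays]
    return sum(deltas) / len(deltas)
```

A positive `overstability_gap` on heavily perturbed responses is exactly the failure mode reported above: off-topic content raising, rather than lowering, the assigned score.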


Cited By

  • (2024) A Systematic Literature Review: Are Automated Essay Scoring Systems Competent in Real-Life Education Scenarios? IEEE Access, vol. 12, pp. 77639–77657. https://doi.org/10.1109/ACCESS.2024.3399163

Published In

CODS-COMAD '22: Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD). January 2022. 357 pages.

Publisher

Association for Computing Machinery, New York, NY, United States

    Funding Sources

    • Center for Design and New Media, Indraprastha Institute of Information Technology
• Infosys Centre for Artificial Intelligence, Indraprastha Institute of Information Technology
