
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems

Published: 08 January 2022

Abstract

Automatic scoring engines have scored approximately fifteen million test-takers in the last three years alone, and this number continues to grow with COVID-19 and the associated automation of education and testing. Yet the literature on testing these ‘intelligent’ models is sparse: most papers propose new models and demonstrate efficacy solely through quadratic weighted kappa (QWK) agreement with human raters. This effectively ignores the multi-feature nature of essay scoring, which depends on attributes such as coherence, grammar, and relevance; to date, no study has tested Automated Essay Scoring (AES) systems holistically across all these features. With this motivation, we propose a model-agnostic adversarial evaluation scheme, with associated metrics, for testing the natural language understanding capabilities and overall robustness of AES systems. We apply the scheme to five recent state-of-the-art models, ranging from feature-engineering-based approaches to the latest deep learning algorithms. We find that AES models are highly overstable: even heavy modifications (as much as 25% of a response) with content unrelated to the question topic do not decrease the scores the models produce. On the contrary, unrelated content on average increases the scores, suggesting that the models’ evaluation strategies and rubrics should be reconsidered. Finally, we conduct a survey with 200 human raters and find that they easily detect the differences between original and perturbed responses, and generally disagree with the scores assigned by the automatic scorers.
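For concreteness, here is a minimal sketch of the evaluation idea the abstract describes. It is an illustration under stated assumptions, not the released toolkit: `score_essay` is a hypothetical stand-in for whichever AES model is under test (which is what makes the scheme model-agnostic), QWK is computed with scikit-learn's `cohen_kappa_score`, and the perturbation simply swaps roughly 25% of a response's sentences for off-topic filler, mirroring the overstability probe described above.

```python
# Sketch of the adversarial evaluation idea; not the authors' released toolkit.
# `score_essay` is a hypothetical callable standing in for any AES model.
import random
from sklearn.metrics import cohen_kappa_score

def quadratic_weighted_kappa(human_scores, model_scores):
    """QWK: the human-agreement metric most AES papers report."""
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")

# Filler sentences unrelated to any essay prompt (illustrative only).
OFF_TOPIC = [
    "The mitochondria is the powerhouse of the cell",
    "Stock prices fluctuated sharply last quarter",
    "The recipe calls for two cups of flour",
]

def perturb(essay: str, fraction: float = 0.25) -> str:
    """Replace roughly `fraction` of an essay's sentences with off-topic text."""
    sentences = essay.split(". ")
    k = max(1, int(len(sentences) * fraction))
    for i in random.sample(range(len(sentences)), k):
        sentences[i] = random.choice(OFF_TOPIC)
    return ". ".join(sentences)

def overstability_gap(essays, score_essay, fraction: float = 0.25) -> float:
    """Mean score change caused by the perturbation. A robust scorer should
    yield a clearly negative gap; the paper reports gaps near zero or positive."""
    deltas = [score_essay(perturb(e, fraction)) - score_essay(e) for e in essays]
    return sum(deltas) / len(deltas)
```

A positive `overstability_gap` on heavily perturbed responses is exactly the failure mode reported above: off-topic content raising, rather than lowering, the assigned score.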


Cited By

  • (2024) A Systematic Literature Review: Are Automated Essay Scoring Systems Competent in Real-Life Education Scenarios? IEEE Access, vol. 12, pp. 77639–77657. https://doi.org/10.1109/ACCESS.2024.3399163

Published In

CODS-COMAD '22: Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD). January 2022. 357 pages.

Publisher

Association for Computing Machinery, New York, NY, United States

    Funding Sources

    • Center for Design and New Media, Indraprastha Institute of Information Technology
• Infosys Centre for Artificial Intelligence, Indraprastha Institute of Information Technology
