Abstract
Assessing scientific argumentation is one of the main challenges in science education. Constructed-response (CR) items can be used to measure the coherence of student ideas and to inform science instruction on argumentation. Published research on automated scoring of CR items has been conducted mostly on English writing and rarely in other languages. The objective of this study is to investigate issues related to the automated scoring of Chinese written responses. LightSIDE was used to score students’ written responses in Chinese. The sample consisted of 4000 grade 7–9 students from Beijing. Items developed by the Stanford NGSS Assessment Project on an ecological topic were translated into Chinese and used to assess students’ competence in interpreting data and making claims. The results show that (1) at least 800 human-scored student responses were needed as the training sample to build accurate scoring models; doubling the training sample size increased kappa only slightly, by 0.03–0.04; (2) there was nearly perfect agreement between human scoring and computer-automated scoring for both holistic and analytic scores, with analytic scores yielding slightly higher accuracy than holistic scores; and (3) automated scoring accuracy did not differ substantially by response length, although shorter responses produced slightly higher human-machine agreement. These findings suggest that automated scoring of Chinese written responses achieves a level of accuracy similar to that reported for English writing in the literature, although specific factors, e.g., training data set size, scoring rubric, and text length, need to be considered when applying automated scoring to student written responses in Chinese.
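To make the workflow summarized above concrete, the sketch below illustrates, in outline only, how a scoring model might be trained on human-scored Chinese responses and evaluated against human raters with a kappa statistic. The study itself used LightSIDE; the scikit-learn pipeline, the character n-gram features, the quadratic weighting of kappa, and the file and column names here are assumptions for illustration, not the authors’ actual procedure.

```python
# Minimal sketch of the human-machine agreement workflow, assuming a CSV of
# Chinese responses with human scores. Not the study's LightSIDE pipeline.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical data: one written response per row plus a human-assigned score.
data = pd.read_csv("responses_zh.csv")  # assumed columns: "response", "human_score"

# Hold out a test set; the study found roughly 800 human-scored responses
# sufficient for training.
X_train, X_test, y_train, y_test = train_test_split(
    data["response"], data["human_score"], train_size=800, random_state=0
)

# Character n-grams sidestep Chinese word segmentation (a modeling assumption here).
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Agreement between machine scores and human scores, reported as weighted kappa
# (the quadratic weighting is an assumption).
machine_scores = model.predict(X_test)
kappa = cohen_kappa_score(y_test, machine_scores, weights="quadratic")
print(f"Human-machine agreement (weighted kappa): {kappa:.2f}")
```

A larger training set can be swapped in by raising `train_size`, which mirrors the study’s comparison of training sample sizes and the small resulting gain in kappa.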
Funding
This study was funded by the International Joint Research Project of Faculty of Education, Beijing Normal University. Cong Wang was supported by China Scholarship Council (CSC) Grant #201806040088.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Ethical Approval
The authors declare that there was no ethical violation.
Informed Consent
The authors declare that there was no violation of informed consent.
Cite this article
Wang, C., Liu, X., Wang, L. et al. Automated Scoring of Chinese Grades 7–9 Students’ Competence in Interpreting and Arguing from Evidence. J Sci Educ Technol 30, 269–282 (2021). https://doi.org/10.1007/s10956-020-09859-z