Research article | Open access

Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study

Published: 15 July 2024

Abstract

The integration of AI assistants, especially those powered by Large Language Models (LLMs), into computer science education has sparked significant debate, highlighting both their potential to augment student learning and the risks associated with their misuse. An emerging body of work has examined the use of LLMs in education, primarily by evaluating the performance of existing models or conducting short-term human subject studies. However, very little work has examined the impact of LLM-powered assistants on students in entry-level programming courses, particularly in real-world contexts and over extended periods. To address this research gap, we conducted a semester-long, between-subjects study with 50 students using CodeTutor, an LLM-powered assistant developed by our research team. Our results show that students who used CodeTutor (the "CodeTutor group", our experimental group) achieved statistically significant improvements in their final scores compared to peers who did not use the tool (the control group). Within the CodeTutor group, students without prior experience with LLM-powered tools demonstrated significantly greater performance gains than their counterparts. Students gave positive feedback on CodeTutor's ability to comprehend their queries and to help them learn programming language syntax, but they raised concerns about its limited role in developing critical thinking skills. Over the course of the semester, students' agreement with CodeTutor's suggestions decreased, accompanied by a growing preference for support from traditional human teaching assistants. We also found that students turned to CodeTutor for a range of tasks, including completing programming tasks, understanding syntax, and debugging, and most often sought help with programming assignments. Our analysis further reveals that the quality of user prompts was significantly correlated with the effectiveness of CodeTutor's responses. Building on these results, we discuss the implications of our findings: the need to integrate Generative AI literacy into curricula to foster critical thinking skills, and the importance of examining the temporal dynamics of student engagement with LLM-powered tools. We further discuss the discrepancy between the anticipated functions of such tools and students' actual capabilities, which highlights the need for tailored strategies to improve educational outcomes.
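
The abstract does not describe how CodeTutor is implemented, but for readers unfamiliar with this class of tools, the sketch below shows one minimal way an LLM-powered tutoring assistant can be wired together: a chat-completion model wrapped behind a tutoring-oriented system prompt that discourages handing out complete assignment solutions. This is an illustrative assumption only; the model name, the system prompt, and the helper function tutor_reply are hypothetical and are not taken from the paper.

# Minimal illustrative sketch of an LLM-backed tutoring assistant.
# Assumes the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment;
# nothing here is CodeTutor's actual implementation.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a tutor for an introductory programming course. "
    "Explain syntax, help students debug, and guide them toward solutions "
    "without writing complete assignment answers for them."
)

def tutor_reply(history: list[dict], student_message: str) -> str:
    """Send the running conversation plus the new question to the model."""
    messages = (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": student_message}]
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, not specified by the paper
        messages=messages,
        temperature=0.2,      # keep tutoring answers focused
    )
    return response.choices[0].message.content

# Example turn: a student asks a debugging question.
history: list[dict] = []
question = "Why does my for loop print one extra number?"
answer = tutor_reply(history, question)
history += [
    {"role": "user", "content": question},
    {"role": "assistant", "content": answer},
]
print(answer)

In a real deployment each exchange would also be logged, since analyses like the paper's (prompt quality, and how students' reliance on the tool changed over the semester) depend on having the full conversation history.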


Cited By

  • (2024) Risk management strategy for generative AI in computing education: how to handle the strengths, weaknesses, opportunities, and threats? International Journal of Educational Technology in Higher Education, 21(1). https://doi.org/10.1186/s41239-024-00494-x. Online publication date: 11 December 2024.
  • (2024) Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies. Computers & Education, 105224. https://doi.org/10.1016/j.compedu.2024.105224. Online publication date: December 2024.


    Published In

    L@S '24: Proceedings of the Eleventh ACM Conference on Learning @ Scale
    July 2024
    582 pages
    ISBN:9798400706332
    DOI:10.1145/3657604
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 July 2024


    Author Tags

    1. field study
    2. large language models
    3. tutoring

    Qualifiers

    • Research-article

    Conference

    L@S '24

    Acceptance Rates

    Overall Acceptance Rate 117 of 440 submissions, 27%

    Article Metrics

    • Downloads (Last 12 months)1,075
    • Downloads (Last 6 weeks)433
    Reflects downloads up to 11 Dec 2024

