4.1 Marking Performance
The student and ChatGPT-generated scripts for four modules were marked by ALs, with CS2 and CS3 being independently double marked. The final marks are shown in Table 5 and Figure 1.
In the case of the undergraduate modules (CS1, CS2, and CS3), every ChatGPT-generated script achieved at least a passing grade (>40%). In contrast, every ChatGPT-generated script for CS4 was marked as a “fail” (postgraduate modules at the Open University have a higher pass threshold of 50%).
Beyond a simple pass, the Open University may choose to award a “distinction” grade for especially high-scoring student scripts, typically those scoring greater than 85%. Distinction grades would have been awarded to all of the CS1 ChatGPT-generated scripts. These results indicate that a student wishing to cheat by using ChatGPT to generate the entirety of their assignment could expect to pass these end assessments for CS1, CS2, and CS3.
As a postgraduate module, CS4 not only requires deeper knowledge of topics than the other modules but also expects students to demonstrate greater proficiency in higher-level learning skills such as synthesis and application of knowledge. A key aspect of the module's assessment is the expectation that students will apply (and reference) the theoretical knowledge found in the module materials to an organization of their choosing. This combination of deep subject-specific knowledge, higher-level learning skills, and intense personalization of solutions proved challenging for ChatGPT, which produced superficial and generic answers that were judged unsatisfactory.
A key part of module assessment is the determination of final grades. Recognizing that any form of marking involves subjectivity on the part of the marker, module teams at the Open University may choose to re-mark scripts lying just below grade boundaries to decide whether a higher grade is deserved; scripts close to these boundaries therefore receive greater scrutiny. However, since most ChatGPT-generated scripts received scores well away from grade boundaries, had they been submitted as genuine assignment documents, they would not have been further scrutinized before being awarded a grade. Perhaps the most significant finding is not that ChatGPT behaves as an outstanding student across some of the undergraduate modules, but that it performs consistently as an “adequate undergraduate student”—able to pass assessment without drawing undue attention to itself.
4.1.1 Comparison of the Script Sample against Cohort Norms.
Having established that most ChatGPT-generated scripts received passing marks, it is worth examining how representative the sample scripts—both ChatGPT-generated and student—were of the wider cohort for those module presentations. Table 6 shows that for CS1, CS2, and CS3, the means and standard deviations of the sample marks align broadly with the mean marks for the wider cohort, with the mean ChatGPT-generated and sample student script scores lying within one standard deviation of the cohort mean. For CS1, the ChatGPT-generated scripts outperformed both the cohort and the student sample, while for CS2 and CS3 ChatGPT underperformed relative to both.
Once again, CS4 represents an outlier: not only did ChatGPT perform poorly in the assessment, but the student sample was also unrepresentative of the wider CS4 cohort. This does not affect the validity of the results for the ChatGPT-generated CS4 script documents, but it does prevent broader comparisons of the performance of the scripts against the cohort as a whole.
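To make the comparison reported in Table 6 concrete, the following minimal Python sketch checks whether a sample mean lies within one standard deviation of the cohort mean. The function and the numeric values are illustrative placeholders only; the actual module statistics are those reported in Table 6.

```python
# Sketch of the "within one standard deviation" check used to compare
# sample means against cohort norms. All numbers below are hypothetical
# placeholders, not the study's data (see Table 6 for the real values).

def within_one_sd(sample_mean: float, cohort_mean: float, cohort_sd: float) -> bool:
    """Return True if the sample mean lies within one cohort SD of the cohort mean."""
    return abs(sample_mean - cohort_mean) <= cohort_sd

# Hypothetical example values for one module presentation:
cohort_mean, cohort_sd = 62.0, 14.0
chatgpt_sample_mean = 70.0
student_sample_mean = 58.0

for label, mean in [("ChatGPT sample", chatgpt_sample_mean),
                    ("Student sample", student_sample_mean)]:
    print(f"{label}: within one SD of cohort mean? "
          f"{within_one_sd(mean, cohort_mean, cohort_sd)}")
```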
Table 7 shows the range of marks awarded for ChatGPT-generated and student scripts. In each case, the ChatGPT-generated scripts received a narrower range of marks from the markers than the genuine student scripts. Since each of a module's ChatGPT-generated scripts was generated from identical prompts, there was limited scope for ChatGPT to produce radically different outputs. As previously stated, we did not edit the output of ChatGPT when creating the ChatGPT-generated scripts, so this narrow spread of marks may not be representative of actual cheating behavior, where students may choose to supplement or alter the LLM's outputs.
Again, CS4 is an exception. The extremely large spread in student script marks on the module is largely due to two student scripts being awarded zero by the marker. In one of these cases, the student had erroneously submitted a script for a different piece of assessment; in the other, the researchers believe the script mark was erroneously transcribed as zero and should have been awarded 16%.
4.1.2 ChatGPT Performance on Individual Questions.
We wanted to examine the marks awarded to individual questions to identify strengths and weaknesses in terms of the nature of the topics being assessed. To do this, we categorized the questions by topic, examined the marks, and color-coded each question into five bands:
(1) All scripts received a distinction mark for this question.
(2) A majority of scripts received a distinction mark for this question.
(3) A majority of scripts received passing marks for this question.
(4) A majority of scripts received a fail mark for this question.
(5) All scripts received a fail mark for this question.
Please see Appendix A.1 for a complete breakdown of all questions in CS1, CS2, and CS3. CS4 was not analyzed, as its assessment is a single extended report and therefore is not suitable for a question-level analysis.
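A minimal sketch of how such a banding could be computed is shown below, assuming the undergraduate pass threshold of 40% and the distinction threshold of 85% described earlier; the band labels mirror the list above, and the example marks are purely illustrative rather than study data.

```python
# Sketch of the question-level banding described above, assuming the
# thresholds from Section 4.1 (>40% pass, >85% distinction).
# The example marks are hypothetical placeholders, not study data.

PASS_MARK, DISTINCTION_MARK = 40, 85

def band_for_question(marks):
    """Assign a band to one question given the percentage marks awarded across scripts."""
    n = len(marks)
    distinctions = sum(m > DISTINCTION_MARK for m in marks)
    passes = sum(m > PASS_MARK for m in marks)
    fails = n - passes
    if distinctions == n:
        return "All scripts received a distinction mark"
    if distinctions > n / 2:
        return "A majority of scripts received a distinction mark"
    if fails == n:
        return "All scripts received a fail mark"
    if fails > n / 2:
        return "A majority of scripts received a fail mark"
    return "A majority of scripts received passing marks"

# Hypothetical per-script marks for a single question (not study data):
print(band_for_question([90, 88, 72, 95, 86]))  # -> majority distinction
```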
Based on this analysis, we loosely identified some trends. For CS1, the questions requiring a discussion of program development improvements and the definition questions regarding the usage/vulnerabilities of SQL all received a distinction mark. For the questions focused on security and hashing, which involved stating definitions and applying techniques to simple scenarios, a majority of scripts received a distinction mark. The lowest-scoring questions—receiving passing marks—assessed programming and program development; an essay on social, legal, and ethical issues around digital literacy; and reflection on module performance.
For CS2, performance was mixed across the questions. The essay question on the use of robotics in space exploration, including sourcing examples with references, was answered well, with a majority of scripts receiving a distinction mark. The essay question on operating systems, building on a video about the history of Unix, received passing marks. Broadly, the three short-answer opinion questions on operating systems, robotics, and networking also received passing marks. The personal development planning (PDP) questions were mostly handled poorly, with all of the ePortfolio question responses failing, the majority of the self-reflection questions failing, and the future-planning questions tending to receive passing marks. We did not use one of the optional long-form questions, as it was deemed unlikely that ChatGPT would be able to handle the practical networking activity using a network simulator without a substantial amount of editorial work around the questions. This is consistent with our naïve approach to generating the ChatGPT scripts, as adapting the question would require too much effort for a naïve student.
For CS3, there was a significant and consistent discrepancy between the two markers. That said, all 15 of the short-form questions requiring students to apply their understanding of a key HCI concept—from requirements gathering to design techniques and evaluation—broadly received passing or higher marks from one marker and majority distinction or higher marks from the other. For the long-form scenario question, the first part, covering the development of interview questions, received passing or higher marks. The second part, which required a heuristic evaluation of several interface screenshots, could not be completed through ChatGPT. The third and final part, requiring a redesign of an interface building on the heuristic evaluation, received passing or lower marks.
While it is challenging, for any given question, to identify whether the question format or the topic covered was responsible for ChatGPT's ability (or inability) to generate compelling answers, this analysis demonstrates that, in most cases, across a range of question formats, topics, and study levels, ChatGPT is at least capable of producing adequate scripts.
4.2 ChatGPT-generated Script Identification
One of the most significant observations from the study was the distinction between markers’ ability to recognize suspect scripts and their decisions about which scripts to flag to the university. Although the number of scripts flagged as “suspected of plagiarism” by the markers was small, at interview it was clear that their identification of ChatGPT-generated scripts was more accurate than the flags alone might suggest.
Table 8 shows the plagiarism flags that the markers entered in the formal marking table during the marking exercise. An additional nine scripts were identified during the interview as appearing to contain “unusual or unexpected material” without meeting the bar for plagiarism flagging. This is discussed in depth in Section 4.3.
Of the three flags on student scripts, one relates to a script flagged by the original marker for potential breaches of the university's student code of conduct; this is one of two such cases among the scripts sent to markers. The other two flags referred to scripts in which two of the three short-form questions (each worth 5/100 marks) either relied on rewriting an external source or made too much use of quoted material. Neither of these scripts would likely be referred for an academic conduct investigation, given that the two long questions (worth 30 marks each) contained no academic conduct issues and the scripts scored 18% and 16%, respectively, on TurnItIn. Overall, the ALs were effective in detecting plagiarism in student scripts.
The ChatGPT scripts flagged on the marking table were identified by the markers for the following reasons:
• M1 commented on one ChatGPT-generated script that “Parts of the Questions 6 and 7 appear to be missing” and on a different script that “[this script] and [the other flagged script] are similar for Q. 2(ii) for example.”
• M2 noted on one ChatGPT-generated script: “A mixture of some very good elements, and some less good. Inconsistent across questions. In a TMA, this would raise concerns for me.”
• M3 identified four distinct scripts from CS2, which shared a common, unusual approach to answering the same question, commenting: “Strange angle to answer - would swear it came from somewhere,” particularly in reference to the personal reflection element of the exam. These four scripts were identified as being unusual by M3, but the marker did not feel they met the threshold to raise a plagiarism flag.
From the in-script comments left by these markers, it is clear that in each case the questions raising concerns required students to provide either a personal viewpoint or some self-reflection. Either the answers provided are missing that sense of personal reflection (as with CS1), or a ChatGPT-generated script provides an answer written in the third person that talks about reflection in broad terms and does not relate back to an individual experience. The next section builds on this analysis by considering the data from the interviews conducted with markers, examining their practices regarding academic misconduct more broadly.
At the end of the interview, we informed the markers that five of their scripts had been authored by ChatGPT. None of them were surprised that some of the scripts had been generated by ChatGPT. The markers were then asked to revisit their identification of ChatGPT-generated scripts. All but one of the markers re-identified the scripts they had previously flagged or identified as containing unusual or unexpected material, with the same justification. The two CS2 markers had, between them, flagged all five ChatGPT-generated scripts for that module; at interview, M2 went from having flagged one script during marking to correctly identifying all five. M6 had not flagged any of the scripts for plagiarism in the marking feedback but immediately identified three of the ChatGPT-generated scripts (1, 10, and 11) and, on reflection, identified the other two (5 and 15). M6 also raised flags on some of the student scripts, likely due in part to the selection of student scripts not being reflective of the typical range (see Section 4.1.1). The only CS3 marker we interviewed—M4—responded by discussing suspicious symptoms rather than identifying specific scripts.
Overall, markers’ awareness of scripts containing unusual or unexpected material and their articulation of the characteristics that prompted suspicion were impressive, although their formal flagging of scripts was extremely selective.
4.3 Interview Analysis
We asked the interviewees about their practices regarding academic conduct: how they identified scripts of concern and how they decided whether to flag them. Four of the markers (M1, M2, M3, M4) highlighted a general expectation that plagiarism would be detected by either central members of the module team or the university's anti-plagiarism software. M3 wrote: “you always say to yourself why should I flag it up? Because it goes through three or four pieces of software anyway, and they'll flag it up.” This is a perfectly justifiable position, given a potential lack of understanding of central processes, limited time available for marking, and the range of different academic conduct issues that could be investigated: “there's so much out there I could spend a lot of time looking” (M3).
Markers M3 and M4 both highlighted that they tended to have an extremely high threshold for raising academic conduct concerns, seeing such situations as an opportunity for teaching and developing students. M4 commented: “I haven't flagged it with anybody else unless it's blatant. I tend to actually put it to the students, you need to … put this in as a reference. And this means not just in the references at the bottom, but in-text citations to say that this is where I got this information from, and I tend to flag it to the student in that way.” One of the reasons for this approach was articulated by M3: a wariness of mislabeling student work, due to both the impact on the student and potential consequences for the marker themselves.
Given the close relationship between students and markers, it is perhaps unsurprising—particularly in circumstances where marking time is limited and robust procedures are in place for other module team members to assess misconduct—that markers focus principally on teaching, i.e., on helping students improve their practices, rather than defaulting to disciplinary procedures, unless the misconduct is both blatant and severe. While this is an exemplary practice, it is unclear how well it will serve if easily accessible LLM tools result in increased cheating.
The identification of potential cases of academic misconduct is especially difficult in a distance setting, where “there's a fine line between collaboration, peer learning, and collusion. And that's an interesting challenge” (M2). Given that much of the correspondence and interaction between students, and between students and staff, occurs asynchronously, typically online and often on third-party unmonitored services such as private Facebook groups, it can be challenging to work out where to draw a boundary between acceptably collaborative learning and outright plagiarism. Current AL practice—as noted previously—tends toward providing study skills support rather than activating disciplinary procedures, although this did depend partially on the module level.
The ALs were very clear in distinguishing between aspects of student submissions that served as flags of misconduct and aspects that were the result of typical student behavior.
The most frequently mentioned flag was a change of style between answers, potentially indicative of different authors. This could be differences in the layout of the document itself or changes in the use of language, such as changes in tone or voice, the use of technical vocabulary, the proficiency of writing, or the specificity of the answers given: “obviously if you're reading through something and there are significant language differences between answers to different questions” [M4]. This was particularly the case with changes in the use of technical language: “[where previous answers were] very general and didn't know any technical detail, and suddenly you get an answer that's full of technical detail, and something like that makes you very suspicious” [M3].
This reliance on consistency as a key authenticity indicator led some tutors to gain a sense of the student voice by marking each complete script consecutively rather than marking each question in turn across all scripts. Similarly, as part of their usual marking practice, at least four of the markers would have looked back at students’ previously submitted work, as “you can spot that there's something really going wrong, this student is totally, totally different” [M3]. However, as M2 noted, while consistency matters, sometimes it is hard to identify issues because “they've just answered these questions on different days and didn't proofread.”
M1 noted that many of the stylistic flags were more pronounced in material requiring students to integrate external sources, observing that, particularly at Level 1, “if they're asked to read a paper and they don't actually put things sufficiently into their own words….”
Finally, M6 noted that repetition acted as a significant flag: “it was the repetition that set me off.” Given the long-form essay style of this question, in which a strong narrative element is expected, it is perhaps unsurprising that such a pattern, typified by ChatGPT's often formulaic responses, acts as a clear flag to markers.
The markers also identified student behaviors that, while not flags for misconduct, were baseline behaviors requiring study support. In introductory modules, tutors are more forgiving, as “I think their quality of academic writing tends to be flakier because they just don't have the experience in it” (M2).
M4 identified the key points of assessment, namely: “One is, do you know what you're talking about? Two is can you apply it? Three is can you communicate that information in an effective way?” This was deemed challenging to use as a flag, as students display various sub-optimal behaviors. These include:
• Not answering the question: “‘ohh, it's about this. I'll just tell you everything I know about this.’ Especially in an exam situation; you're more likely to do that rather than stop and think” (M4).
• Not showing work: “if they've done a calculation and they have not bothered to put any working in, and then suddenly there's one with lot of working” (M2).
• Not applying the scenario context to the answer given: “I did find a lot of them weren't really utilising the scenario, the context they were given” (M4).
• Student performance being extremely variable: “the quality of students’ work is often quite variable” (M2).
Recognizing these familiar student behaviors is of limited use as a discriminator, since the marking feedback for the ChatGPT-generated scripts shows similar patterns of behavior. Hence, these behavioral cues cannot be used to distinguish between student scripts and ChatGPT-generated scripts without additional evidence.
The version of ChatGPT used in this study typically generates false references [53]; the research team examined every reference provided in the ChatGPT-generated scripts and found that all of the references included either were artificial or drew on material in the question used in the original prompt. We specifically asked markers about their behavior regarding checking references. Referencing has greater prominence in some modules (e.g., CS4 requires in-depth discussion of student-selected references; CS3 requires no references). Unsurprisingly, ALs pay more or less attention to referencing depending on the module they are marking. Both M4 and M2 noted that they typically scanned the formatting and venue of references for correctness but did not check them for validity, as the time for marking is so tight. M3 was similar but noted that they would follow the references if they suspected plagiarism. Both M1 and M6 said that they would follow references, with the CS4 tutor (M6) noting that, based on the module material, “if you can't find a link to them, you can't identify them, then it's an automatic fail.” This helps account for the low scores received by many of the marked CS4 scripts and for why the fabricated references did not result in the scripts being flagged for plagiarism.