To investigate student programmers’ attention patterns during both facets of code summarization, we designed our experiment to include two conditions: Reading and Writing.
Participants completed both conditions while their visual behavior was recorded with an eye-tracker. In the remainder of this section, we discuss participant recruitment, the study materials used in the task (i.e., Java methods, eye-tracking hardware and software), the task design, and the experimental protocol.
3.2 Study Materials
In this section, we discuss specifics related to the origin of the Java methods as well as the eye-tracking hardware, software, and setup.
Java Methods The Java methods and associated summaries used in this study originate from the publicly available FunCom dataset of 2.1 million Java methods collected from open-source projects [11, 52]. This dataset has been used, filtered, and refined in previous research [10, 39]; we continue this lineage using a sample of 205 methods employed in prior human studies [12, 39]. For this study, we indented and formatted these 205 methods according to Java coding conventions [58]. To fit the screen constraints, we omitted methods that either exceeded 26 lines of code or contained lines of code that wrapped onto the next line. After cleaning based on these criteria, the final dataset consisted of 162 Java methods. Before filtering, the average method length in the dataset was 12.37 lines of code (\(\sigma\) = 4.72), with an average line length of 27.38 characters (\(\sigma\) = 27.25). In terms of cyclomatic complexity, i.e., the number of linearly independent paths through a method, the pre-filtered dataset had an average complexity of 2.53 (\(\sigma\) = 1.59).
After filtering, the average method length in the dataset was 11.72 lines of code (\(\sigma\) = 4.26), with an average line length of 26.52 characters (\(\sigma\) = 26.31). The average method complexity in our final sample was 2.59, with a standard deviation of 1.56. The shortest method was 5 lines of code, whereas the longest was 26 lines of code. The simplest methods had a complexity of 1, and the most complex method in our sample had a complexity of 11. Method summaries ranged from 3 to 13 words long (e.g., “refresh tree panel”), with an average of 8.29 words and a standard deviation of 2.78 words. These methods were randomly assigned to either the Reading condition or the Writing condition, ensuring that each method was used in only one condition (i.e., methods used in the Reading condition were not reused in the Writing condition). With eye-tracking data collection in mind, we ensured that the selection of Java methods followed best practices in software engineering research regarding participant fatigue as well as hardware and software constraints [76].
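For concreteness, the following Python sketch mirrors the screening step described above. The line-wrap threshold (MAX_LINE_CHARS), the function names, and the data layout are illustrative assumptions rather than the study's actual tooling.

```python
# Minimal sketch of the method-screening criteria described above, assuming
# each candidate method is available as a source string. MAX_LINE_CHARS is a
# hypothetical width at which a line would wrap on the study display.
from statistics import mean, pstdev

MAX_LINES = 26        # methods longer than this were omitted
MAX_LINE_CHARS = 80   # assumed on-screen wrap threshold (not stated in the paper)

def fits_screen(source: str) -> bool:
    """True if the method meets the screen constraints used for filtering."""
    lines = source.splitlines()
    return len(lines) <= MAX_LINES and all(len(l) <= MAX_LINE_CHARS for l in lines)

def describe(methods: list[str]) -> dict:
    """Descriptive statistics analogous to those reported for the dataset."""
    loc = [len(m.splitlines()) for m in methods]
    line_lens = [len(l) for m in methods for l in m.splitlines()]
    return {"mean_loc": mean(loc), "sd_loc": pstdev(loc),
            "mean_line_len": mean(line_lens), "sd_line_len": pstdev(line_lens)}

# filtered = [m for m in candidate_methods if fits_screen(m)]
```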
Eye-tracking Eye-tracking data was recorded using a Tobii Pro Fusion eye-tracker mounted on a 24” computer monitor (1920 × 1080 resolution, 60 Hz refresh rate) [2]. This eye-tracker model is accurate to within 0.1 to 0.2 inches (0.26–0.53 cm) on the screen. We developed a task interface, run locally using Python Flask, that presented the Java methods and recorded participant input; an example for both conditions can be seen in Figure 1. To record eye-tracking data, we integrated the Tobii Pro Software Development Kit into the task [1]. The Java methods were presented at font size 14, without syntax highlighting. To improve eye-tracking data quality, participants were asked to wear contact lenses instead of glasses when possible. We ensured that the materials and methodology were consistent across both institutions and followed a script when interacting with participants.
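As a rough illustration of how such an interface can be wired together, the sketch below pairs a small Flask app with the Tobii Pro SDK's Python bindings (the tobii-research package). It is a simplified approximation; the routes, data structures, and logging format are assumptions, not the study's actual implementation.

```python
# Illustrative sketch: a local Flask task interface that logs gaze samples
# via the Tobii Pro SDK Python bindings. Not the study's actual interface.
import time
import tobii_research as tr
from flask import Flask, jsonify

app = Flask(__name__)
gaze_log = []  # (wall-clock timestamp, gaze sample dictionary) tuples

def on_gaze(gaze_data):
    # Called by the SDK for every gaze sample; store it with a timestamp.
    gaze_log.append((time.time(), gaze_data))

eyetracker = tr.find_all_eyetrackers()[0]
eyetracker.subscribe_to(tr.EYETRACKER_GAZE_DATA, on_gaze, as_dictionary=True)

@app.route("/stimulus/<int:index>")
def show_stimulus(index):
    # In the real interface this would render the Java method (and, in the
    # Reading condition, its summary plus Likert questions) as HTML.
    return jsonify({"stimulus": index, "samples_logged": len(gaze_log)})

if __name__ == "__main__":
    app.run(port=5000)
```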
3.3 Task Design
We used a within-subjects experimental design: each participant completed both the Reading and the Writing conditions, but whether a participant started with Reading or Writing was randomized. The entire pool of 162 Java methods was randomly split between Reading and Writing so that participants would see a given method in only one context. For each experimental session (i.e., for each participant), 65 of the 162 Java methods were randomly selected and presented. Methods were presented as stimuli; each stimulus consisted of one method and the summarization task specific to its condition (i.e., rating a pre-written summary in Reading or writing a summary in Writing). Of these 65 stimuli, 40 were presented in Reading and 25 in Writing.
To maximize both the variety and the amount of eye-tracking data collected for our stimuli, we ensured that \(60\%\) of each session's stimuli were seen by all participants, whereas the remaining \(40\%\) were drawn from the larger pool reserved for that condition. Before we began collecting eye-tracking data, we randomized which stimuli comprised the 60% seen by all participants and which made up the larger pool from which the remaining 40% was sampled for each experimental session. During the experimental sessions, the order in which the stimuli were presented was also randomized. In total, 89 Java methods were covered in the Reading condition, of which 24 were seen by everyone. In the Writing condition, 67 Java methods were covered, with 15 of those seen by every participant. Thus, 156 of the 162 methods were covered during data collection.
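The sketch below illustrates this per-session selection in Python: a fixed core set shown to every participant plus a random draw from the condition's reserve pool, with presentation order shuffled. The function and variable names are hypothetical.

```python
# Illustrative per-session stimulus selection: a "core" set (60% of the
# condition's quota) seen by everyone, topped up from that condition's
# reserve pool, then shuffled for presentation.
import random

def build_session(core, reserve, quota):
    """core: stimuli every participant sees; reserve: condition-specific pool."""
    sampled = random.sample(reserve, quota - len(core))
    session = core + sampled
    random.shuffle(session)  # presentation order is randomized
    return session

# Reading: 40 stimuli per session, 24 of them common to all participants.
# Writing: 25 stimuli per session, 15 of them common to all participants.
# reading_session = build_session(reading_core, reading_reserve, quota=40)
# writing_session = build_session(writing_core, writing_reserve, quota=25)
```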
Three breaks were built into the task, both for participants to rest and for the researchers to recalibrate the eye-tracker (for data quality). There was no time limit for breaks. Participants were notified of breaks via “rest” slides built into the interface, placed in the following locations: one halfway through the Reading condition, one between the two conditions, and one halfway through the Writing condition. For example, if participants started with the Writing condition, they would first complete 13 stimuli, take a break, and then finish the remaining 12. Before starting the Reading condition, they would take a second break. In the Reading condition, they would then complete 20 stimuli, take their third break, and finish the remaining 20 stimuli.
In the Reading condition, participants were shown Java methods on the left side of the screen and the corresponding summary in the upper right. Likert-scale questions were located below the pre-written summary. For Writing, a text box for participants’ summaries was located in the upper right of the screen. Example stimuli can be seen in Figure 1. Pre-written summaries in the Reading condition were either human written, taken from the original dataset of Java methods from open-source projects [52], or generated via Deep Learning [11, 39]. We discuss quality control for these summaries below. To ensure that this summarization task was treated as a code comprehension task, participants were given four Likert-scale questions for each stimulus requiring them to closely read both the summary and the code. Questions 1 to 3 were previously validated [39], whereas the fourth was added for the purposes of the current study. The questions were on a scale of 1 to 5, ranging from Strongly Disagree to Strongly Agree, and based on the following criteria:
(1) Whether the summary is accurate
(2) Whether the summary is missing information
(3) Whether the summary contains unnecessary information
(4) Whether the summary is readable
Quality Control The quality of the pre-written summaries could potentially influence programmer attention on the code, so we removed data associated with egregiously low-quality summaries. The summaries were previously validated [11], but we implemented further checks to bolster data quality for the current study. Specifically, using the 1-to-5 scale, we excluded data from summaries that had an average score of 4 or greater on questions (2) and (3) above and an average score of 2 or below on questions (1) and (4). In other words, we removed data associated with a pre-written summary if, on average, participants agreed that it was missing information and contained unnecessary information and disagreed that it was accurate and readable. In total, data associated with 4 pre-written summaries was removed.
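This exclusion rule can be summarized with the sketch below, which assumes per-summary average ratings keyed by question number (1–4); the data layout and function name are illustrative.

```python
# Minimal sketch of the summary-exclusion rule, assuming each pre-written
# summary has an average participant rating per question (1-5 scale).
def is_low_quality(avg_ratings: dict[int, float]) -> bool:
    """Flag a pre-written summary whose average ratings indicate low quality."""
    agreed_flaws = avg_ratings[2] >= 4 and avg_ratings[3] >= 4     # missing / unnecessary info
    rejected_merits = avg_ratings[1] <= 2 and avg_ratings[4] <= 2  # not accurate / not readable
    return agreed_flaws and rejected_merits

# Example with hypothetical averages: this summary would be excluded.
print(is_low_quality({1: 1.8, 2: 4.3, 3: 4.0, 4: 2.0}))  # True
```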
Likewise, participant-written summaries that do not match the code indicate poor comprehension. Here, we assume that participants formed mental models of the code based on the information they saw; if a participant’s understanding of a method was malformed, this may be reflected in their eye-tracking data as well. While we wanted a variety of comprehension types and skill levels within the dataset, we also sought to ensure that the eye-tracking data represents a base level of comprehension. For example, we excluded eye-tracking data from a summary that included “to be honest, I am not entirely sure what this function is doing.” Therefore, two of the authors of this article rated and subsequently filtered participant summaries using the same Likert-scale questions mentioned above [25]. The two researchers (8 and 5 years of Java experience, respectively) first rated participants’ summaries independently. The researchers then resolved any rating conflicts together (i.e., a valence mismatch such as Agree/Disagree, Strongly Disagree/Agree, or Neutral/Agree), obtaining an Inter-Rater Reliability (IRR) of 1 [53]. Using these ratings, eye-tracking data associated with 5 participant summaries was excluded. Informally, if we consider one participant's eye-tracking data for one stimulus as a single data point, the final dataset contained 996 samples for the Reading condition and 661 samples for the Writing condition. We discuss how this eye-tracking data was analyzed to compare both forms of code comprehension in Section 4.
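To make the rating-reconciliation step concrete, the sketch below maps each 1–5 rating to a valence, flags valence mismatches for joint resolution, and computes chance-corrected agreement. Cohen's kappa (via scikit-learn) is used here only as an example statistic; the paper does not specify which IRR measure underlies the reported value of 1, and the ratings shown are hypothetical.

```python
# Hedged sketch of the two-rater conflict check: detect valence mismatches
# (to be resolved jointly) and measure agreement on the valence categories.
from sklearn.metrics import cohen_kappa_score

def valence(score: int) -> str:
    return "disagree" if score <= 2 else "neutral" if score == 3 else "agree"

def conflicts(rater_a: list[int], rater_b: list[int]) -> list[int]:
    """Indices where the two raters' valences differ and need joint resolution."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if valence(a) != valence(b)]

# Hypothetical ratings for five participant summaries:
rater_a = [5, 4, 2, 3, 5]
rater_b = [4, 4, 1, 5, 5]
print(conflicts(rater_a, rater_b))                       # [3] -> resolved together
print(cohen_kappa_score([valence(s) for s in rater_a],
                        [valence(s) for s in rater_b]))  # agreement before resolution
```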
Statistical Power Using the effect sizes reported in previous research as a guideline [78], we evaluated the statistical power of the data collected in this study. Sharif et al. conducted a study with 15 participants, composed of undergraduate and graduate students as well as two faculty members. That study used a within-subjects design and reported small to moderate Cohen's d effect sizes, with a minimum of 0.15, a maximum of 0.57, and an average effect size of 0.27. In this study, we used a within-subjects design for comparing Reading and Writing, and a between-subjects design for comparisons based on demographics. Using G*Power, we calculated that we would need 150 total data points when comparing Reading and Writing to obtain sufficient statistical power to detect the average effect size of 0.27 with an alpha of 0.05 [32]. In other words, we would need at least 75 samples in each condition. As previously mentioned, we obtained 996 samples for the Reading condition and 661 samples for the Writing condition.
For analyzing differences between groups based on demographic factors, we would need between 68 samples (d = 0.57) and 963 samples (d = 0.15) in each group, or 298 samples in each group for the average effect size of 0.27. In this study, we compared participants based on years of experience, gender, and native language. For expertise, we did not include all participants’ data in our analyses; instead, we split the participants into three groups and compared only the participants with the least experience against those with the most. Based on our sample of students, we considered participants to be novices if they had 4 years of experience or less (n = 10), and experts if they had more than 7 years of experience (n = 9). We excluded data from the middle group in this comparison to yield a starker contrast between experts and novices. For the Writing condition, we obtained 215 and 242 samples from experts and novices, respectively. In the Reading condition, we collected 336 and 370 samples from experts and novices, respectively.
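The sketch below re-creates these sample-size calculations with statsmodels instead of G*Power. Because the paper reports the alpha level and effect sizes but not the power target, power = 0.9 is an assumption; with it, the outputs fall in the neighborhood of the figures reported above.

```python
# Hedged re-creation of the G*Power sample-size calculations. ALPHA comes
# from the paper; POWER = 0.9 is an assumption for illustration.
from statsmodels.stats.power import TTestPower, TTestIndPower

ALPHA, POWER = 0.05, 0.9

# Within-subjects comparison (Reading vs. Writing): paired samples needed to
# detect the average effect size d = 0.27.
print(TTestPower().solve_power(effect_size=0.27, alpha=ALPHA, power=POWER))

# Between-subjects comparisons (demographic groups): samples needed per group
# for the largest, average, and smallest effect sizes from Sharif et al.
for d in (0.57, 0.27, 0.15):
    print(d, TTestIndPower().solve_power(effect_size=d, alpha=ALPHA, power=POWER))
```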
Our sample is sufficiently powered for comparisons based on gender and native language, but it is not ideal due to other characteristics of the dataset. For gender, only one woman in our sample is in the expert group and, with respect to native language, only two native English speakers are in the expert group. To ensure that these comparisons were not biased towards experts, we excluded all experts from these analyses and compared student programmers based on gender and native language. We collected 270 samples from women (n = 7) in the Reading condition and 174 samples in the Writing condition. From men (n = 11), we collected 390 samples in the Reading condition and 272 samples in the Writing condition. Next, we compared native English speakers (n = 12) and non-native English speakers (n = 6). From the former, we collected 460 samples for Reading and 295 for Writing. From the latter, we collected 200 samples for Reading and 151 for Writing. As an additional consideration, the remaining non-native English speakers represent 5 other native languages, which may introduce additional variables into the comparison. We nonetheless analyzed these factors to explore potential differences and understand their influence on the larger dataset. We present preliminary analyses based on gender and native language in Section 5.4.
3.4 Experimental Protocol
Programmers were recruited to take part in the experiment via in-class presentations, flyers, forum posts, and mailing list advertisements. Individuals who contacted the researchers completed the consenting and prescreening processes electronically (the experimental session itself was in person). After programmers gave their consent, they completed a prescreening questionnaire to ensure that they were eligible for the study. Individuals were eligible if they were at least 18 years old, had taken Data Structures or an equivalent course, had at least 1 year of Java experience, and had no history of epileptic seizures [2]. In addition, following previous work [11], we gave programmers a prescreening question to test their basic Java understanding: we asked them to briefly describe the purpose of an obfuscated Java method (an in-order tree traversal). All participants included in the current study met our eligibility criteria and wrote accurate descriptions.
Participants completed the summarization tasks in person, in an office with natural lighting. At the beginning of each experimental session, participants completed a pre-task survey containing questions related to age, gender, native language, and classes taken. The pre-task survey was limited in scope to reduce any priming effects [56]. The researcher then gave participants scripted instructions for the experiment and calibrated the eye-tracker using the Tobii Pro Eye Tracker Manager [2]. The eye-tracker itself may have introduced observer bias, as participants might change their behavior knowing their gaze was being recorded [84]. While a certain amount of experimental bias is unavoidable [21], we attempted to minimize observer bias by leaving the room while participants completed the task. Participants were instructed to alert the researcher once they reached the breaks (Section 3.3). After each break, the researcher recalibrated the eye-tracker.
After finishing the experimental session, participants completed a post-task survey. The post-task survey asked participants about their preferred coding languages, coding experience, and personal criteria for high- and low-quality summaries. Experimental sessions lasted about 90 minutes. The summarization tasks alone (i.e., Reading, Writing, breaks) lasted roughly 75 minutes.