The objective of this study is to assess how a student programmer approaches understanding two different programming languages, C++ and Python, across two different task types: bug fixing and new feature addition. Each student saw both C++ and Python code for the tasks. Eye movements were recorded during the entire study to objectively determine what students were looking at as they performed the tasks. The tasks themselves are not directly comparable, as we wanted to avoid any learning effects; however, they do use similar semantic constructs, as shown in Section
4.3. In this section, we present the participant demographics, sampling procedures, tasks, stimuli in the experimental design, eye-tracking hardware used, the terminology used, and the tools we used to collect measures to answer our research questions. We followed the practical guide on conducting eye-tracking experiments while designing the experiment [
61].
4.1 Participant Characteristics
The participants were mainly students from a large Midwestern university in the United States. Fourteen volunteers participated in the study.
Table
1 shows a summary of the demographic information collected from the participants. Eleven participants were male and three were female. Eight participants were between 18 and 24 years old, two were between 25 and 34 years old, one was between 35 and 44 years old, and three were over 45 years old. There were six undergraduate students, four graduate students, and four non-students (who had just graduated) among the participants. Nine participants had no industry experience, and five indicated that they had industry experience. NetBeans was the most used IDE among the participants, with six participants choosing it as one of the IDEs they use for programming. Eclipse and Visual Studio were the next most popular choices, appearing in the participants' answers five and four times, respectively.
We asked the participants to self-report their programming skills and experience levels. Siegmund et al. [
68] state that self-estimation is a reliable measurement of programming skills and experience. Eight participants rated their design skills as average, five rated them as above average/good, and one rated them as excellent. Six participants rated their programming skills as average, six rated them as above average/good, and two rated their skills as excellent. As for programming-language-specific questions, five participants ranked their C++ skills as beginner level, five ranked their skills as intermediate, one ranked their skills as average, and three ranked their skills as advanced. Five participants had between 1 and 2 years of experience in C++ programming, three participants had between 3 and 5 years of experience, three participants had between 6 and 10 years of experience, two participants had more than 10 years of experience, and one participant had no experience in C++ programming. Subsequent questions were about the participants' skills in Python. Five participants said that they did not know Python. One ranked their skills as beginner level, five ranked their skills as intermediate, and three ranked their skills as advanced. Five participants had no experience in Python programming, five participants had between 1 and 2 years of experience, three participants had between 3 and 5 years of experience, and one participant had between 6 and 10 years of experience. Finally, the participants were asked to list the languages they could program in. Java was mentioned nine times, with C++, C, and Python as the next most mentioned languages, respectively.
4.3 Conditions and Design
The four different combinations of programming language and task types used in this experiment are listed in Table
2. A high-level description of each program and the programming constructs present in it is also listed. Two tasks were presented in Python, and two were presented in C++. From each language category, one task was a bug fixing task, and the other was a feature addition task.
For the bug fixing tasks, we asked the participants to find the bug located in the program, write the line number they thought contained the bug, and attempt to fix the bug. They were also given the expected input and output of the program. For the feature addition tasks, we gave the participants a description of the program’s current capabilities and a description of an additional feature that they had to implement. Figure
1 shows Stimulus 1 and Stimulus 4. A complete replication package with all the tasks, programs, and eye movement data is available at Reference [
40].
Participants were given all four tasks in a randomly generated order. They had access to the source code in Geany,
2 the console output of the program, and the requirements of the task. Requirements included the expected input and output for the bug fixing tasks and the additional feature that needed to be implemented for the feature addition tasks. Figure
2 shows an image of the screen setup and these three areas. A trial task was given to familiarize participants with the IDE setup and give them a chance to ask questions. We did not collect eye-tracking data for the trial task.
We now provide some rationale for why we chose these two types of tasks (new feature and bug fix). As developers, we perform a variety of tasks on a daily basis, such as bug fixing, feature addition, refactoring, code review, testing, reading requirements, reading to comprehend code, summarizing code, and many more. Almost all eye-tracking studies in program comprehension involve participants summarizing Java code; very few involve fixing bugs, and none involve adding new features. Moreover, with few exceptions, these studies use short code snippets written in Java. Besides Turner et al. [
75], there are no published studies on eye movements over Python code that we are aware of. Abid et al. have shown that results derived from short code snippets are not always consistent with those obtained when developers work with realistic programs within an IDE [
2]. To bridge this gap, we chose two of the activities we believe developers spend a lot of time on: fixing bugs and adding new features. In the future, we will add more task categories as provided by Murphy et al. [
44].
The goal of this article was to see how participants fare on different types of tasks. The tasks belong to different categories and should not be considered directly comparable; rather, the goal was to see how the same individual's eye movements differed between the task types. The two tasks chosen are representative of what software developers typically do, i.e., fix bugs and implement new features, as also evidenced by many issue-tracking systems in open source projects.
Our underlying assumption (based on theoretical frameworks such as Reference [
18], which looked at different tasks, albeit without eye tracking) is that bug fixing and new feature tasks require different levels of comprehension and problem-solving skills. For bug fixes, developers generally start with the bug report and/or expected input/output and trace backward, via stack traces or some other tracing method, to find the line containing the bug. With new feature tasks, developers do less tracing, since they are implementing forward based on the requirements they read and what the expected feature should do. For these reasons, we believe that solving these tasks would generate different user behaviors.
For the bug fixing tasks, the requirements were somewhat comparable: one task reverses a word or phrase (C++) and the other creates a palindrome (Python). The new feature tasks, however, were slightly different, although they used similar constructs, as listed in Table
2. Since the study design is within subjects, giving very similar programs across languages would cause learning effects that we wanted to avoid. Since we recruited our participants from a Python class, we also wanted to make sure we chose stimuli with concepts already taught in the class. We asked the instructor for their syllabus and weekly schedule to ensure we used programs and concepts that were known to the students. We did not use any code from the class itself verbatim. We were not instructors for the course.
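To make the two task types concrete, consider the following hypothetical snippet. It is not one of the actual study stimuli (all stimuli are available in the replication package [40]); it merely sketches the flavor of a bug fixing stimulus in Python, where the participant receives the expected input and output and must locate and fix a one-line bug.

# Hypothetical illustration only; not an actual study stimulus.
# Expected behavior: make_palindrome("abc") should return "abcba".
def make_palindrome(word):
    # Intended: append the reverse of the word minus its last character.
    return word + word[::-1]   # BUG: should be word + word[-2::-1]

print(make_palindrome("abc"))  # Buggy output: "abccba" instead of "abcba"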
Note that the goal of this article was not to do a side-by-side comparison of the same task in C++ versus Python. Instead, it was to see how each participant understood C++ versus Python in two task categories. To account for the difference in lines of code across the tasks, we normalize our fixations per character; otherwise, longer programs would always have more fixations, as there is more to read (see Section
4.6 for more details on normalization). For future work, we plan to evaluate different program complexities [
3,
21] within each task category. However, task complexity was beyond the scope of this article.
After each task, we asked the participants to rate the difficulty of the tasks, with the options: “Easy,” “Average,” and “Difficult.” For the statistical analysis, we assigned the numbers 1, 2, and 3 to these choices, respectively. We also asked the participants to rate their confidence level about each task, with the options “Not Confident,” “Somewhat Not Confident,” “Somewhat Confident,” and “Very Confident.” For the statistical analysis, we mapped these choices to the numbers 1, 2, 3, and 4, respectively.
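As a minimal illustration of this encoding (the dictionary and function names below are ours, not part of the study materials), the mapping can be written as:

# Sketch of the response-to-number mapping described above; names are illustrative only.
DIFFICULTY_SCORES = {"Easy": 1, "Average": 2, "Difficult": 3}
CONFIDENCE_SCORES = {
    "Not Confident": 1,
    "Somewhat Not Confident": 2,
    "Somewhat Confident": 3,
    "Very Confident": 4,
}

def encode_ratings(difficulty, confidence):
    """Convert a participant's post-task ratings to numeric codes."""
    return DIFFICULTY_SCORES[difficulty], CONFIDENCE_SCORES[confidence]

# Example: ("Average", "Somewhat Confident") is encoded as (2, 3).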
4.4 Terminology
We define the basic terminology used throughout the article to give the reader context for our study.
Program comprehension is a sub-field of software engineering/computer education that deals with a user building a mental representation (albeit subjective) of the code while solving a task.
Task Type refers to the various possible types of tasks a developer (in this case, a student) may engage in. Possibilities could be bug fixing, new feature addition, refactoring, code review, testing, and so on. In this article, we only evaluate two task types (bug fix and new feature addition).
Task refers to the actual set of artifacts that falls into the specific task category. For a bug fix task, this would be the code in the IDE, the program requirements, and expected input and output of the program. For the new feature task it would be the starter code in the IDE, current description of the program, and a description of the additional feature to be implemented. In both cases, the console output was also available to the participant. The participant is expected to engage with these artifacts to produce a result. In the case of the bug fixing task, the result would be the line that had the bug and a fix for the bug. For the new feature addition tasks, the result would be the newly written code that implements the new feature.
Bug localization (in our study) refers to the point at which participants read the line containing the bug, prior to making any edits, but have not yet fixed it.
Stimuli is eye-tracking terminology and simply refers to anything on the screen that is tracked by the eye tracker. In our case, the Geany IDE was the main stimulus, containing within it all the artifacts that the participant saw.
Chunks refer to a line or set of contiguous lines of code with a specific logical and semantic meaning. We also refer to them as beacons [
6,
80].
Areas of Interest (AOI) refer to parts of the stimulus on which eye-tracking metrics are recorded. Examples could be chunks in the code editor, the requirements area of the IDE, or the console output. The AOI is usually defined by the researcher.
Fixation is the stabilization of the eyes on an object of interest for a certain period of time. Fixations are made up of multiple raw gazes and have a duration associated with them, which we refer to as the fixation duration. Most processing happens during fixations, which is why they are a standard measure in most eye-tracking studies.
Scanpath refers to the directed path formed by saccades between fixations. It determines how the eye navigates across the stimuli.
Reading behavior refers to the percentages of fixations that appear on the various AOIs in question.
Navigation behavior refers to the scanpath on source code over time.
Editing refers to the act of modifying the code to fix a bug or implement a new feature.
4.6 Measures
The measures used in this experiment are based on best-practice guidelines reported in the fields of program comprehension, software engineering, and eye tracking [
61]. We direct the reader to Duchowski et al. [
20] for a detailed theoretical description of all eye-tracking measures. Table
3 describes the metrics used to compare participants' behavior while working on the four tasks. We specify the research questions, the metrics used to answer them, and the definitions of the metrics. We chose fixation-based metrics, a group of metrics used in eye-tracking studies in software engineering [60, 61] to measure visual effort. In prior studies, areas of interest with higher fixation counts and durations are believed to have attracted more visual attention or to have required more effort to understand [61, 63, 64]. We calculated the total fixation count and duration over the given tasks, per character, and on specific AOIs. We also calculated the mean fixation duration during each task. The fixation count and duration serve as measures of visual effort when comparing the different types of tasks (bug fixing and feature addition) and the different programming languages (Python and C++) in our research questions.
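To make these fixation-based metrics concrete, the following sketch (our own illustration; the field names, coordinates, and rectangle-based AOI mapping are assumptions, not the study's actual analysis pipeline) computes the fixation count, total fixation duration, and mean fixation duration per AOI from a small list of fixations:

# Illustrative sketch only: aggregates fixation metrics per AOI.
from collections import defaultdict

# Each fixation: (x, y) screen position in pixels and a duration in ms.
fixations = [
    {"x": 412, "y": 310, "duration_ms": 220},
    {"x": 430, "y": 316, "duration_ms": 180},
    {"x": 1210, "y": 640, "duration_ms": 140},
]

# AOIs defined by the researcher as named screen rectangles
# (left, top, right, bottom), e.g., code editor or console output.
aois = {
    "code_editor": (0, 0, 1000, 800),
    "console_output": (1000, 500, 1600, 800),
}

def aoi_of(fix):
    """Return the name of the AOI containing this fixation, if any."""
    for name, (left, top, right, bottom) in aois.items():
        if left <= fix["x"] <= right and top <= fix["y"] <= bottom:
            return name
    return None

counts = defaultdict(int)
durations = defaultdict(float)
for fix in fixations:
    name = aoi_of(fix)
    if name is not None:
        counts[name] += 1
        durations[name] += fix["duration_ms"]

for name in counts:
    mean = durations[name] / counts[name]
    print(f"{name}: count={counts[name]}, "
          f"total={durations[name]:.0f} ms, mean={mean:.0f} ms")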
Next, we explain why we normalize fixation count and duration per character. To compare eye-tracking data across the different programming languages, we first need to normalize the data. C++ and Python have very different semantic structures and require different numbers of lines of code; there might not be a direct equivalent construct between the languages. To account for this difference, we normalize our fixations, because longer programs will naturally have more fixations. We do so by dividing the fixation metric on a token by the number of characters in the token. So if the total fixation duration is 400
\(\text{ms}\) on a token of four characters, the normalized total time is 100
\(\text{ms}\). This has been done in prior work as well by Madi et al. [
38] and Abid et al. [
2].
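Expressed as a formula (the notation is ours), the per-character normalization for a token \(t\) is
\[ d_{\text{norm}}(t) = \frac{D(t)}{|t|}, \]
where \(D(t)\) is the total fixation duration on \(t\) and \(|t|\) is the token's length in characters; fixation counts are normalized analogously.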