1. Introduction
In research on ergonomics in general and human–machine interactions in particular, but also in many clinical or educational settings, technical applications aim to track humans’ current processing demands under varying, real-world conditions. Examples are the presentation of information “just in time” (e.g., when users have a higher likelihood of processing the information), warning the user via visual signals in cases where cognitive load reaches levels that significantly increase the chances of missing crucial information, or monitoring an audience in an educational context to gauge the speed or rate of information presented and processed (cf. [
5]). Cognitive demands or load and its tracking are therefore a major topic in medical education and applied clinical settings [
6]. Clinical interviews, for example, are often conducted under highly distracting conditions (e.g., in an emergency room) and, yet, the correct processing of the information presented is of the highest relevance in this situation (e.g., to decide if a patient needs to be admitted to the hospital, yes or no) [
7]. The problem is aggravated by the fact that self-report measures of cognitive load or demands do not necessarily converge. For instance, Naismith et al. [
8] used different self-report measures of cognitive load in simulation-based medical training and found little convergence between these measures. For these reasons, researchers seek to identify unobtrusive, pervasive, and affordable technology to track processing demands or cognitive load in areas such as the workspace or educational institutions [
9]. One such technology is eye-tracking (e.g., [
12]). Here, one dependent variable suited to track processing demands or cognitive load is pupil diameter (cf. [
15]). Higher cognitive load elicits a pupil dilation and, accordingly, research has indicated that measured pupil diameter can be used to successfully discriminate between more or less demanding conditions or conditions requiring more or less mental effort (e.g., [
However, a general problem that complicates the read-out of processing demands or cognitive load from measured pupil sizes is that pupillary dilation is a relatively unspecific response elicited by a variety of different conditions, not all of which are diagnostic of cognitive load (e.g., [
20]). Among the critical triggers of pupillary responses are changes in the ensuing luminance level (e.g., [
22]), saccades ([
24]), felt emotions (e.g., [
27]), value (e.g., [
29]), arousal (e.g., [
31]), psychoactive substances (e.g., [
32]), and pathological conditions on the side of the observer (e.g., [
33]). Thus, many factors that independently modulate the pupillary response pose a threat to the successful measurement of cognitive load from pupil dilations. For example, if an emotion, such as joy or anger, elicits a pupil dilation, there is potentially less room for a load-specific response on top of the one triggered by the emotion [
35]. This can be a problem, as the valence of the processed content can vary independently of cognitive load. To understand this, think of a clinical interview (e.g., [
37]). In such a situation, the demands imposed on the interviewer can vary depending on whether the interviewer has to encode only verbal responses of the patient or whether she also pursues additional aims, such as evaluating emotional content by classifying nonverbal expressions, such as facial expressions and gestures (cf. [
40]). At the same time, in this situation, the emotional content of the answers may vary (cf. [
41]). Sometimes the interviewee might relate emotionally neutral content such as her age or occupation, but, at other times, the interviewee might also talk about personal traumatic or stressful past experiences. The emotional valence of the responses can be entirely independent of the cognitive demands imposed by the task (for example, whether the interviewer processes only verbal or both verbal and nonverbal communication). Thus, in such a situation, the sensitivity of pupillary responses as a measure of the cognitive load imposed on the interviewer may be reduced: If an emotional response already triggers pupil dilation, there might be less room for an additional pupil response to the higher demands. As a consequence, it might be difficult to identify questions or phases of the interview where difficulty increases and where questions possibly have to be rephrased.
Can load be assessed through pupillary responses in all such cases? It should be noted that emotional effects can impact pupil sizes by effects that do not correspond to a cognitive-load effect itself. It is true that, for example, some emotional processing such as an observer’s anxiety may draw on her or his limited capacity and, thus, impose a cognitive load itself [
42]. In such cases, emotional and load effects converge to increase load, and there is no need to tell the two apart and to derive a pure measure of load. However, other emotional content besides one’s own felt stress or anxiety can also elicit pupillary changes, but these are load-independent effects. In these cases, to make use of a pupil-based load measure, it is important to tell apart which pupil-size changes are true reflections of cognitive load and which are due to emotions. For example, it would be important to know if the emotion-specific but load-independent pupillary effect limits the sensitivity of the pupil’s response to an additional load effect. It is exactly such a situation, where emotion and load lead to independent effects on pupil size, that we studied in the present research. We did this with a clinical interview in which patients—in fact, a confederate of the experimenter—voiced more neutral or more negative content. Here, the more negative content is independent of the cognitive load per se, and, thus, has the chance to compromise our cognitive-load measure by load-independent means.
Because clinical interviews are common practice and an important tool in diagnosis, methods to understand and improve their application are needed [
44]. Therefore, the present study investigated exactly the question of whether varying processing demands can be successfully and unobtrusively tracked by measurements of pupil dilation under conditions of varying emotional content of the answers of the interviewee. To this end, the current study used four different groups varying with respect to the processing demands or cognitive loads on the one hand, and independently of the load with respect to the emotional content of the answers, on the other hand. All participants acted as interviewers conducting an interview with a confederate of the researchers who gave prepared verbal answers and displayed generic facial expressions and gestures. Participants in two of the groups had to monitor the verbal content of the answers of the interviewee only, both for current classification of the responses in the course of the interview (by button presses) and for later recall (in a memory test). These were the low-load or low-demand (L) conditions. Participants in the other two groups had to monitor both the verbal and nonverbal content of the interviewee, again for current classification of the responses during the interview and for later recall. These were the high-load or high-demand (H) conditions. In each of the two load conditions, one group of participants heard more neutral, less emotional answers (NEU), while the other group heard less neutral, more negative (NEG), and, thus, more emotional answers. Thus, there were four groups (NEUL, NEUH, NEGL, and NEGH) altogether.
To track the processing demands, we measured the pupillary responses of the interviewers with an eye-tracker and analyzed the data conventionally, and also used an algorithm that automatically subtracts light-elicited pupillary responses [
19]. This algorithmic measure of cognitive load could be used online, with little temporal delay between recording and the resulting identification of the cognitive load experienced. Yet, it still requires external validation by comparison with a more conventional pupil-size measure. This was what we did in the current study: We compared how the algorithm fared in comparison with a standard measure of pupil dilation under the relatively uncontrolled conditions of an interview. In addition, we took various measures of objective task performance, such as the interviewer’s correct classifications of both the verbal (in L and H groups) and the nonverbal responses (H groups only) during the interview, and of the interviewer’s performance in a subsequent memory test about the content of the interview. We also collected several measures of the interviewers’ self-assessment regarding both the emotions experienced during the interview and the felt subjective demands imposed by the interview and tasks.
The following were our hypotheses:
Hypothesis 1. If the algorithm for automatic extraction of the cognitive load through pupillary responses serves its purpose, we expected to see the above-chance classification of our participants into those that experienced higher processing demand versus those that experienced lower processing demand. If external validity is granted, we also expected a good correspondence between the algorithmic measure and the more conventional measurements of cognitive load based on subtracting pupillary light responses as a baseline from conditions in which the same pupillary light response is likely elicited but on top of this cognitive load triggers a pupil dilation.
Hypothesis 2. However, because emotions can elicit pupil dilation, just as the task demands, we also expected this classification to work better under conditions with neutral rather than emotional (here, negative) response content. As a ground truth, we also used raw pupil size measured by the eye-tracker itself, without the preprocessing steps taken by the algorithm such as its automatic modeling of an impulse corresponding to the pupillary light response. To this end, raw pupil sizes during a baseline phase of the interview were subtracted from raw pupil sizes measured during the manipulated parts of the interviews (“during manipulation” in short) that varied according to response content (neutral vs. negative) and task or processing demands (low vs. high). The resulting differences were analyzed with a 2 × 2 analysis of variance (ANOVA), with between-participant variables processing demand (low vs. high) and verbal content (neutral vs. negative).
Hypothesis 3. Here, we expected a main effect of task load/processing demand, with larger pupil sizes under high- than low-load conditions (cf. [34]) and possibly a main effect of emotion, with larger pupil sizes under negative than under neutral-content conditions (e.g., [25]; but see [35] for an example of failure to observe emotion-elicited pupil dilations with verbal material of varying task difficulty).
Hypothesis 4. In addition, we expected to see an interaction, for example, a compromised cognitive-load effect under emotional conditions compared to neutral conditions if the emotion-elicited pupil dilation can mask or compromise the processing demands.
Hypothesis 5. Finally, we looked into the correlation between the algorithmic cognitive-load measure and the ground truth and expected a significant and substantial correlation between the measures, corresponding to external validation of the algorithmic cognitive-load measure.
Besides these most important analyses, we also analyzed objective task performance and subjective assessments, where we would have expected better objective performance under low than high task-load conditions, lower interviewers’ self-assessed effort and demands under low- than under high-load conditions, or more positive ratings under neutral- than under negative-content conditions. In addition, to study the proper functioning of the automatic extraction of the light-elicited pupil response from the algorithmically extracted cognitive-load measure, we correlated these two values and expected no correlation between the two measures. In contrast, if the light-elicited pupillary response can limit the available room for a cognitive-load response, we might see a negative correlation between the two measures, meaning on average fewer pupillary responses to higher demands under darker than under lighter conditions.
2. Methods
2.1. Participants
Forty psychology students from the University of Vienna participated in the experiment (27 female, 13 male, mean age = 22.2 years, range 19–29 years). The sample size was based on an a priori power calculation assuming a large effect size (η2 = 0.20), striving for a statistical power of 0.8, and allowing an alpha error of 0.05. Participants were randomly assigned to one of four groups: neutral content/low workload (NEUL; nine participants), neutral content/high workload (NEUH; nine participants), negative content/low workload (NEGL; nine participants), and negative content/high workload (NEGH; nine participants). Four participants had to be excluded from the study: two due to faulty datasets, in which data from the eye-tracker were missing and could not be evaluated correctly; one participant in the NEUH group, who did not perform the Verbal–Visual Task; and one participant in the NEGH group, due to a very small pupil size at the start. Each participant received partial course credit and had normal or corrected-to-normal visual acuity. Prior to the experiment, informed consent was obtained from all participants. Ethical approval (No. EK_00617) was obtained from the University of Vienna’s Ethical Review Board.
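As an illustration of the effect size entering the power calculation, the following minimal sketch converts η2 into Cohen's f, the metric in which power software commonly parameterizes ANOVA effects. The function name is hypothetical, and the text does not state which tool or conversion was actually used.

```python
import math

def eta_squared_to_cohens_f(eta_sq: float) -> float:
    """Convert eta squared to Cohen's f via f = sqrt(eta^2 / (1 - eta^2))."""
    if not 0.0 <= eta_sq < 1.0:
        raise ValueError("eta squared must lie in [0, 1)")
    return math.sqrt(eta_sq / (1.0 - eta_sq))

# Effect size assumed in the a priori power calculation.
f = eta_squared_to_cohens_f(0.20)
print(round(f, 3))  # 0.5
```

By Cohen's conventional benchmarks (f = 0.10 small, 0.25 medium, 0.40 large), this value lies in the large range.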
2.2. Procedure and Task
Participants wore a head-mounted eye-tracker and were asked to conduct an adapted clinical interview with another person and to additionally ask about the interviewee’s experiences with health issues in her family. The interview was conducted via an online video-call using BigBlueButton (2021). Participants were told to read and ask 67 questions (see Supplementary Tables S1–S5) in sequence and to wait until the interviewee had responded to each question before asking the next question.
Unbeknownst to the participants, the person whom they were interviewing (i.e., the interviewee) was a trained confederate of the experimenters, who had been informed about the study’s aim beforehand. The confederate answered the questions of the participant in a fixed, prescribed manner, either in an on average more negative (e.g., “My mother died after she was diagnosed with corona”) or a more neutral (e.g., “My mother was quarantined after she was diagnosed with corona”) manner. In addition, the confederate also used pre-specified and practiced emotional facial expressions and gestures (e.g., frowning and/or lowering her gaze) during her responses.
Out of the 67 questions, 7 were pre-experimental information questions and another 11 were used as baseline questions. The interviewee’s answers to these pre-experimental and baseline questions were the same, regardless of which emotional content group participants were assigned to (i.e., regardless of the manipulation). The participants’ pupil responses to the baseline questions served as a reference for comparison with the pupil responses in the later, manipulated part of the experiment (see below), as both the content of the questions and of the answers during baseline and manipulation touched upon less socially desirable personal information about the behavior and/or the experiences of the interviewee. After the baseline questions, participants were instructed to ask another 49 questions, the answers to which were manipulated and depended on the group the participants were assigned to. This part of the interview was labeled the “manipulation”. In the neutral emotional content condition (low and high cognitive load; NEUL and NEUH), the interviewee answered the participant’s questions in a neutral manner (see
Supplementary Table S4). In contrast, in the negative emotional content condition (low and high cognitive load; NEGL and NEGH), the interviewee answered participants’ questions in a more negative manner (see
Supplementary Table S5). The difference between pupil sizes during baseline and manipulation was the conventional measure of demand-elicited pupillary changes during the manipulation, as the lighting conditions during baseline and manipulation were the same and the two phases of the experiment differed only in terms of the workload and the emotional content of the answers. The interview itself lasted between 25 and 30 min.
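The conventional measure described above can be sketched in a few lines. This is a minimal illustration under assumptions: the function name and the sample values are hypothetical, and each list is taken to hold one participant's mean pupil diameter (in pixels) per answer.

```python
from statistics import mean

def baseline_corrected_change(baseline_px, manipulation_px):
    """Mean pupil diameter during manipulation minus mean during baseline (pixels)."""
    if not baseline_px or not manipulation_px:
        raise ValueError("both phases need at least one sample")
    return mean(manipulation_px) - mean(baseline_px)

# Hypothetical per-answer mean diameters (pixels) for one participant.
baseline = [40.0, 41.0, 39.0]      # answers to baseline questions
manipulation = [43.0, 44.0, 42.0]  # answers to manipulated questions
delta = baseline_corrected_change(baseline, manipulation)
print(delta)  # 3.0
```

Because lighting is assumed to be the same in both phases, the light-elicited component cancels in the difference, leaving the change attributable to workload and emotional content.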
Regarding the tasks, all participants had to remember as much of the interviewee’s responses as they could, since they were tested for their memory of the interview afterward. Participants’ mental or cognitive (work)load was manipulated by giving participants one or two tasks. One of these tasks was the same for all participants: Participants had to discriminate if the interviewee only referred to herself while giving an answer, or if she mentioned anybody else as well. If she only talked about herself, participants were to press the “Y” key on the keyboard at the end of an answer, while they were to press the “X” key if the interviewee mentioned someone else (“Verbal Task”). In addition, participants in the high-load conditions (NEUH and NEGH) were asked to discriminate the match between the interviewee’s facial expressions and the content of her verbal responses. If they were congruent (e.g., frowning while telling them that her mother had died), participants had to press the “P” key, but if they were incongruent (e.g., smiling while telling them that their mother died), they had to press the “O” key at the end of an answer (“Verbal–Visual Task”).
2.3. Apparatus and Material
The interview was conducted in a dimly lit laboratory, where the only light source was the monitor. The interview was conducted via a 31 cm × 28.5 cm monitor (resolution, 1920 pixels × 1080 pixels; 60 Hz screen refresh rate). Participants were seated in front of the monitor, with their gaze straight ahead, centered on the screen. Viewing direction and distance (60 cm) were supported by a chinrest. Participants wore a mobile, video-based eye tracker (Pupil Labs, Berlin, Germany; [
45]), sampling at 60 Hz, with an estimated gaze accuracy of 0.6° (according to the manufacturer). A PC running Windows 10 (Microsoft, Redmond, WA, USA) and Pupil Labs software (pupil-labs.com/pupil/; version 3.0.7, accessed on the 8 October 2021) was connected to the eye-tracker for recording of pupil size, eye movements, and the external visual surroundings. The Pupil-Player software ran with default settings, and the eye-tracker was calibrated before the start of the interview as instructed by the pupil-labs documentation (
https://docs.pupil-labs.com/core/software/pupil-capture/, accessed on the 8 October 2021). Participants also wore headphones (Gindoly PC Headset with microphone). BigBlueButton was used for the video-call, in which participants were able to see the interviewee via webcam. PyCharm Community Edition 2020.2.3 was employed for the interview questions and response collection. Participants responded on a standard “QWERTZ”-keyboard. The interview consisted of seven pre-experimental questions, 11 baseline questions, 16 questions regarding the COVID-19 pandemic, and 33 reworded questions from the Diagnostic Interview for Mental Disorders (DIPS; [
46]). Following the interview, a questionnaire consisting of 10 items tested whether participants paid attention to and, hence, recollected the content of the interviewee’s responses. Additionally, participants were asked to fill out the NASA-Task-Load-Index (NASA-TLX; [
47]) and the Self-Assessment Manikin (SAM; [
48]) at the end of the interview to measure their self-assessed felt emotional state and mental workload during the interview. Lastly, participants were asked whether they felt that the interview was faked (that they knew that the interviewee was actually a confederate of the researchers) to see if the credibility of the confederate’s answers influenced participants’ self-assessments and/or their performance.
2.4. Eye-Tracking Data Processing
The pupil data were exported using Pupil Player v3.0.7 with a minimum data confidence of 0.6. Confidence is the pupil detector’s assessment of how sure it is about a measurement; it is computed for each frame and each eye. Pupil Labs suggests that any confidence value greater than 0.6 corresponds to useful data. Further data processing was conducted using PyCharm Community Edition 2020.2.3. The whole dataset was reduced to a single eye, namely the one with the higher overall average confidence. The pupil data exported by Pupil Player provide two types of pupil diameter values: a 3D-corrected diameter of the pupil scaled to millimeters, based on an anthropomorphic average eyeball diameter and corrected for perspective, and a raw diameter in image pixels as observed in the eye image frame and not corrected for perspective. We used the latter for our analysis. The data were further restricted to those sections in which the interviewee gave her answers, removing all sections in which the participant read the questions to the interviewee. Artifacts (such as unexplainable spikes and dips in pupil-size measurements) were removed.
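The confidence filter and the reduction to a single eye might look as follows. This is a sketch, not the authors' script: the record keys (`eye_id`, `confidence`, `diameter`) are assumed to mirror the exported columns, the sample values are invented, and computing the average confidence after filtering is an assumption about the order of steps.

```python
from statistics import mean

MIN_CONFIDENCE = 0.6  # threshold named in the text

def select_eye_samples(samples, min_confidence=MIN_CONFIDENCE):
    """Drop low-confidence samples, then keep the eye with the higher mean confidence."""
    by_eye = {}
    for s in samples:
        if s["confidence"] >= min_confidence:
            by_eye.setdefault(s["eye_id"], []).append(s)
    if not by_eye:
        raise ValueError("no samples above the confidence threshold")
    best_eye = max(by_eye, key=lambda eye: mean([s["confidence"] for s in by_eye[eye]]))
    return best_eye, by_eye[best_eye]

# Hypothetical samples; diameter is the raw value in image pixels.
samples = [
    {"eye_id": 0, "confidence": 0.90, "diameter": 40.0},
    {"eye_id": 0, "confidence": 0.50, "diameter": 12.0},  # discarded
    {"eye_id": 0, "confidence": 0.80, "diameter": 41.0},
    {"eye_id": 1, "confidence": 0.70, "diameter": 39.0},
    {"eye_id": 1, "confidence": 0.65, "diameter": 38.5},
]
eye, kept = select_eye_samples(samples)
print(eye, len(kept))  # 0 2
```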
2.5. Extraction of Cognitive Load
The analysis of cognitive load is obtained via an algorithm for online analysis of cognitive load from pupil data (see [
49]) that has been validated for its internal validity [
19]. With the conditions of the study not being fully stable in terms of the changing video luminance and with the majority of pupillary activity being associated with reflexes to environmental luminance and not mental activity, we added a component to model and compensate these environmental effects to obtain a reliable estimation of cognitive load.
For this purpose, we implemented a dynamic baseline to compensate for environmental changes by exploiting the first-person camera as a luminance sensor. These luminance data were fed into empirical models of pupillary reflex behavior [50], to obtain a corresponding score of luminance-related pupil dilation, and of the temporal adaptation of pupillary behavior [51], to compute realistic temporal adaptations of the pupil (for a detailed description of the modeling of the dynamic baseline, please refer to Gollan [
As a next step, the dynamic baseline is subtracted from the measured pupil dilation score to obtain the effects that are associated with arousal and accordingly cognitive load. These data are processed by the cognitive-load analysis (see [
49]), which uses the mathematical modeling of the task-evoked pupillary response (TEPR) [
52] to measure cognitive load. This empirical model of pupillary response to cognitive activities has been transferred into an online analysis algorithm [
49]. The online deconvolution algorithm performs a curve-matching approach in a frame-wise feedback loop. To be precise, the algorithm measures the difference between the current modeled pupil value and the current actual pupil dilation measure. If this difference exceeds a threshold of 0.25% change in pupil dilation, a new attentional pulse, wi(si, ti), with scale si and temporal onset ti, is dynamically added to the list of attentional impulses.
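The frame-wise feedback loop can be illustrated with a deliberately simplified sketch. It is not the published algorithm. Several elements are assumptions not given in the text: the gamma-shaped impulse response with shape parameter 10.1 and a peak latency of 0.93 s (a form commonly used in pupillometry), the expression of the pupil signal as a fractional change relative to the dynamic baseline, and the heuristic of placing each new pulse so that its peak coincides with the current frame.

```python
import math

N_SHAPE = 10.1      # assumed shape parameter of the response function
T_MAX = 0.93        # assumed time to peak, in seconds
THRESHOLD = 0.0025  # 0.25% change, written as a fraction

def pupil_response(t):
    """Unit-peak impulse response (t/T_MAX)^n * exp(n * (1 - t/T_MAX)); zero for t <= 0."""
    if t <= 0.0:
        return 0.0
    x = t / T_MAX
    return x ** N_SHAPE * math.exp(N_SHAPE * (1.0 - x))

def modeled_value(pulses, t):
    """Sum of all pulse responses w_i(s_i, t_i) evaluated at time t."""
    return sum(scale * pupil_response(t - onset) for scale, onset in pulses)

def online_deconvolution(times, measured):
    """Frame-wise loop: add a pulse whenever the measurement exceeds the model by THRESHOLD."""
    pulses = []  # list of (scale s_i, onset t_i)
    for t, value in zip(times, measured):
        difference = value - modeled_value(pulses, t)
        if difference > THRESHOLD:
            # Assumed heuristic: the new pulse peaks at the current frame.
            pulses.append((difference, t - T_MAX))
    return pulses

# Synthetic 3 s recording at 60 Hz containing one response that starts at t = 1 s.
times = [i / 60.0 for i in range(180)]
measured = [0.05 * pupil_response(t - 1.0) for t in times]
pulses = online_deconvolution(times, measured)
print(len(pulses))
```

In this simplification, the list of pulses approximates the measured curve piecewise; the scales and onsets it returns are not meant to recover the single generating pulse.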
3. Results
3.1. Conventional Pupil-Size Measures
Looking at the average pupil-size responses, two trends could be observed. First, pupil size was very small at the start of each answer relative to the participant’s average pupil size. This was likely caused by the pupillary light reflex in response to the change in screen luminance from the screen on which participants had read a question to the image of the interviewee while she answered the question. Second, average pupil size rose between answers to baseline questions (Questions 8 to 18) and answers to questions differing in emotional content (between groups; Figure 1).
This increase in pupil diameter from the baseline to the manipulation (that is, to the interviewee’s emotionally more or less negative responses) was found in every single participant. When comparing the pupillary changes between groups, the largest average change between the baseline and the manipulations occurred in the high-mental-workload conditions (NEUH: with a delta of 3.134 pixels; NEGH: with a delta of 2.786 pixels), followed by participants in the low-workload/negative-content group (NEGL: delta of 1.260 pixels), and lastly the participants in the NEUL group (delta of 0.607 pixels). Although Figure 1 shows that participants in the NEGH group had the smallest pupil diameter of all groups, this only means that the average pupil diameter in this group was smaller in the baseline condition to begin with.
Figure 2 underlines the difference in pupil diameter change between the baseline and the manipulation in all groups. While participants in the NEUL group showed the smallest increase in pupil diameter, when the baseline average is subtracted from the average during manipulation, participants in the NEUH group show the biggest increase. Participants in the NEGL group as well as those in the NEGH group showed a similar difference, suggesting that the increase in pupil size due to the emotional content quantitatively limited the influence of mental workload on pupil size.
An ANOVA yielded a significant main effect of workload, F(1, 35) = 6.95, p = 0.013, η2 = 0.17, but the influence of Emotion, F(1, 35) = 0.03, p = 0.858, η2 = 0.002, and the interaction, F(1, 35) = 0.32, p = 0.576, η2 = 0.01, were not significant. A post hoc t-test between low- (M = 0.49, SD = 3.14) and high-workload conditions (M = 3.53, SD = 3.68) showed significantly smaller pupillary changes in low-workload conditions compared to high-workload conditions, t(35) = −2.667, p = 0.012, η2 = 0.89.
Further calculations showed that the main effect of workload was due to a faster average shrinking of pupil sizes of participants in the low-workload conditions over the course of the interview compared to participants in the high-workload group. A point of divergence, from which the pupil size of the low-workload groups significantly differed from those of the high-workload groups, is the 27th question in the interview: An ANOVA using only questions after the 27th question yielded a very similar pattern of results as the initial ANOVA, Workload: F(1, 35) = 7.73, p = 0.009, Emotion: F(1, 35) = 0.05, p = 0.818, and Workload × Emotion: F(1, 35) = 0.39, p = 0.537. The results were similar when only including trials in which participants correctly responded to both the Verbal Task as well as the Verbal–Visual Task (the latter only for participants in the high-workload conditions), Workload: F(1, 35) = 6.90, p = 0.013, Emotion: F(1, 35) = 0.15, p = 0.699, and Workload × Emotion: F(1, 35) = 0.13, p = 0.725.
For the following analyses, participants were divided into two groups based on their self-assessment scores in the NASA-TLX’s subscale “Mental-Demand” and the SAM’s subscale “Happiness” (see
Figure 3 and
Figure 4, respectively). Participants who scored equal or below the median score in the respective subscales were assigned to the “low” group, while those who scored above were assigned to the “high” group, and the resulting classifications were tested for their correspondences (with correspondence indicated by the symbol “~” in
Figure 3) to the manipulations (i.e., NASA-L/SAM-HAP ~ NEUL, NASA-H/SAM-HAP ~ NEUH, NASA-L/SAM-SAD ~ NEGL, NASA-H/SAM-SAD ~ NEGH).
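The median split described above amounts to the following rule. The sketch is illustrative only; the participant identifiers and scores are hypothetical.

```python
from statistics import median

def median_split(scores):
    """Label each participant 'low' (score <= median) or 'high' (score > median)."""
    cut = median(scores.values())
    return {pid: ("low" if s <= cut else "high") for pid, s in scores.items()}

# Hypothetical NASA-TLX "Mental-Demand" scores.
nasa = {"P01": 35, "P02": 60, "P03": 60, "P04": 80}
groups = median_split(nasa)
print(groups)  # {'P01': 'low', 'P02': 'low', 'P03': 'low', 'P04': 'high'}
```

Note that ties at the median fall into the "low" group, so the two groups need not be of equal size.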
When comparing the pupillary changes between groups based on their self-assessments, the results of the ANOVA were not significant, all Fs < 1.00 (NASA-TLX-Values, F[1, 35] = 0.34, p = 0.565, η2 = 0.01; SAM-Values, F[1, 35] = 0.02, p = 0.902, η2 = 0.001; NASA-TLX-Values × SAM-Values, F[1, 35] = 0.419, p = 0.522, η2 = 0.01), showing that the self-assessments should be interpreted with caution.
We also collapsed data across self-assessed emotions to check if this improves the picture. However, a two-sample t-test of pupil diameter changes between participants who scored high (which means that participants felt a high mental demand during the task) in the NASA-TLX subscale “Mental-Demand” (M = 2.37, SD = 4.02) and those who scored low (M = 1.63, SD = 3.51) showed no significant difference either, t(35) = −0.589, p = 0.559, η2 = 0.2 (see Figure 5).
The comparison of pupil diameter changes between participants who believed that the interview was fake (M = 1.70, SD = 4.58) and those who believed that it was real (M = 2.17, SD = 3.09) showed no significant difference either, t(35) = 0.369, p = 0.715, η2 = 0.12 (see Figure 6). This is not very surprising given that the workload manipulations were real, such that the beliefs regarding the authenticity of the interviewee should not corrupt the demands imposed by the task(s) at hand.
3.2. Algorithmically Extracted Workload Measure
The same analysis was conducted using data derived from a “cognitive workload” algorithm, which subtracts the light-elicited pupillary changes [19]. Quantitatively, the pattern of results is overall similar to that using the raw pupillary changes, with the exception of that of the NEGH Group (see Figure 7).
The increase in cognitive load between the baseline and the manipulation was, again, found in every single participant. When comparing the cognitive-load changes between groups, the largest average change between the baseline and the manipulations occurred in the low-workload/negative-content condition (NEGL: with a delta of 6.27), followed by participants in the high-workload/neutral-content group (NEUH: delta of 5.45), and lastly the participants in the NEGH group (delta of 4.47) and the NEUL group (delta of 3.45).
Figure 8 illustrates the difference in cognitive-load change between the baseline and the manipulation in all groups. While participants in the NEUL group showed the smallest increase in cognitive load when the baseline average was subtracted from the average during manipulation, participants in the NEUH group as well as those in the NEGL group showed similar increases. Participants in the NEGH group showed a rather small increase in cognitive load as well.
An ANOVA yielded a significant interaction between the workload and emotional content manipulation, F(1, 35) = 4.77, p = 0.036, η2 = 0.13, but neither the workload manipulation, F(1, 35) = 1.38, p = 0.248, η2 = 0.04, nor the main effect of emotion, F(1, 35) = 0.44, p = 0.509, η2 = 0.01, were significant. In neutral emotion conditions, a pairwise post hoc t-test between low- (M = 4.64, SD = 3.43) and high-workload conditions (M = 6.96, SD = 3.21) showed significantly smaller pupillary changes in low-workload conditions compared to high-workload conditions, t(35) = −2.108, p = 0.047, η2 = 0.7. In contrast, in negative emotion conditions, the t-test between low- (M = 5.42, SD = 3.49) and high-workload conditions (M = 6.18, SD = 3.52) did not show a significant change in pupil size, t(35) = −0.651, p = 0.047, η2 = 0.7. This result was due to a load effect in the neutral conditions that numerically even slightly reversed under negative conditions.
3.3. External Validation of Algorithmically Extracted Workload Measure
As explained, the algorithmic cognitive-load measure provides a continuous simulation of the pupil response to the current task that is free of a luminance-created contribution to the same pupillary response. To externally validate the cognitive-load algorithm, we looked at the observed correlations between ground truth and algorithmically derived cognitive-load measures. Ground truth was calculated in the conventional way, by subtracting average baseline pupillary size from the pupillary size at a precisely defined point in time during the manipulation phase. Note: This ground-truth method aims to eliminate the influence of luminance on the pupillary response in the manipulation phases and, thus, corresponds in conventional offline calculations of pupillary responses to what the cognitive-load algorithm does online. To calculate these observed correlations, we first selected the strongest maxima (or highest local peaks) in the modeled cognitive-load responses. In this way, we targeted points in time of assumedly highest signal-to-noise ratio and strongest change in cognitive load across time. In addition, to generalize the insights gained by correlations between the algorithmic model and the ground truth measure, we repeated this method for 20 time points randomly selected from a period of 250 ms before and after a local maximum of the algorithmic impulses.
For the pupil response to each answer of each participant, we correlated the algorithmic cognitive-load measure and baseline-corrected pupil-size changes (for example, see Figure 9 and Figure 10).
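The validation step described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names, the number of peaks, and the synthetic data layout are assumptions made for illustration, and the additional 20 random time points within 250 ms of each maximum are omitted for brevity. The sketch subtracts a mean baseline pupil size to obtain the ground truth, picks the highest local maxima of the modeled cognitive-load signal, and correlates both measures at those points.

```python
import numpy as np

def baseline_corrected(pupil, baseline_mean):
    """Conventional ground truth: pupil size minus the mean baseline pupil size."""
    return np.asarray(pupil, dtype=float) - baseline_mean

def strongest_peaks(signal, n_peaks):
    """Return the sorted indices of the n_peaks highest local maxima of a 1-D signal."""
    s = np.asarray(signal, dtype=float)
    # An interior sample is a local maximum if it exceeds its left neighbour
    # and is not smaller than its right neighbour.
    is_peak = (s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:])
    idx = np.flatnonzero(is_peak) + 1
    order = np.argsort(s[idx])[::-1]          # highest peaks first
    return np.sort(idx[order[:n_peaks]])

def peak_correlation(load, pupil, baseline_mean, n_peaks=5):
    """Pearson r between modeled load and ground truth at the strongest load peaks."""
    load = np.asarray(load, dtype=float)
    idx = strongest_peaks(load, n_peaks)       # n_peaks is a hypothetical choice
    truth = baseline_corrected(pupil, baseline_mean)
    return np.corrcoef(load[idx], truth[idx])[0, 1]

if __name__ == "__main__":
    # Synthetic example: the measured pupil size is a noisy, shifted copy of the load signal.
    rng = np.random.default_rng(1)
    t = np.linspace(0.0, 10.0, 1001)
    load = (1.0 + 0.1 * t) * np.sin(2.0 * np.pi * t)
    pupil = 3.0 + 2.0 * load + rng.normal(scale=0.2, size=t.size)
    print(peak_correlation(load, pupil, baseline_mean=3.0))
```

Restricting the correlation to the strongest maxima targets the samples with the presumably highest signal-to-noise ratio, which is the rationale given in the text.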
After calculating a correlation per individual question and for each participant, mean correlations for each participant (see Supplementary Table S6) and for each group were calculated: NEUL: R² = 0.634; NEUH: R² = 0.652; NEGL: R² = 0.565; NEGH: R² = 0.648. To statistically assess whether the observed correlations were significant, we applied a nonparametric resampling procedure. To this end, we randomly reshuffled individual samples of ground truth and algorithmic measures of all used pupillary responses for each participant and, thus, separately per group. We then calculated correlations on the reshuffled data as described above for the observed data and repeated this procedure 10,000 times. This created a distribution of 10,000 correlations per participant, from which we determined the statistical thresholds (p = 0.0125; i.e., Bonferroni corrected for tests in four separate groups). In other words, only observed correlations exceeding 98.75% of the surrogate correlations were deemed significant. The assessment showed a significant difference between observed measurements and randomly shuffled measurements in each participant and for each answer.
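The resampling logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name and arguments are hypothetical, and only one of the two paired series is shuffled, which is sufficient to break the pairing between ground truth and algorithmic measure. The Bonferroni-corrected level 0.05/4 = 0.0125 corresponds to the 98.75th percentile of the surrogate distribution.

```python
import numpy as np

def permutation_threshold(truth, load, n_perm=10_000, alpha=0.05, n_tests=4, seed=0):
    """Compare an observed Pearson r against a shuffled-surrogate distribution.

    Returns the observed correlation, the Bonferroni-corrected surrogate
    threshold, and whether the observed value exceeds that threshold.
    """
    rng = np.random.default_rng(seed)
    truth = np.asarray(truth, dtype=float)
    load = np.asarray(load, dtype=float)
    observed = np.corrcoef(truth, load)[0, 1]

    surrogate = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling one series destroys the sample-by-sample pairing.
        surrogate[i] = np.corrcoef(truth, rng.permutation(load))[0, 1]

    corrected_alpha = alpha / n_tests                            # 0.05 / 4 = 0.0125
    threshold = np.quantile(surrogate, 1.0 - corrected_alpha)    # 98.75th percentile
    return observed, threshold, bool(observed > threshold)
```

A call such as `permutation_threshold(truth, load)` would be run once per participant. The surrogate distribution makes no normality assumption, which is the main reason to prefer it over a parametric test here.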
The same nonparametric resampling procedure was conducted to assess differences in correlation between the four groups. Here, we observed a significantly reduced correlation between ground truth and cognitive load in the NEGL group (NEUL-NEGL: p = 0.004; NEUH-NEGL: p < 0.001; NEGH-NEGL: p = 0.001). This loss in correlation might also be related to the missing evidence for a clear cognitive-load-dependent pupil-size effect in the NEGL group compared to the NEGH group.
A boxplot diagram for each participant of each group, showing the ranges of all correlation scores per answer, provides additional descriptive data on the observed correlations (see Figure 11).
In addition to correlating the changes in cognitive load with the changes in pupil size, we calculated Pearson correlations to relate the measured luminance to the light-elicited pupillary changes and to the algorithm’s extracted cognitive load. In the first step, we observed that changes in measured luminance, mostly due to the changes between question and answer displays, were negatively correlated with pupil size, r(1,262,808) = −0.55, p < 0.001. This was as expected since higher luminance should lead to pupillary constriction. In contrast, in the second step, we observed almost no meaningful correlation between changes in luminance and cognitive load, r(1,262,808) = 0.02, p < 0.001, although this correlation was significant due to the huge number of samples. Together, results suggested that (1) changes in pupil size due to changes in luminance were successfully captured and that (2) luminance-elicited pupillary responses did not compromise the sensitivity of the algorithm for detection of cognitive load.
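The contrast between statistical significance and practical relevance in these two correlations can be made concrete with a short worked calculation. The sketch below uses the standard conversion of a Pearson r into a t statistic, t = r·sqrt(df / (1 − r²)); it assumes that the value in parentheses after r is the degrees of freedom (n − 2), as is conventional, and the function name is hypothetical.

```python
import math

def t_from_r(r, df):
    """t statistic for a Pearson correlation r with df = n - 2 degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))

DF = 1_262_808  # degrees of freedom as reported in the text

for label, r in [("luminance vs. pupil size", -0.55),
                 ("luminance vs. cognitive load", 0.02)]:
    # r^2 is the proportion of variance shared by the two measures.
    print(f"{label}: r^2 = {r * r:.4f}, t = {t_from_r(r, DF):.1f}")
```

With more than a million samples, even r = 0.02 yields a t value far beyond conventional critical values, while r² = 0.0004 shows that luminance and extracted cognitive load share only about 0.04% of their variance, compared with roughly 30% (r² = 0.3025) for luminance and pupil size.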
3.4. Self-Report Data
The participants’ subjective assessment of task difficulty was measured using the NASA-TLX, as can be seen in Figure 12. Consistently lower felt mental demand was observed in neutral relative to negative conditions rather than in low-load relative to high-load conditions (Emotion: F[1, 35] = 5.69, p = 0.023; Workload: F[1, 35] = 0.88, p = 0.355; Interaction: F[1, 35] = 1.89, p = 0.179), where the post hoc t-test revealed significantly higher levels of mental demand in the negative conditions (M = 3.33, SD = 4.23) than in the neutral conditions (M = −0.33, SD = 5.06), regardless of workload manipulation, t(35) = −2.35, p = 0.024, η² = 0.78. A similar difference concerned participants’ felt physical demand (Emotion: F[1, 35] = 6.28, p = 0.017; Workload: F[1, 35] = 0.25, p = 0.619; Interaction: F[1, 35] = 1.572, p = 0.219), where the post hoc t-test revealed significantly higher levels of physical demand in the negative conditions (M = −3.39, SD = 4.46) than in the neutral conditions (M = −6.72, SD = 3.43), regardless of workload manipulation, t(35) = −2.51, p = 0.017, η² = 0.84. The same held for felt frustration during the interview (Emotion: F[1, 35] = 7.99, p = 0.008; Workload: F[1, 35] = 0.20, p = 0.658; Interaction: F[1, 35] = 0.01, p = 0.929), where the post hoc t-test revealed significantly higher levels of frustration in the negative conditions (M = −2.22, SD = 5.81) than in the neutral conditions (M = 3.06, SD = 5.07), regardless of the workload manipulation, t(35) = −2.91, p = 0.006, η² = 0.97. A Tukey-HSD post hoc evaluation revealed a significant difference between groups of different emotional manipulation in the “Mental-Demand” scale (p = 0.023, 95% CI = [0.534, 6.799]), in the “Physical-Demand” scale (p = 0.017, 95% CI = [0.626, 6.041]), and in the “Frustration” scale (p = 0.008, 95% CI = [1.475, 9.080]). All of these were perceived as stronger in the negative emotion condition compared to the neutral condition. All other differences in all other scales, as well as in all pairwise comparisons, were not significant.
The participants’ self-assessed performance regarding the verbal and verbal–visual task in the high- and low-workload conditions is backed up by their actual accuracy rates in the verbal task (see Supplementary Figure S2). However, somewhat at odds with an experienced higher demand during the negative than during the neutral conditions, there were no significant performance (accuracy) differences in the verbal–visual task between the neutral and the negative high-workload groups (see Supplementary Figure S3). There were also no other significant differences between groups. In the memory test at the end of the interview (see Supplementary Figure S1), groups also showed very similar results.
Subjectively felt emotions during the interview were assessed using the Self-Assessment Manikin (SAM), measuring the participants’ happiness, arousal, and control (Figure 13). As expected, higher levels of happiness were observed in the neutral conditions (Emotion: F[1, 35] = 17.24, p < 0.001; Workload: F[1, 35] = 1.72, p = 0.199; Interaction: F[1, 35] = 0.43, p = 0.517), where the post hoc t-test revealed significantly higher levels of happiness in the neutral conditions (M = 6.17, SD = 1.54) than in the negative conditions (M = 4.06, SD = 1.51), regardless of workload manipulation, t(35) = 4.15, p < 0.001, η² = 1.38, while higher levels of arousal were observed in the negative conditions (Emotion: F[1, 35] = 5.91, p = 0.021; Workload: F[1, 35] = 3.01, p = 0.092; Interaction: F[1, 35] = 1.08, p = 0.305). There was no difference between groups with regard to control (Emotion: F[1, 35] = 2.29, p = 0.140; Workload: F[1, 35] < 0.01, p = 1.000; Interaction: F[1, 35] = 0.89, p = 0.351).
4. Discussion
In the current study, we investigated the important question of whether cognitive (work)load or mental demands can be measured by pupil size, regardless of the emotional content of the stimuli used in a task. To this end, we selected a controlled but nonetheless more realistic scenario in which the measurement of cognitive load is of interest: clinical interviews. From an applied perspective, the measurement of cognitive load in this situation is important for a variety of reasons, not least because the correct usage of data obtained in clinical interviews often also depends on a correct comprehension of the content of the messages conveyed in such an interview, be this through verbal or non-verbal communication [7]. It should be noted that the currently available self-report measures are not necessarily consistent [8], so an unobtrusive and objective measure would be desirable.
With respect to our hypotheses, we found the following results. Regarding Hypothesis 1, we found evidence that the algorithm was partly able to differentiate between low- and high-cognitive workload manipulations.
With respect to Hypothesis 2, however, we found that the effect was numerically, although not statistically significantly, larger in neutral- than in negative-content conditions. In the negative-content conditions, pupil-size differences between workload conditions were absent.
Concerning Hypothesis 3, nevertheless, we did find evidence that cognitive load can be successfully derived from measured pupillary responses. This was reflected in the main effect of cognitive load on an increased pupillary response. This pattern of results was found using a conventional pupil-size measure of cognitive load—the difference in pupil size between the manipulation (here, the clinical interview proper) and a baseline phase during which the same light-elicited pupillary responses were observed as during the manipulation (but without the critical manipulation). In the current study, the baseline measure of pupillary response was recorded early during the interview, when questions concerned basic information about the interviewee.
Regarding Hypothesis 4, importantly, the observed numerical pattern of an interaction between cognitive workload and emotional content of the interview became significant in an analysis based on arithmetically calculated cognitive-load measures. This was found when we applied an algorithm that uses the currently measured luminance to calculate a pupillary light response, which in turn is subtracted online from the measured pupil size to extract cognitive load. The current study, thus, shows that the algorithmic cognitive-load measure might be more sensitive to the true pattern within the numerical data than the conventional way of calculating a pupillary cognitive-load response (i.e., based on an offline subtraction of pupil sizes during baseline from pupil sizes during manipulation phases).
Finally, with respect to Hypothesis 5, the algorithmic online measure was externally validated by reasonably high and significant correlations with the standard offline measure of cognitive workload in all of the groups and for each of our participants.
In summary, pupil sizes are a promising tool for the relatively unobtrusive measurement of cognitive load in applied scenarios, such as clinical interviews, and it might even be possible to derive estimates of the loads online, during data recording. This is particularly interesting for applied settings, such as clinical interviews, where it is essential that a continuous measure of load is available without delay, so as to allow the interviewer to pay special attention to what was said or to ask the patient to repeat the information. The problem is, however, that the sensitivity of this measure can be decreased by concomitant emotional content as demonstrated in the negative conditions.
Of further interest, the very low correlation between luminance-elicited and load-related pupillary changes indicated that at least the range of luminance values used in the current study does not impose a limit on the sensitivity of the load-dependent pupillary change measured by the algorithm. This was the case despite a significant negative correlation between measured luminance and pupillary responses that indicated that the luminance differences were strong enough to elicit a pupillary response.
From an applied perspective, in the future, the increasingly pervasive presence of cameras in laptops and smartphones may be used to measure pupil sizes without the need for additional equipment and, thus, to measure cognitive load without any additional technical requirements. Technologies such as smart glasses may further boost the widespread availability of the necessary camera or eye-tracking equipment. However, we have to admit that, currently, the resolution of cameras built into laptops and smartphones is insufficient, especially in combination with the frequently suboptimal recording conditions with low luminance, shadows, faces turned away from the screens and cameras, or glasses blocking the detection of the comparatively small pupil.
5. Limitations
Our cognitive-load manipulation was relatively weak. This was reflected not only in similar performance (accuracy) levels in the high- and low-load conditions but also in participants’ subjective self-assessments of cognitive load, which were the same in high- and low-load conditions. In contrast, to our surprise, participants subjectively felt and reported more cognitive and physical demands in negative than in neutral conditions, a subjective rating that was not reflected in our pupil-size measures.
Another important limitation concerns doubts about the success of our manipulation of emotional content. Although participants subjectively felt more frustrated and generally reported more negative feelings under negative than under neutral conditions, some participants did not believe in the veridicality of the answers of the confederate/interviewee. This implies that the manipulation of the participants’ emotions was probably relatively weak. This problem was potentially exacerbated by the semi-controlled nature of our task as we will discuss next.
The current study used semi-controlled conditions: On the one hand, control itself was, thus, partly limited. For example, emotional content manipulations were confounded with the different exact words or phrases used by the interviewee. Another example is that the exact behavior of the confederate/interviewee varied slightly from participant to participant (e.g., in the form of the exact length of the verbal answer or the exact strength of a facial expression), even within groups, thereby creating additional and unwanted noise in the data. On the other hand, conducting interviews in front of a computer screen allowed some degree of control (e.g., easier calibration of the eye tracker) but might have compromised the perception of nonverbal facial expressions and could have mitigated emotional responses to the stimulus. In this way, control means that our conditions were not entirely the same as those encountered in the field, and our study is, thus, only a step in the direction of closing the gap between laboratory and field conditions.
Another potential limitation concerns insufficient power. It is well known that insufficient power can lead to oversights of existing effects that simply do not turn out to be significant. However, it is probably still less well known that a lack of sufficient power can also compromise the replicability of significant findings [53]. In this respect, our study could also stand improvement in terms of a larger sample besides the aforementioned improvements in terms of the manipulations.
We used a measure of cognitive load that disregards the impact of saccades on pupil dilations. Our research is, thus, silent on how to incorporate saccade-elicited influences on pupil size into pupillary load measures. We think rightly so, as long as it is unclear if and how saccades are related to cognitive load. On the one hand, temporal attention and, thus, a particular cognitive-load factor might temporarily decrease saccade frequencies and related pupillary changes (cf. [54]). On the other hand, spatial attention and, thus, a different factor affecting cognitive load might be accompanied by (micro-)saccades and their associated pupillary changes (e.g., [56]). However, we agree that future research would benefit from discriminating between saccade-related pupillary changes and other load-elicited pupil responses in different tasks to understand when saccade-related pupil changes are true reflections of cognitive load and when they are better filtered out as not related to load or other difficulty-related measures (cf. [
Since our interview was conducted in front of a computer screen, generalizability to a face-to-face scenario is not guaranteed. Not only might more eye movements under 3D conditions lead to less accurate pupil-size data; head and body movements of the interviewer reacting to the interviewee would also imply, for example, that light-elicited pupil changes can be more difficult to capture in a less-controlled 3D environment, where the direction of the scene camera and the looking direction can diverge more strongly, leading to less-accurate cognitive-load measurements.
Finally, we conducted our manipulations of emotional content and cognitive load independently of one another, assuming that these two manipulations affected pupil sizes in largely independent ways. However, subjectively, participants assessed the task demands (or load) and their frustration as significantly higher under negative than under neutral emotional conditions. Interestingly, this was not reflected in their performance accuracy. Participants in neutral and negative emotional conditions performed equally well. Thus, the subjective feelings could not be backed up by objective performance indicators. Nonetheless, the assumed independence of the two manipulations—load on the one hand, emotions on the other hand—needs further scrutiny in future studies, too.