1 Introduction

The task of gaze estimation from a single low-cost RGB sensor is an important topic in Computer Vision and Machine Learning. It is an essential component in intelligent user interfaces [4, 13] and user state awareness [15, 20], and serves as an input modality to Computer Vision problems such as zero-shot learning [23], object referral [2], and human attention estimation [9]. Unobservable person-specific differences inherent in the problem are challenging to tackle, and as such, high-accuracy general-purpose gaze estimators are hard to attain. In response, person-specific adaptation techniques [29, 30, 37] have seen much attention, albeit at the cost of requiring test-user-specific labels. We propose a dataset and accompanying method which explicitly and holistically combine multiple sources of information. This novel approach yields large performance improvements without needing ground-truth labels from the final target user. Our large-scale dataset (EVE) and network architecture (GazeRefineNet) effectively showcase the newly proposed task and demonstrate performance improvements of up to \(28\%\).

The human gaze can be seen as a closed-loop feedback system, whereby the appearance of target objects or regions (or visual stimuli) elicits particular movements of the eyes. Many works consider this interplay in related but largely separate strands of research, for instance in estimating gaze from images of the user (bottom-up, e.g. [50]) or in post-hoc comparison of the eye movements with the visual distribution of the presented stimuli (top-down, e.g. [42]). Furthermore, gaze estimation is often posed as a frame-by-frame estimation problem despite its rich temporal dynamics. In this paper, we suggest that by taking advantage of the interaction between the user's eye movements and what they are looking at, significant improvements in gaze estimation accuracy can be attained even in the absence of labeled samples from the final target user. This can be done without explicit gaze estimator personalization. We are not aware of existing datasets that would allow for the study of these semantic relations and temporal dynamics. Therefore, we introduce a novel dataset designed to facilitate research on the joint contributions of dynamic eye gaze and visual stimuli. We dub this dataset EVE (End-to-end Video-based Eye-tracking). EVE is collected from 54 participants and consists of 4 camera views, over 12 million frames and 1327 unique visual stimuli (images, videos, text), adding up to approximately 105 h of video data in total.

Accompanying the proposed EVE dataset, we introduce a novel bottom-up-and-top-down approach to estimating the user's point of gaze. The Point-of-Gaze (PoG) refers to the actual target of a person's gaze as measured on the screen plane in metric units or pixels. In our method, we exploit the fact that more visually salient regions on a screen often coincide with the gaze. Unlike previous methods which adopt and thus depend on pre-trained models of visual saliency [6, 42, 43], we define our task as that of online and conditional PoG refinement. In this setting, a model takes raw screen content and an initial gaze estimate as explicit conditions to predict the final, refined PoG. Our final architecture yields significant improvements in predicted PoG accuracy on the proposed dataset. We achieve a mean test error of 2.49 degrees in gaze direction, or 2.75 cm (95.59 pixels) in screen-space Euclidean distance. This is an improvement of up to \(28\%\) compared to gaze estimates from an architecture that does not consider screen content. We thus demonstrate a meaningful step towards the proliferation of screen-based eye-tracking technology.

In summary, we propose the following contributions:

  • A new task of online point-of-gaze (PoG) refinement, which combines bottom-up (eye appearance) and top-down (screen content) information to allow for a truly end-to-end learning of human gaze,

  • EVE, a large-scale video dataset of over 12 million frames from 54 participants consisting of 4 camera views, natural eye movements (as opposed to following specific instructions or smoothly moving targets), pupil size annotations, and screen content video to enable the new task (Table 1),

  • a novel method for eye gaze refinement which exploits the complementary sources of information jointly for improved PoG estimates, in the absence of ground-truth annotations from the user.

In combination these contributions allow us to demonstrate a gaze estimator performance of \(2.49^\circ \) in angular error, comparing favorably with supervised person-specific model adaptation methods [7, 29, 37].

Table 1. Comparison of EVE with existing screen-based datasets. EVE is the first to provide natural eye movements (free-viewing, without specific instructions) synchronized with full-frame user-facing video and screen content

2 Related Work

In our work we consider the task of remote gaze estimation from RGB, where a monocular camera is located away from and facing a user. We outline here recent approaches, proposed datasets, and relevant methods for refining gaze estimates.

2.1 Remote Gaze Estimation

Remote gaze estimation from unmodified monocular sensors is challenging due to the lack of reference features such as reflections from near infra-red light sources. Recent methods have increasingly used machine learning methods to tackle this problem [3, 32, 36] with extensions to allow for greater variations in head pose [11, 31, 40]. The task of cross-person gaze estimation is defined as one where a model is evaluated on a previously unseen set of participants. Several extensions have been proposed for this challenging task in terms of self-assessed uncertainty [8], novel representations [38, 39, 52], and Bayesian learning [49, 50].

Novel datasets have contributed to the progress of gaze estimation methods and the reporting of their performance, notably in challenging illumination settings [25, 53, 54] or at greater distances from the user [14, 16, 24] where image details are lost. Screen-based gaze estimation datasets have received particular focus [11, 16, 21, 25, 33, 53, 54] due to their practical relevance, with digital devices being used ever more frequently. Very few existing datasets include videos, and even those often consist of participants gazing at points [21] or following smoothly moving targets only (via smooth pursuits) [16]. While the RT-GENE dataset includes natural eye movement patterns such as fixations and saccades, it is not designed for the task of screen-based gaze estimation [14]. The recently proposed DynamicGaze dataset [47] includes natural eye movements from 20 participants gazing upon video stimuli. However, it is yet to be publicly released and it is unclear whether it will contain screen-content synchronization. We are the first to provide a video dataset with full camera frames and associated eye gaze and pupil size data, in conjunction with screen content. Furthermore, EVE includes a large number of participants (N = 54) and frames (12.3M) over a large set of visual stimuli (1004 images, 161 videos, and 162 Wikipedia pages).

2.2 Temporal Models for Gaze Estimation

Temporal modelling of eye gaze is an emerging research topic. An initial work demonstrates the use of a recurrent neural network (RNN) in conjunction with a convolutional neural network (CNN) for feature extraction [35]. While no improvements are shown for gaze estimates in screen space, results on the smooth pursuits sequences of the EYEDIAP dataset [16] are encouraging. In [47], a top-down approach for gaze signal filtering is presented, where a probabilistic estimate of state (fixation, saccade, or smooth pursuit) is first made, and a state-specific linear dynamical system is then applied to refine the initially predicted gaze. Improvements in gaze estimation performance are demonstrated on a custom dataset. As one of our evaluations, we re-confirm previous findings that a temporal gaze estimation network can improve on a static gaze estimation network. We demonstrate this on our novel video dataset, which, due to its diversity of visual stimuli and large number of participants, should allow future works to benchmark their improvements well.

2.3 Refining Gaze Estimates

While eye gaze direction (and the subsequent Point-of-Gaze) can be predicted from images of the eyes or face of a given user alone, an initial estimate can be improved with additional data. Accordingly, various methods have been proposed to this end. A primary example is that of using a few samples of labeled data - often dubbed “person-specific gaze estimation” - where a pre-trained neural network is fine-tuned or otherwise adapted on very few samples of a target test person's data, to yield performance improvements on the final test data from the same person. Building on top of initial works [25, 39], more recent works have demonstrated significant performance improvements with nine or fewer calibration samples [7, 29, 30, 37, 51]. Although the performance improvements are impressive, all such methods still require labeled samples from the final user.

Alternative approaches to refining gaze estimates in the screen-based setting consider the predicted visual saliency of the screen content. Given a sufficient time horizon, it is possible to align the PoG estimates made so far with an estimate of visual saliency [1, 6, 43, 45, 48]. However, visual saliency estimation methods can over-fit to their training data. Hence, methods have been suggested to merge the estimates of multiple saliency models [42] or to use face positions as likely gaze targets [43]. We propose an alternative and direct approach, which formulates the problem of gaze refinement as one that is conditioned explicitly on screen content and an initial gaze estimate.

Fig. 1. EVE data collection setup and example of (undistorted) frames collected from the 4 camera views, with example eye patches shown as insets.

3 The EVE Dataset

To study the semantic relations and temporal dynamics between eye gaze and visual content, we identify a need for a new gaze dataset that:

  1. allows for the training and evaluation of temporal models on natural eye movements (including fixations, saccades, and smooth pursuits),

  2. enables the training of models that can process full camera frame inputs to yield screen-space Point-of-Gaze (PoG) estimates,

  3. and provides a community-standard benchmark for a good understanding of the generalization capabilities of upcoming methods.

Furthermore, we consider the fact that the distribution of visual saliency on a computer screen at a given time is indicative of likely gaze positions. In line with this observation, prior work reports difficulty in generalization when considering saliency estimation and gaze estimation as separate components [42, 43]. Thus, we define the following further requirements for our new dataset:

  1. a video of the screen content must be recorded, synchronized with eye gaze data,

  2. a sufficiently large set of visual stimuli must be presented to allow algorithms to generalize better without over-fitting to a few select stimuli,

  3. and lastly, gaze data must be collected over time without instructing participants to gaze at specific pin-point targets, such that they behave naturally, as they would in a real-world setting.

We present in this section the methodologies we adopt to construct such a dataset, and briefly describe its characteristics. We call our proposed dataset EVE, which stands for “a dataset for enabling progress towards truly End-to-end Video-based Eye-tracking algorithms”.

3.1 Captured Data

The minimum requirements for constructing our proposed dataset are the captured video from a webcam, ground-truth gaze data from a commercial eye tracker, and screen frames from a given display. Furthermore, we:

  • use the Tobii Pro Spectrum eye tracker, which reports high accuracy and precision in predicted gaze even in the presence of natural head movements,

  • add a high performance Basler Ace acA1920-150uc machine vision camera with global shutter, running at 60 Hz,

  • install three Logitech C922 webcams (30 Hz) for a wider eventual coverage of head orientations, assuming that the final user will not only be facing the screen in a fully-frontal manner (see Fig. 1b),

  • and apply MidOpt BP550 band-pass filters to all webcams and the machine vision camera to remove reflections and glints on eyeglass and cornea surfaces caused by the powerful near-infra-red LEDs used by the Tobii eye tracker.

All video camera frames are captured at \(1920\times 1080\) pixels resolution, but the superior timestamp reliability and image quality of the Basler camera are expected to yield better estimates of gaze compared to the webcams.

The data captured by the Tobii Pro Spectrum eye tracker can be of very high quality, but this quality is subject to participant and environment effects. Hence, to ensure data quality and reliability, an experiment coordinator is present during every data collection session to qualitatively assess eye-tracking data via a live stream of camera frames and eye movements. Additional details on our hardware setup and the steps we take to ensure the best possible eye tracking calibration and subsequent data quality are described in the supplementary materials.

3.2 Presented Visual Stimuli

A large variety of visual stimuli are presented to our participants. Specifically, we present image, video, and Wikipedia page stimuli (shown later in Fig. 4).

For static image stimuli, we select the widely used MIT1003 dataset [22], originally created for the task of image saliency estimation. Most images in the dataset span 1024 pixels in either the horizontal or vertical dimension. We randomly scale each image to between 1320 and 1920 pixels in width, or 480 to 1080 pixels in height, for display on our 25-inch screen (with a resolution of 1080p).

Fig. 2. Head orientation and gaze direction distributions compared with existing screen-based gaze datasets [25, 54]. We capture a larger range of the parameter space due to our multi-view camera setup and 25-inch display. 2D histogram plot values are normalized and colored with log-scaling.

All video stimuli are displayed in 1080p resolution (to span the full display) and are taken from the DIEM [34], VAGBA [27], and Kurzhals et al. [26] datasets. These datasets consist of 720p, 1080p, and 1080p videos respectively, and are thus of high resolution compared to other video-based saliency datasets. DIEM consists of various videos, such as trailers and documentaries, sampled from public repositories. VAGBA includes human movement or interactions in everyday scenes, and the Kurzhals et al. dataset contains purposefully designed video sequences with intentionally salient regions. To further increase the variety of the final set of video stimuli, we select 23 videos from Wikimedia (at 1080p resolution).

Wikipedia pages are randomly selected on-the-fly by opening the following link in a web browser: https://en.m.wikipedia.org/wiki/Special:Random#/random and participants are then asked to freely view and navigate the page, as well as to click on links. Links leading to pages outside of Wikipedia are automatically removed using the GreaseMonkey web browser extension.

In our data collection study, we randomly sample the image and video stimuli from the mentioned datasets. We ensure that each participant observes 60 image stimuli (for three seconds each), at least 12 min of video stimuli, and six minutes of Wikipedia stimuli (three 2-min sessions). At the conclusion of data collection, we found that each image stimulus was observed 3.35 times on average (\(SD=0.73\)), and each video stimulus 9.36 times (\(SD=1.28\)).

3.3 Dataset Characteristics

The final dataset is collected from 54 participants (30 male, 23 female, 1 unspecified). Details of the responses to our demographics questionnaire, along with our dataset pre-processing steps, can be found in the supplementary materials. We ensure that the subjects in both the training and test sets exhibit diverse gender, age, and ethnicity, some with and some without glasses.

In terms of gaze direction and head orientation distributions, EVE compares favorably to popular screen-based datasets such as MPIIFaceGaze  [54] and GazeCapture [25]. Figure 2 shows that we cover a larger set of gaze directions and head poses. This is likely due to the 4 camera views that we adopt, together with a large screen size of 25 in. (compared to the other datasets).

Fig. 3. We adopt (a) a simple EyeNet architecture for gaze direction and pupil size estimation with an optional recurrent component, and propose (b) a novel GazeRefineNet architecture for label-free PoG refinement using screen content.

4 Method

We now discuss a novel architecture designed to exploit the various sources of information in the dataset and to serve as a baseline for follow-up work. We first introduce a simple eye gaze estimation network (\({\text {EyeNet}}_{\text {static}}\)) and its recurrent counterparts (\({\text {EyeNet}}_{\text {RNN}}\), \({\text {EyeNet}}_{\text {LSTM}}\), \({\text {EyeNet}}_{\text {GRU}}\)) for the task of per-frame or temporal gaze and pupil size estimation (see Fig. 3a). As the EVE dataset contains synchronized visual stimuli, we propose a novel technique to further process these initial eye-gaze predictions by taking the raw screen content directly into consideration. To this end, we propose the GazeRefineNet architecture (Fig. 3b) and describe its details in the second part of this section.

4.1 EyeNet Architecture

Learning-based eye gaze estimation models typically output their predictions as a unit direction vector or as angles in a spherical coordinate system. The common metric for evaluating predicted gaze directions is the angular error in degrees. Assuming that the predicted gaze direction is represented by a 3-dimensional unit vector \(\hat{\mathbf {g}}\), the angular error loss given ground-truth \(\mathbf {g}\) is then:

$$\begin{aligned} \mathcal {L}_\mathrm {gaze}\left( \mathbf {g},\,\hat{\mathbf {g}}\right) = \frac{1}{NT}\sum ^N\sum ^T \frac{180}{\pi } \arccos \left( \frac{\mathbf {g}\cdot \hat{\mathbf {g}}}{\Vert \mathbf {g}\Vert \Vert \hat{\mathbf {g}}\Vert }\right) \end{aligned}$$
(1)

where a mini-batch consists of N sequences each of length T.
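
For concreteness, the following is a minimal PyTorch sketch of this angular loss (Eq. 1); the assumed tensor shape of (N, T, 3) and the numerical clamping are our own choices and not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def angular_gaze_loss(g_true: torch.Tensor, g_pred: torch.Tensor) -> torch.Tensor:
    """Mean angular error in degrees between 3D gaze vectors of shape (N, T, 3)."""
    # Normalize both vectors so that their dot product equals the cosine of the angle.
    g_true = F.normalize(g_true, dim=-1)
    g_pred = F.normalize(g_pred, dim=-1)
    # Clamp to avoid NaN gradients from arccos at the boundaries.
    cos_sim = (g_true * g_pred).sum(dim=-1).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.rad2deg(torch.acos(cos_sim)).mean()
```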

To calculate the PoG, the predicted gaze direction must first be combined with the 3D gaze origin position \(\mathbf {o}\) (determined during data pre-processing), yielding a gaze ray with 6 degrees of freedom. We can then intersect this ray with the screen plane, using the camera transformation with respect to the screen plane, to calculate the PoG. The physical screen dimensions (our \(1920\times 1080\) screen is 553 mm wide and 311 mm tall) can be used to convert the PoG to pixel units for an alternative interpretation. We denote the predicted PoG in centimeters as \(\hat{\mathbf {s}}\).
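
A sketch of this gaze-ray/screen intersection is given below. We assume a calibrated rotation R and translation t mapping camera to screen coordinates (with the screen occupying the z = 0 plane, origin at its top-left corner, units in millimetres); the pixel conversion uses the 553 mm by 311 mm dimensions stated above.

```python
import numpy as np

def point_of_gaze(o_cam: np.ndarray, g_cam: np.ndarray,
                  R: np.ndarray, t: np.ndarray):
    """Intersect a gaze ray (origin o_cam, direction g_cam, camera coords, mm)
    with the screen plane z = 0 (screen coords) and return PoG in cm and pixels."""
    o = R @ o_cam + t          # gaze origin in screen coordinates
    g = R @ g_cam              # gaze direction in screen coordinates
    lam = -o[2] / g[2]         # solve o_z + lambda * g_z = 0
    pog_mm = o + lam * g       # intersection point (x, y, 0) in millimetres
    pog_cm = pog_mm[:2] / 10.0
    pog_px = pog_mm[:2] / np.array([553.0 / 1920.0, 311.0 / 1080.0])  # mm per pixel
    return pog_cm, pog_px
```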

Assuming that the pupil size can be estimated, we denote it as \(\hat{\mathbf {p}}\) and define an \(\ell _1\) loss given ground-truth \(\mathbf {p}\) as:

$$\begin{aligned} \mathcal {L}_\mathrm {pupil}\left( \mathbf {p},\,\hat{\mathbf {p}}\right) = \frac{1}{NT}\sum ^N\sum ^T {\Vert \mathbf {p}-\hat{\mathbf {p}}\Vert }_{1} \end{aligned}$$
(2)

Both gaze direction and pupil size are predicted by a ResNet-18 architecture [17]. To make the network recurrent, we optionally incorporate an RNN [46], LSTM [18], or GRU [10] cell.
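
Below is a minimal sketch of such an EyeNet-style model in PyTorch (ResNet-18 features, an optional GRU cell, and separate gaze and pupil-size heads). The hidden size, the pitch/yaw output parameterization, and other details are illustrative assumptions; the authors' exact configuration is described in their supplementary materials.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EyeNet(nn.Module):
    def __init__(self, hidden: int = 128, recurrent: bool = True):
        super().__init__()
        backbone = resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # -> (B, 512, 1, 1)
        self.recurrent = recurrent
        self.rnn = nn.GRUCell(512, hidden) if recurrent else None
        head_in = hidden if recurrent else 512
        self.gaze_head = nn.Linear(head_in, 2)   # pitch and yaw angles
        self.pupil_head = nn.Linear(head_in, 1)  # pupil size in mm

    def forward(self, eye_frames: torch.Tensor):
        """eye_frames: (B, T, 3, H, W) sequence of eye patches."""
        h, gazes, pupils = None, [], []
        for t in range(eye_frames.shape[1]):
            f = self.features(eye_frames[:, t]).flatten(1)   # (B, 512)
            h = self.rnn(f, h) if self.recurrent else f
            gazes.append(self.gaze_head(h))
            pupils.append(self.pupil_head(h))
        # Pitch/yaw can be converted to 3D unit vectors for the angular loss above.
        return torch.stack(gazes, dim=1), torch.stack(pupils, dim=1)
```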

4.2 GazeRefineNet Architecture

Given the left and right eye images \(\mathbf {x}_l\) and \(\mathbf {x}_r\) of a person, we hypothesize that incorporating the corresponding screen content can improve the initial PoG estimate. Provided that an initial estimate of PoG \(\tilde{\mathbf {s}} = f\left( \mathbf {x}\right) \) can be made for the left and right eyes \(\tilde{\mathbf {s}}_l\) and \(\tilde{\mathbf {s}}_r\) respectively, we first take the average of the predicted PoG values with \(\tilde{\mathbf {s}}=\frac{1}{2}\left( \tilde{\mathbf {s}}_l+\tilde{\mathbf {s}}_r\right) \) to yield a single estimate of gaze. Here f denotes the previously described EyeNet. We define and learn a new function, \(\mathbf {s} = g\left( \mathbf {x}_S, \tilde{\mathbf {s}} \right) \), to refine the EyeNet predictions by incorporating the screen content and temporal information. The function g is parameterized by a fully convolutional neural network (FCN) to best preserve spatial information. Following the same line of reasoning, we represent our initial PoG estimate \(\tilde{\mathbf {s}}\) as a confidence map. More specifically, we use an isotropic 2D Gaussian function centered at the estimated gaze position on the screen. The inputs to the FCN are concatenated channel-wise.
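
As an illustration, the following sketch builds such a Gaussian confidence map from an initial PoG estimate and concatenates it channel-wise with a downscaled screen frame; the map resolution and Gaussian width used here are illustrative assumptions rather than the authors' exact settings.

```python
import torch

def gaze_confidence_map(pog_px: torch.Tensor, width: int = 128, height: int = 72,
                        screen_px=(1920, 1080), sigma: float = 3.0) -> torch.Tensor:
    """pog_px: (B, 2) initial PoG in screen pixels -> (B, 1, height, width) confidence map."""
    xs = torch.arange(width, dtype=torch.float32, device=pog_px.device).view(1, 1, width)
    ys = torch.arange(height, dtype=torch.float32, device=pog_px.device).view(1, height, 1)
    # Rescale the PoG from screen pixels to map coordinates.
    cx = (pog_px[:, 0] * width / screen_px[0]).view(-1, 1, 1)
    cy = (pog_px[:, 1] * height / screen_px[1]).view(-1, 1, 1)
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2)).unsqueeze(1)

# Channel-wise concatenation with a downscaled screen frame of shape (B, 3, 72, 128):
# fcn_input = torch.cat([screen_frame, gaze_confidence_map(pog_px)], dim=1)
```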

To allow the model to better exploit the temporal information, we use an RNN cell in the bottleneck. Inspired by prior work in video-based saliency estimation, we adopt a convolutional recurrent cell [28] and evaluate RNN [46], LSTM [18], and GRU [10] variants.
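
For reference, a conventional convolutional GRU cell can be sketched as follows; the kernel size and gating layout follow the standard formulation and are not necessarily identical to the cell of [28].

```python
from typing import Optional

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # update/reset gates
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)       # candidate state

    def forward(self, x: torch.Tensor, h: Optional[torch.Tensor]) -> torch.Tensor:
        if h is None:
            h = x.new_zeros(x.shape[0], self.hid_ch, x.shape[2], x.shape[3])
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_cand = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1.0 - z) * h + z * h_cand
```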

The network optionally incorporates concatenative skip connections between the encoder and decoder layers, as these have been shown to be helpful in FCNs. We train the GazeRefineNet using a pixel-wise binary cross-entropy loss on the output heatmap and an MSE loss on the final numerical estimate of the PoG. The latter is calculated in a differentiable manner via a soft-argmax layer [5, 19]. The PoG is converted to centimeters to keep the loss term from exploding (due to its magnitude). Please refer to Fig. 3b for the full architecture diagram, and to our supplementary materials for implementation details.
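
The soft-argmax readout can be sketched as below: a softmax over the heatmap yields a probability map, whose expected coordinates give a differentiable PoG estimate. The temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def soft_argmax(heatmap: torch.Tensor, beta: float = 100.0) -> torch.Tensor:
    """heatmap: (B, 1, H, W) logits -> (B, 2) expected (x, y) in heatmap coordinates."""
    B, _, H, W = heatmap.shape
    probs = F.softmax(heatmap.view(B, -1) * beta, dim=-1).view(B, H, W)
    xs = torch.arange(W, dtype=torch.float32, device=heatmap.device)
    ys = torch.arange(H, dtype=torch.float32, device=heatmap.device)
    x = (probs.sum(dim=1) * xs).sum(dim=-1)  # marginalize over rows, expectation over columns
    y = (probs.sum(dim=2) * ys).sum(dim=-1)  # marginalize over columns, expectation over rows
    return torch.stack([x, y], dim=-1)
```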

Offset Augmentation. In the task of cross-person gaze estimation, it is common to observe high discrepancies between the training and validation objectives. This is not necessarily due to overfitting or non-ideal hyperparameter selections but rather due to the inherent nature of the problem. Specifically, every human has a person-specific offset between their optical and visual axes in each eye, often denoted by a so-called Kappa parameter. While the optical axis can be observed by the appearance of the iris, the visual axis cannot be observed at all as it is defined by the position of the fovea at the back of the eyeball.

During training, this offset is absorbed into the neural network's parameters, limiting generalization to unseen people. Hence, prior works typically incur a large error increase in cross-person evaluations (\(\sim 5^\circ \)) in comparison to person-specific evaluations (\(\sim 3^\circ \)). Our insight is that we are now posing a gaze refinement problem, where an initially incorrect assessment of the offset could actually be corrected by additional signals such as the screen content. This is in contrast with the conventional setting, where no such corrective signal is available. Therefore, the network should be able to learn to overcome this offset when provided with randomly sampled offsets applied to a given person's gaze.

This randomization approach can intuitively be understood as learning to undo all possible inter-personal differences, rather than learning the corrective parameters for a specific user, as would be the case in traditional supervised personalization (e.g., [37]). We dub our training data augmentation approach “offset augmentation”, and provide further details of its implementation in our supplementary materials.
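
To make the idea concrete, the sketch below perturbs the initial PoG fed to GazeRefineNet by a random, per-sequence angular offset while leaving the ground truth untouched; the maximum offset, the assumed eye-to-screen distance, and the small-angle conversion to screen units are all illustrative assumptions rather than the authors' settings.

```python
import math
import random

def sample_offset_px(max_offset_deg: float = 3.0,
                     eye_to_screen_mm: float = 600.0,
                     mm_per_px: float = 553.0 / 1920.0):
    """Sample a 2D screen-space offset (in pixels) corresponding to a random
    angular perturbation of at most max_offset_deg, applied consistently to
    every frame of a training sequence."""
    direction = random.uniform(0.0, 2.0 * math.pi)
    magnitude_mm = math.tan(math.radians(random.uniform(0.0, max_offset_deg))) * eye_to_screen_mm
    magnitude_px = magnitude_mm / mm_per_px
    return magnitude_px * math.cos(direction), magnitude_px * math.sin(direction)

# During training (per sequence): dx, dy = sample_offset_px()
# The perturbed initial PoG (pog + [dx, dy]) is fed to GazeRefineNet,
# while the supervision target remains the unmodified ground-truth PoG.
```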

5 Results

In this section, we evaluate the variants of EyeNet and find that temporal modelling can aid in gaze estimation. Based on a pre-trained \({\text {EyeNet}}_{\text {GRU}}\), we then evaluate the effects of our contributions in refining an initial estimate of PoG using variants of GazeRefineNet. We demonstrate large and consistent performance improvements even across camera views and visual stimulus types.

Table 2. Cross-person gaze estimation and pupil size errors of EyeNet variants, evaluated on the test set of EVE. The GRU variant performs best in terms of both gaze and pupil size estimates
Table 3. An ablation study of our contributions in GazeRefineNet, where a frozen and pre-trained \({\text {EyeNet}}_{\text {GRU}}\) is used for initial gaze predictions. Temporal modelling and our novel offset augmentation both yield large gains in performance.

5.1 Eye Gaze Estimation

We first consider the task of eye gaze estimation purely from a single eye image patch. Table 2 shows the performance of the static \({\text {EyeNet}}_{\text {static}}\) and its temporal variants (\({\text {EyeNet}}_{\text {RNN}}\), \({\text {EyeNet}}_{\text {LSTM}}\), \({\text {EyeNet}}_{\text {GRU}}\)) in predicting gaze direction, PoG, and pupil size. The networks are trained on the training split of EVE. Generally, we find our gaze direction error values to be in line with prior works in estimating gaze from single eye images [53], and see that the addition of recurrent cells improves gaze estimation performance modestly. This makes a case for training gaze estimators on temporal data using temporally-aware models, and corroborates observations from a prior learning-based gaze estimation approach on natural eye movements [47].

Pupil size errors are presented in terms of mean absolute error. Considering that pupil sizes in our dataset vary from 2 mm to 4 mm, the presented errors of 0.3 mm should allow for meaningful insights to be made in fields such as the cognitive sciences. We select the GRU variant (\({\text {EyeNet}}_{\text {GRU}}\)) for the next steps, as it shows consistently good performance for both eyes.

Table 4. Improvement in PoG prediction (in px) of our method in comparison with two saliency-based alignment methods, as evaluated on the EVE dataset.
Table 5. Final gaze direction errors (in degrees, lower is better) from the output of \({\text {GazeRefineNet}}_{\text {GRU}}\), evaluated on the EVE test set in cross-stimuli settings. Indicated improvements are with respect to initial PoG predictions (mean of left+right) from \({\text {EyeNet}}_{\text {GRU}}\) trained on specified source stimuli types.

5.2 Screen Content Based Refinement of PoG

GazeRefineNet consists of a fully-convolutional architecture which takes a screen content frame as input, combined with an offset augmentation procedure applied at training time. Our baseline performance for this experiment differs from Table 2, as gaze errors improve when averaging the PoG from the left and right eyes, with corresponding adjustments to the label (averaged in screen space). Even with this new competitive baseline from PoG averaging, we find in Table 3 that each of our additional contributions yields large performance improvements, amounting to a 28% improvement in gaze direction error, reducing it to \(2.49^\circ \). While not directly comparable due to differences in setting, this value is lower even than recently reported performances of supervised few-shot adaptation approaches on in-the-wild datasets [29, 37]. Specifically, we find that the offset augmentation procedure yields the greatest performance improvements, with temporal modeling further improving performance. Skip connections between the encoder and decoder do not necessarily help (except in the case of \({\text {GazeRefineNet}}_{\text {RNN}}\)), presumably because the output relies mostly on information processed at the bottleneck. We present additional experiments with GazeRefineNet in the following paragraphs, and describe their setup details in our supplementary materials.

Comparison to Saliency-Based Methods. In order to assess how our GazeRefineNet approach compares with existing saliency-based methods, we implement two up-to-date methods loosely based on [1] and [48]. First, we use the state-of-the-art UNISAL approach [12] to attain high-quality visual saliency predictions. We accumulate these predictions over time for the full exposure duration of each visual stimulus in EVE (up to 2 min), which should provide the best context for alignment (as opposed to our online approach, which is limited to 3 seconds of history). Standard backpropagation is then used to optimize for either scale and bias in screen space (similar to [1]) or the visual-optical axis offset, kappa (similar to [48]), using a KL-divergence objective between accumulated visual saliency predictions and accumulated heatmaps of refined gaze estimates in screen space. Table 4 shows that while both saliency-based baselines perform respectably on the well-studied image and video stimuli, they fail completely on Wikipedia stimuli, despite the fact that the saliency estimation model was provided with full 1080p frames (as opposed to the \(128\times 72\) input used by \({\text {GazeRefineNet}}_{\text {GRU}}\)). Furthermore, our direct approach takes raw screen pixels and gaze estimates up to the current time-step as explicit conditions, and thus is a simpler yet explicit solution for live gaze refinement that can be learned end-to-end. Both the training of our approach and its large-scale evaluation are made possible by EVE, which should allow for insightful comparisons in the future.
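
A sketch of the scale-and-bias variant of this alignment baseline is given below: accumulated gaze estimates are re-rendered as a heatmap under a per-axis scale and bias, which are optimized against the accumulated saliency map with a KL-divergence objective. The learning rate, iteration count, and Gaussian rendering are assumptions, not the exact setup of [1].

```python
import torch
import torch.nn.functional as F

def fit_scale_and_bias(pog: torch.Tensor, saliency_map: torch.Tensor,
                       steps: int = 500, lr: float = 1e-2, sigma: float = 2.0):
    """pog: (T, 2) accumulated gaze estimates in (downscaled) map coordinates.
    saliency_map: (H, W) accumulated saliency predictions.
    Returns the fitted per-axis scale and bias."""
    H, W = saliency_map.shape
    scale = torch.ones(2, requires_grad=True)
    bias = torch.zeros(2, requires_grad=True)
    target = (saliency_map / saliency_map.sum()).view(-1)
    xs = torch.arange(W, dtype=torch.float32).view(1, 1, W)
    ys = torch.arange(H, dtype=torch.float32).view(1, H, 1)
    opt = torch.optim.Adam([scale, bias], lr=lr)
    for _ in range(steps):
        p = pog * scale + bias  # apply current scale/bias to every gaze point
        d2 = (xs - p[:, 0].view(-1, 1, 1)) ** 2 + (ys - p[:, 1].view(-1, 1, 1)) ** 2
        heat = torch.exp(-d2 / (2.0 * sigma ** 2)).sum(dim=0)  # rendered gaze heatmap
        pred = (heat / heat.sum()).clamp_min(1e-8).view(-1)
        loss = F.kl_div(pred.log(), target, reduction='sum')   # align with saliency
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scale.detach(), bias.detach()
```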

Fig. 4. Qualitative results of our gaze refinement method on our test set, where PoG over time is colored from blue-to-red (old-to-new). It can be seen that GazeRefineNet corrects offsets between the initial prediction and the ground-truth.

Cross-Stimuli Evaluation. We study whether our method generalizes to novel stimulus types, as this has previously been raised as an issue for saliency-based gaze alignment methods (such as in [42]). In Table 5, we confirm that training and testing on the same stimulus type indeed yields the greatest improvements in gaze direction estimation (shown in the diagonal of the table). We find in general that large improvements can be observed even when training solely on video or Wikipedia stimuli. We assume this is due to the presence of text in our video stimuli and of small amounts of imagery in the Wikipedia stimuli. In contrast, we can see that training a model on static images only does not lead to good generalization to the other stimulus types.

Qualitative Results. We visualize our results qualitatively in Fig. 4. Specifically, we can see that when provided with initial estimates of PoG over time from \({\text {EyeNet}}_{\text {GRU}}\) (far-left column), our \({\text {GazeRefineNet}}_{\text {GRU}}\) can nicely recover person-specific offsets at test time to yield improved estimates of PoG (center column). When viewed in comparison with the ground-truth (far-right column), the success of \({\text {GazeRefineNet}}_{\text {GRU}}\) in these example cases is clear. In addition, note that the final operation is not one of pure offset-correction, but that the gaze signal is more aligned with the visual layout of the screen content post-refinement.

6 Conclusion

In this paper, we introduced several effective steps towards increasing screen-based eye-tracking performance even in the absence of labeled samples or eye-tracker calibration from the final target user. Specifically, we identified that eye movements and changes in the visual stimulus have a complex interplay which previous literature has considered in a disconnected manner. Subsequently, we proposed a novel dataset (EVE) for evaluating temporal gaze estimation models and for enabling a novel online PoG-refinement task based on raw screen content. Our GazeRefineNet architecture performs this task effectively and demonstrates large performance improvements of up to \(28\%\). The final reported angular gaze error of \(2.49^\circ \) is achieved without labeled samples from the test set.

The EVE dataset is made publicly available, with a public web server implemented for consistent test metric calculations. We provide the dataset and accompanying training and evaluation code in the hope of further progress in the field of remote webcam-based gaze estimation. Comprehensive additional information regarding the capture, pre-processing, and characteristics of the dataset is available in our supplementary materials.
