5.1 Fixation Directions and Distances
Figure
5 shows the distributions of gaze directions relative to the head for the four video games. The distributions are quite similar from one game to the next. They are narrow and nearly isotropic: few fixations deviated more than 5
\(^{\circ }\) from straight ahead. The narrow distribution of fixation directions in HMDs has been reported by others who have hypothesized, as we do, that people tend to make small eye movements and large head movements due to the restricted field of view in HMDs compared to natural viewing [Sidenmark and Gellersen
2019; Sitzmann et al.
2018; Pfeil et al.
2018] (see Section
8.1). Additionally, the Vive Pro Eye HMD uses Fresnel lenses, characterized by an unsmooth grooved surface. Such lenses yield poorer optical quality in the periphery than in the center of the display. Thus, to maximize image quality near the fovea, participants may have turned the head rather than the eyes to avoid fixating regions of poor quality.
The fact that fixation directions are concentrated near straight ahead in the VR-gaming environment is useful information for foveated rendering applied to video games [Guenter et al.
2012; Patney et al.
2016; Albert et al.
2017]. Specifically, one might achieve greater compute-time savings than foveated rendering coupled with eye tracking by forgoing eye tracking altogether and simply expanding the sharply rendered region to cover the great majority of fixation directions: roughly the central 10
\(^{\circ }\) (diameter).
Figure
6 shows the distributions of fixation distances for the four games. There are many distant fixations in all but the
Environmental game. The modes of the distributions in the
Rhythm,
First-Person Shooter, and
Action-Rhythm games are close to 0 diopters (D), which corresponds to distant gaze for which the eyes’ visual axes are parallel or nearly so. We examine the consequences of the tendency to fixate far in Section
5.2.
When a person looks at a near object off to the left or right, the object is closer to one eye than the other, creating a larger retinal image in the closer eye. When the object is also displaced upward or downward, the person must make a vertical vergence movement to fixate the object accurately [Schor et al.
1994] and this can produce discomfort [Kane et al.
2012] (Section
4.6). Figures
5 and
6 show that this combination of near gaze in an oblique direction is quite rare in the VR-gaming environment. Thus, the vertical disparities experienced in that environment are generally quite small and probably not problematic.
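The geometry underlying this vertical vergence demand can be sketched numerically. The snippet below is a simplified illustration, not the paper's analysis: the eye positions, interocular distance, and per-eye (Fick-style) elevation angles are all assumptions, and eye torsion is ignored.

```python
import math

def vertical_vergence_demand(x, y, z, ipd=6.2):
    """Vertical vergence demand (deg) for a target at (x, y, z) in cm.

    Eyes sit at (+/- ipd/2, 0, 0); x is rightward, y upward, z forward.
    Elevation is measured per eye as the angle of the target above that
    eye's horizontal plane. The demand is the difference between eyes.
    Simplified geometric sketch; ignores torsion and eye rotation centers.
    """
    def elevation(eye_x):
        return math.degrees(math.atan2(y, math.hypot(x - eye_x, z)))
    return elevation(-ipd / 2) - elevation(ipd / 2)

# Near target up and to the left: noticeable vertical vergence demand.
near = vertical_vergence_demand(x=-15.0, y=10.0, z=30.0)
# Same visual direction but ten times farther: the demand nearly vanishes.
far = vertical_vergence_demand(x=-150.0, y=100.0, z=300.0)
print(round(near, 2), round(far, 2))
```

The demand grows as the target becomes nearer and more oblique, which is why the rarity of near, oblique fixations in the VR data implies small vertical disparities.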
Our main purpose in examining fixations in the VR environment is to determine how they compare to natural fixation behavior. Figure
7 enables the comparison by plotting both the VR data and data from natural viewing in the real world. The natural data were obtained from the BORIS dataset (
https://github.com/Berkeley-BORIS) using methods described in Sprague et al. [
2015] and Gibaldi and Banks [
2019]. Those data are the weighted average across six everyday tasks and four subjects. The VR data are the average across the four games and 10 subjects.
The upper panels of Figure
7 plot the distributions of fixation directions from these averages. In the natural environment, the direction of gaze is most commonly straight ahead and slightly down relative to primary position. Secondary directions—leftward, rightward, upward, and downward—are the next most common [Sprague et al.
2015; Gibaldi and Banks
2019; Kothari et al.
2020; Tatler and Vincent
2008]. There are few gaze directions more than 15
\(^{\circ }\) from straight ahead because when people attempt to look at more eccentric points they usually execute a combined eye and head rotation [Barnes
1979; Guitton and Volle
1987; Pfeil et al.
2018]. The distribution of fixation directions in the VR environment is much narrower and more isotropic. The great majority of fixations are within 5
\(^{\circ }\) of straight ahead.
The lower panels of Figure
7 plot the distributions of fixation distances averaged across games and tasks. In the natural environment, we observe a broad distribution of distances with a median value of
\({\sim }70\) cm (1.5D); that distance is indicated by the solid red line. Of course, the distance of gaze varies significantly from one everyday task to another (Supplementary Figure S2). When walking outdoors, the most common distance is
\({\sim }500\) cm (0.2D). When making a sandwich, the most likely distance is
\({\sim }62\) cm (1.6D). The distribution of distances in the VR environment is shifted farther than in the natural environment. The median VR value is
\({\sim }125\) cm (0.8D), which is indicated by the solid red line. The distances vary from one game to another (Figure
6), but are generally farther than in the natural environment. We consider the significance of this tendency to fixate far in Section
5.2.
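The distance-to-diopter conversions used throughout this section follow from D = 1/(distance in meters); a minimal check of the values cited above:

```python
def diopters(distance_cm):
    """Convert a viewing distance in cm to diopters (D = 1 / meters)."""
    return 100.0 / distance_cm

def to_cm(d):
    """Convert diopters back to a distance in cm."""
    return 100.0 / d

# Distances discussed in the text:
print(round(diopters(500), 2))   # walking outdoors: 0.2 D
print(round(diopters(62), 2))    # making a sandwich: ~1.6 D
print(round(diopters(125), 2))   # VR median: 0.8 D
```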
5.2 Screen Distance and VA Conflict
Vergence and accommodation are negative-feedback control systems [Fincham and Walton
1957; Cumming and Judge
1986; Schor
1992]. The vergence part takes disparity as input and generates converging or diverging eye movements to null the disparity at the fovea. The accommodation part takes retinal blur as input and adjusts focus to minimize the blur. The vergence and accommodation parts of the control system work to drive their respective outputs to the same distance in the environment, so it makes sense that they communicate with one another through neural cross-links. Because of the cross-links, the act of converging or diverging causes the eye lens to change power (vergence accommodation) and the act of accommodating nearer or farther causes vergence movements (accommodative vergence). The cross-coupling increases speed and accuracy in the natural environment [Cumming and Judge
1986].
The cross-coupling is, however, counter-productive for viewing stereoscopic displays such as HMDs. In such displays, vergence must be to the distance of the virtual object of interest for a single, fused image to be seen. But the light comes from the display screen so accommodation must be to the screen distance for a sharp image to be seen. Thus, the distances for appropriate vergence and appropriate accommodation are often quite different. The difference is the
vergence-accommodation conflict. When the conflict is non-zero, the visual system must work against the cross-coupling to fuse and sharpen the images. Larger conflicts cause greater deficits in perceptual performance and greater discomfort [Akeley et al.
2004; Watt et al.
2005; Hoffman et al.
2008; Shibata et al.
2011; Mauderer et al.
2014; Koulieris et al.
2017].
Current best practices in content development for HMDs recommend presenting virtual content at a distance similar to the optical distance of the screen in order to minimize discomfort due to the vergence-accommodation conflict [Oculus VR
2017]. We used our measurements of content and fixation statistics during game play to determine the distribution of the vergence-accommodation conflicts. Specifically, we used the distribution of fixation distances to determine how frequently those vergence distances would be nearer or farther than the optical distance of the screen by
\(\pm\)0.5D, thereby creating a conflict large enough to cause discomfort [Shibata et al.
2011]. Figure
8 shows the results. The left panel shows the percentage of fixations at various distances, averaged across the games and subjects; it is similar to the lower left panel of Figure
7. The median fixation distance is represented by the vertical red line. The right panel shows the percentage of fixations that are associated with conflicts greater than
\(\pm\)0.5D, as a function of screen distance. The screen distance in the Vive Pro Eye is indicated by the vertical blue line. The dashed green line represents the screen distance that would minimize conflicts. Obviously, it is much farther than the actual distance to the screen. Thus, discomfort due to vergence-accommodation conflicts would be reduced by nearly tripling the screen distance to 196 cm (0.51D). (The screen distances of other commercial devices (e.g., Oculus DK1, DK2, and CV1; HoloLens 1 and 2) are greater, but in most cases still not far enough to minimize conflict). Of course, the degree of mismatch will depend strongly on the specific demands of the virtual environment and task. Designers of HMDs and video games can use our data to better match screen and fixation distance to improve viewer comfort and performance [Koulieris et al.
2017].
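The screen-distance analysis just described can be sketched as follows. This is an illustrative reimplementation under assumed inputs: the fixation-distance sample below is synthetic, not the measured distribution, and only the \(\pm\)0.5D criterion is taken from the text.

```python
import numpy as np

def conflict_fraction(fix_d, screen_d, tol=0.5):
    """Fraction of fixations whose vergence distance (diopters) differs
    from the screen's optical distance by more than +/- tol diopters."""
    return float(np.mean(np.abs(fix_d - screen_d) > tol))

rng = np.random.default_rng(0)
# Hypothetical fixation-distance sample in diopters, skewed toward far
# gaze; stands in for the measured distribution, which it does not match.
fix_d = np.abs(rng.normal(0.7, 0.5, 10_000))

# Sweep candidate screen distances and pick the one minimizing conflicts,
# as in the right panel of Figure 8.
screens = np.linspace(0.1, 2.0, 96)
fractions = [conflict_fraction(fix_d, s) for s in screens]
best = screens[int(np.argmin(fractions))]
print(f"best screen distance: {best:.2f} D ({100 / best:.0f} cm)")
```

With the measured fixation distribution in place of the synthetic sample, the minimum of this sweep corresponds to the dashed green line in Figure 8.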
5.3 Disparity Statistics
Figure
9 shows the median horizontal disparities at the retina for the four video games. As noted earlier (Section
4.6), the disparities are expressed in Helmholtz retinal coordinates. To determine disparities in those coordinates, we needed to know both the 3D scene geometry and where participants fixated in those scenes. The individual panels plot median disparity for each position in the visual field. Negative values (blue) correspond to uncrossed disparities (farther than fixation) and positive values (yellow) to crossed (nearer than fixation). In each panel, the fovea is in the center and the upper and left visual fields are at the top and left, respectively. The distributions vary across the four games. The
Environmental,
First-Person Shooter, and
Action-Rhythm games generate a relatively small range of disparity with a trend from crossed in the lower field to uncrossed in the upper. The
Rhythm game produced a much larger range with large uncrossed disparities a few degrees from the fixation point and no trend from crossed to uncrossed from the lower to the upper field. From these data it is clear, unsurprisingly, that the distribution of disparities across the visual field depends on the game being played.
Our main purpose in measuring the disparities encountered in the VR environment is to determine how they compare to the disparities experienced in the natural environment. Figure
10 enables the comparison by plotting both the VR data and data from natural viewing in the real world. As stated earlier, the natural data were obtained from the BORIS dataset using methods described in Sprague et al. [
2015] and Gibaldi and Banks [
2019]. Those data are the weighted average across six everyday tasks and four subjects. The VR data are the average across the four games and 10 subjects. The right panels reveal clear regularities in naturally occurring disparities. The upper right panel shows median horizontal disparities across the visual field. There is a striking change from the lower to the upper field. The median disparity in the lower field is positive (crossed) while the median disparity in the upper field is negative (uncrossed). These are large tendencies. For example, 10
\(^\circ\) above fixation, 70% of disparities are negative. The top-back pitch of the data is highlighted in the lower right panel, which shows the median and range of disparity from the lower to the upper field. Thus, given where people tend to fixate, the natural environment creates a pattern of disparities that is slanted top back. The natural data also exhibit a systematic change from the left to the right field. Median disparity changes from negative (uncrossed) on the left to zero near the fovea to negative again on the right.
For humans to perceive depth from disparity, the visual system must determine which points in the left-eye’s image correspond to points in the right-eye’s image. The visual system utilizes the environmental regularities mentioned earlier to solve this binocular correspondence problem. Specifically, the search for disparity in a given location in the visual field is centered on corresponding retinal points. The definition of corresponding points is the following. For every retinal location in one eye there is a location in the other eye that forms a pairing with special status in binocular vision. These pairs are corresponding retinal points. Rays projected from those corresponding-point pairs intersect in the world on a surface called the
binocular horopter [Ogle
1950; von Helmholtz
2013]. The horopter is pitched top back [Nakayama
1977; Siderov et al.
1999; Cooper et al.
2011]. So, for objects above current fixation to fall on the horopter they must be farther than fixation, while objects below fixation must be nearer. The horopter is also farther on the left and right (relative to the zero-disparity surface) than in the center.
Why is the horopter important? Binocular vision is best for objects on or near the horopter: fusion is guaranteed and depth discrimination is most precise [Brewster
1844; Prince and Eagle
2000; Vlaskamp et al.
2013; Blakemore
1970; Schumer and Julesz
1984; Fischer
1924; Ogle
1950]. Importantly, the shape of the horopter is quite similar to the central tendency of the natural-disparity statistics (Figure
10). Therefore, fusion and accurate stereopsis are guaranteed for the most likely natural scenes.
The disparity statistics are also relevant to oculomotor behavior. When people make upward saccadic eye movements to a stimulus whose distance is ambiguous, their eyes diverge and when they make downward saccades their eyes converge [Zee et al.
1992; Gibaldi and Banks
2019; Enright
1984; Collewijn et al.
1988]. These vergence biases are consistent with natural-disparity statistics. Consequently, the biases ensure that when the eyes land at the end of a saccade in the real world they will be fixating the most likely distance of the new target. This speeds up visual processing because it minimizes the likelihood of having to make another vergence movement to accurately fixate the new target.
For these reasons, it is very important that the horopter and oculomotor biases are compatible with the statistics of the natural environment. Otherwise, these biases would be counter-productive.
Now consider the disparities in the VR-gaming environment. The upper left panel of Figure
10 shows median disparities in retinal coordinates across the visual field in that environment. The median disparities are qualitatively similar to those from the natural environment. The VR statistics exhibit a bottom-to-top change from positive to negative disparity (near to far) and a left-to-right change from negative to zero and back to negative. But these changes are smaller and less systematic in the VR environment than in the natural. We highlight this in Figure
11, which plots the difference between the median disparities (natural–VR) for each position in the visual field. There is a prominent difference in the lower field where disparity is decidedly more positive in the natural than in the VR environment. Unlike the natural-environment data, the bottom-to-top change in the VR data is not large enough to match the horopter’s pitch. And the left-right change is not large enough to match the horopter’s horizontal curvature. We hypothesize that solving the binocular correspondence problem, obtaining fusion, achieving precise stereo vision, and making accurate vergence during saccadic eye movements are compromised in the VR-gaming environment.
We next examined the variability of disparity in the two environments (Figure
12). In the natural environment (right panel), the standard deviation increases roughly in proportion to eccentricity from a value near 0 arcmin at the fovea to 60–80 arcmin at an eccentricity of 10
\(^{\circ }\). This systematic change in disparity variation is reflected in the functional structure of the binocular visual system. The range of disparities that produce a fused image (i.e., not a double image) grows in proportion to retinal eccentricity [Ogle
1950; Hampton and Kertesz
1983]. The standard deviation in the VR environment increases more with eccentricity than in the natural environment, particularly in the left and right visual fields. We explored an implication of this finding by calculating from the disparity statistics the probability of experiencing double vision across the visual field. To do this, we modeled Panum’s fusion area (the range of fusable disparities) using data from previous psychophysical experiments [Ames et al.
1932; Ogle
1950]. We then collated data on the shape of the horopter [Cooper et al.
2011; Grove et al.
2001; Schreiber et al.
2008; Nakayama
1977; Gibaldi and Banks
2019]. We centered the range of fusable disparities on the horopter. We then created a smooth 3D surface that best fit the horopter data:
where
X and
Y are Helmholtz azimuths and elevations in degrees, and
\(D_H\) is the horizontal disparity of the surface, also in degrees. We used a similar method to model Panum’s fusion area [Ames et al.
1932; Hampton and Kertesz
1983; Ogle
1950]. The equation providing the best fit is
where
\(\epsilon\) is eccentricity of the visual direction in degrees:
\(\epsilon = \sqrt {X^2 + Y^2}\). We then calculated for each field position the proportion of observed disparities that would fall outside of the fusable range. The results for the VR-gaming and natural environments are plotted in the left and right panels of Figure
13, respectively. Clearly, the proportion of disparities that could produce double vision is greater in the VR environment, particularly in the left and right fields.
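The diplopia calculation can be sketched as follows. The two model functions below are placeholder assumptions chosen only to have the qualitative properties described in the text (a top-back-pitched horopter; a fusion range that grows with eccentricity); they are not the fitted equations, whose forms and coefficients are given above.

```python
import numpy as np

def horopter_disparity(x_deg, y_deg):
    """Placeholder horopter: uncrossed (negative) disparity above fixation,
    crossed (positive) below, in arcmin. Illustrative linear pitch only."""
    return -2.0 * y_deg  # y_deg > 0 is the upper visual field

def panum_halfwidth(x_deg, y_deg):
    """Placeholder Panum's fusion half-range, growing with eccentricity
    (arcmin). Coefficients are illustrative, not the psychophysical fits."""
    ecc = np.hypot(x_deg, y_deg)
    return 10.0 + 3.0 * ecc

def diplopia_probability(disparities, x_deg, y_deg):
    """Proportion of observed disparities (arcmin) at one field position
    falling outside the fusable range centered on the horopter."""
    center = horopter_disparity(x_deg, y_deg)
    half = panum_halfwidth(x_deg, y_deg)
    return float(np.mean(np.abs(disparities - center) > half))

rng = np.random.default_rng(1)
# Synthetic disparity sample at 10 deg in the right visual field; the wide
# spread mimics the growth of disparity variability with eccentricity.
sample = rng.normal(0.0, 40.0, 5_000)
p = diplopia_probability(sample, 10.0, 0.0)
print(round(p, 3))
```

Repeating this calculation at each field position, with the measured disparity samples and the fitted models, yields maps like those in Figure 13.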
We also observe that the spread of horizontal disparity in the natural environment is much greater than the spread of vertical disparity. Specifically, the aspect ratio of the joint distribution of horizontal and vertical disparity is
\(\sim\)20:1. This statistical property is manifest in the binocular visual system. For example, cortical neurons in primates have much more variation in their preferred horizontal disparity than in their preferred vertical disparity [Cumming
2002; Durand et al.
2007]. Furthermore, when presented with stereoscopic stimuli in which the direction of disparity (e.g., horizontal, vertical, or oblique) is ambiguous, humans exhibit a strong bias to assume that the direction is horizontal [Van Ee and Schor
2000; Rambold and Miles
2008]. The spread of horizontal disparity relative to that of vertical disparity in the VR-gaming environment is
\(\sim\)16:1, which is quite similar to the natural ratio. Thus, this aspect of disparity in the virtual environment is consistent with natural statistics.
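The aspect-ratio statistic can be computed directly from a joint sample of disparities as the ratio of standard deviations along the two axes. The sample below is synthetic, with spreads chosen only to mimic the natural \(\sim\)20:1 ratio; it is not the measured data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical joint sample of horizontal and vertical disparities (arcmin);
# the 40:2 spread ratio is an assumption mimicking the natural statistics.
horizontal = rng.normal(0.0, 40.0, 20_000)
vertical = rng.normal(0.0, 2.0, 20_000)

# Aspect ratio of the joint distribution: horizontal spread / vertical spread.
aspect_ratio = float(np.std(horizontal) / np.std(vertical))
print(round(aspect_ratio, 1))
```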