Perceptual models and visibility predictors have a wide range of applications in computer graphics and related fields. The simplest applications include visualization and evaluation of algorithms for processing and creating visual content, e.g., rendering and compression. More advanced techniques leverage perceptual models while optimizing visual content. Thanks to its efficiency, our model can be used in both scenarios.
Implementation and visual evaluation. Our model (Section 5) can be used to visualize the visibility of temporal changes in a video, given the gaze location from an eye tracker. Since the model operates on \(71 \times 71 \times 25\) spatio-temporal patches, we divide the video into non-overlapping patches of this size to obtain a prediction for the whole video. For each patch, the prediction is computed and presented as a heatmap visualizing the probability of detecting the temporal changes at each spatio-temporal location. We show sample change detection maps for a natural video of an ocean with waves in Figure 8. The computed probabilities show a declining trend as the distance from the gaze location increases. This trend is mostly attributed to the loss of spatio-temporal sensitivity in the HVS as retinal eccentricity increases (also observable in our model fit in Figure 5). The processing time is 5.5 min for a 5 s, 120 FPS, 4K video (unoptimized parallel implementation using Python 3.6, NumPy 1.19.3, SciPy 1.5.0, and OpenCV-Python 4.5.1.48 on a 3.6 GHz 8-core Intel Core i7-9700K CPU). The largest portion of the computational cost is incurred by the DCT computation. Below, we provide two example use cases of our technique.
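As an illustration, the sketch below shows how such a patch-wise evaluation could be organized. Here `predict_patch`, the pixels-per-degree parameter, and the grayscale video layout are our assumptions standing in for the model of Section 5 and the display/eye-tracker interface; this is not the released implementation.

```python
import numpy as np

PATCH_XY, PATCH_T = 71, 25  # spatial and temporal patch size used by the model

def predict_video(video, gaze_xy, predict_patch, ppd):
    """Tile a grayscale video (T, H, W) into non-overlapping 71x71x25 patches and
    return a coarse spatio-temporal map of detection probabilities.

    `predict_patch(patch, eccentricity_deg)` is a placeholder for the visibility
    model of Section 5; `ppd` is the display's pixels per visual degree."""
    T, H, W = video.shape
    nt, ny, nx = T // PATCH_T, H // PATCH_XY, W // PATCH_XY
    prob = np.zeros((nt, ny, nx))
    for t in range(nt):
        for y in range(ny):
            for x in range(nx):
                patch = video[t * PATCH_T:(t + 1) * PATCH_T,
                              y * PATCH_XY:(y + 1) * PATCH_XY,
                              x * PATCH_XY:(x + 1) * PATCH_XY]
                # retinal eccentricity of the patch centre w.r.t. the gaze position
                cx, cy = (x + 0.5) * PATCH_XY, (y + 0.5) * PATCH_XY
                ecc = np.hypot(cx - gaze_xy[0], cy - gaze_xy[1]) / ppd
                prob[t, y, x] = predict_patch(patch, ecc)
    return prob  # can be upsampled and overlaid on the video as a heatmap
```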
6.1 Imperceptible Transitions
Measuring the visibility of temporal changes is important when the visibility has to be controlled within specific limits. For example, when designing graphical user interfaces for head-up or optical see-through displays, it is usually important to keep critical visual status updates more visible, whereas less critical updates should not interfere with the users’ task performance by grabbing their attention unnecessarily. Similar to visible difference predictors, which are designed to improve perceived quality by keeping image distortions within specific visibility limits, the outputs of our method may be used to improve visual task performance and promote sustained visual attention by adjusting the temporal visibility based on importance.
For this application, we consider the task of introducing new content into an existing scene without distracting the viewer. We pose this as the problem of computing the fastest image transition that remains undetectable when applied to an input image sequence at a given visual eccentricity. When transitioning from a source image to a target image, performing the transition slowly decreases the probability of detecting the visual change, but slow transitions limit how often visual information can be updated in the aforementioned user interfaces or AR/VR headsets. A fast transition completes the visual update in a short time, but it increases the probability of detecting the changes. A naive approach would be to use a constant transition rate chosen so as not to exceed a desired detection probability, but this also requires a model to compute that probability for different transition speeds. Our method performs such visual updates faster because it computes the transition speed between source and target stimuli adaptively, depending on the underlying content. Moreover, it keeps the probability of change detection constant over the course of the transition, making it perceptually stable.
Our method takes as input a source image (\(I_s\)), a target image (\(I_t\)), and a blending function \(\phi (I_s, I_t, \alpha)\), which for \(\alpha \in [0,1]\) provides a continuous transition between the two input images. Additionally, the input includes a user-chosen level of temporal change detection probability (\(p_d\)) and an eccentricity (\(e\)) at which the transition should occur. Based on the input, the method computes \(\lbrace \alpha _i \rbrace _{i=1}^{N}\) such that a viewer detects the sequence of images \(\lbrace I_i = \phi (I_s, I_t, \alpha _i)\rbrace _{i=1}^{N}\), shown at eccentricity \(e\), with probability \(p_d\). The task is accomplished by computing increments \(\Delta \alpha _n = \alpha _{n} - \alpha _{n-1}\) that satisfy the target detection probability, \(P_n(\text{detection} | \Delta \alpha _n) = p_d, \forall n\), at each frame update.
To solve this problem, we use greedy optimization. We start with \(\alpha _0 = 0.0\) and compute the step sizes \(\Delta \alpha _n\) by which we increment \(\alpha _n\) at each frame so that \(P_n(\text{detection} | \Delta \alpha _n) = p_d\). To compute \(\Delta \alpha _n\), we apply our method to non-overlapping temporal windows of 25 video frames generated using the image blending \(\phi\) and solve the following minimization:
\[\Delta \alpha _n = \operatorname*{arg\,min}_{\Delta \alpha} \left| P_n(\text{detection} | \Delta \alpha) - p_d \right|, \qquad (12)\]
where \(P_n(\text{detection} | \Delta \alpha _n)\) is computed using our visibility model. To solve the above optimization problem, we apply Brent’s root-finding algorithm.
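A possible realization of this per-frame step computation with SciPy's Brent solver is sketched below. Here `detection_probability` is a hypothetical callable that blends the source and target images over the 25-frame window via \(\phi\) and evaluates our visibility model at the given eccentricity; it stands in for the model, and the bracketing guards are our additions for robustness.

```python
from scipy.optimize import brentq

def next_alpha(alpha_prev, p_d, detection_probability):
    """Compute alpha_n = alpha_{n-1} + d_alpha such that
    P_n(detection | d_alpha) = p_d for the next 25-frame window.

    `detection_probability(alpha_prev, d_alpha)` is a placeholder: it blends
    the source/target images over the window and evaluates the visibility model."""
    hi = 1.0 - alpha_prev
    if hi <= 0.0:
        return 1.0                               # transition already finished
    f = lambda d: detection_probability(alpha_prev, d) - p_d
    if f(hi) < 0.0:
        return 1.0                               # even jumping to the target stays below p_d
    if f(0.0) >= 0.0:
        return alpha_prev                        # no admissible step; hold the current image
    d_alpha = brentq(f, 0.0, hi, xtol=1e-4)      # Brent's root-finding (Equation (12))
    return alpha_prev + d_alpha
```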
Our model was calibrated using a spatial window size of \(71\times 71\) pixels. To compute the probability for larger image patches, we split them into non-overlapping sub-windows of size \(71\times 71\) and solve the optimization (Equation (12)) for each of them separately. We then apply a max-pooling strategy, which assumes that the visibility of the temporal changes in the bigger window is determined by the sub-window with the most visible changes. Consequently, we set \(\alpha _n\) to the minimum across the sub-windows.
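Under this max-pooling rule, the region-level \(\alpha_n\) is simply the minimum of the per-sub-window solutions. A minimal sketch, reusing the hypothetical `next_alpha` solver from above:

```python
def next_alpha_region(alpha_prev, p_d, subwindow_models):
    """`subwindow_models` holds one detection_probability callable per 71x71
    sub-window of the larger patch. The sub-window with the most visible
    change limits the step, so the minimum alpha is kept for the whole region."""
    return min(next_alpha(alpha_prev, p_d, model) for model in subwindow_models)
```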
Figure 9 demonstrates an example of running our optimization on a pair of cat and dog images for retinal eccentricities \(e \in \lbrace 0, 10, 20, 30 \rbrace\) and detection probabilities \(p_d \in \lbrace 0.1, 0.3, 0.5, 0.7, 0.9 \rbrace\). Here and in our other experiments, we used linear blending, \(\phi (I_s, I_t, \alpha _i) = (1-\alpha _i) I_s + \alpha _i I_t\), as the blending function. The plots at the top of the figure visualize how the step sizes (\(\Delta \alpha _n\)) change depending on eccentricity and target probability. From these plots, we observe that more rapid interpolations between the two images result in a higher probability of temporal change visibility. In addition, the interpolation speed defined by \(\Delta \alpha _n\) is usually not uniform over time; we see slow-downs or speed-ups depending on the image content.
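For completeness, the blending itself is a one-liner; the sketch below assumes the images are float arrays in \([0, 1]\) and that the optimized step sizes are available as an array (the variable names are hypothetical).

```python
import numpy as np

def blend(I_s, I_t, alpha):
    """Linear blending: phi(I_s, I_t, alpha) = (1 - alpha) * I_s + alpha * I_t."""
    return (1.0 - alpha) * I_s + alpha * I_t

# Example: materialize the optimized transition, frame by frame.
# alphas = np.cumsum(step_sizes)          # step_sizes are the optimized d_alpha_n
# frames = [blend(I_s, I_t, a) for a in np.clip(alphas, 0.0, 1.0)]
```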
A sequence of steps (\(\Delta \alpha _n\)) is valid only for one eccentricity value. In practice, the viewer is likely to change their gaze location constantly, and the step sequence has to adapt to the current eccentricity to maintain a constant level of transition visibility. To this end, our method precomputes and stores a set of sequences (\(\Delta \alpha _n\)) for a finite set of eccentricities (Figure 9, top) and, by smoothly interpolating between them, uses the sequence that corresponds to the current eccentricity. This enables dynamic adaptation to the current gaze location. The transition slows down when the viewer’s gaze is closer to the position at which the transition occurs, and conversely it speeds up when the gaze moves away. Please see our supplemental materials to experience the effect.
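One way to realize this gaze-adaptive lookup is to interpolate, at every frame, between the step sequences precomputed for the sampled eccentricities. The sketch below assumes the sequences are stored as rows of a NumPy array; this data layout is our illustration, not necessarily the paper's.

```python
import numpy as np

def adaptive_step(n, current_ecc, ecc_grid, step_table):
    """Interpolate the n-th step size for the current retinal eccentricity.

    `ecc_grid`  : increasing 1D array of eccentricities used for precomputation
    `step_table`: array of shape (len(ecc_grid), N), one d_alpha sequence per row"""
    return float(np.interp(current_ecc, ecc_grid, step_table[:, n]))

# Per displayed frame (hypothetical driver loop):
# alpha += adaptive_step(n, eccentricity_of(patch_center, gaze_sample), ecc_grid, step_table)
```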
To evaluate our technique, we conducted a subjective experiment in which we analyzed how the optimized content impacts the participants’ gaze patterns. More specifically, we were interested in validating the relation between the optimized probability of detection and eye movements towards the changing patterns. In each trial of the experiment, participants were shown a full-screen image containing a grid of cat and dog images (Figure 9). After a brief delay, five random patches started alternating between a cat and a dog image according to previously optimized probabilities. Participants were asked to look at the region of the image that drew their attention due to temporal changes (please see the experiment protocol in Figure 9). Each trial finished as soon as the participant’s gaze reached the position of one of the five changing patches. During a trial, the participant could freely move their gaze, as the method adapted the transitions according to the current gaze location. Twelve participants (ages 21 to 32) took part in the experiment, which was conducted on an Acer X27 display at \(3,840 \times 2,160\) resolution and 120 Hz refresh rate, using a Tobii Pro Spectrum eye tracker to monitor the gaze location. The experiment consisted of 30 trials per participant and took approximately 5 min to complete.
We ran two versions of this experiment. In the first version, we picked the detection probabilities of the 5 pairs of patches uniformly as \(A = \lbrace 0.1, 0.3, 0.5, 0.7, 0.9 \rbrace\) (All probabilities). In the second one, we used a subset of lower probabilities, \(L = \lbrace 0.1, 0.3, 0.5 \rbrace\) (Low probabilities), while keeping the number of simultaneously changing patches in each trial the same (5). Figure 10 contains a histogram of change detection probabilities (\(p_d\)) versus the number of trials in which they attracted the gaze of the participants.
In the experiment with \(p_d \in A\), we observe that the temporal changes with \(p_d = 0.9\) were chosen most frequently by the participants, while this number declines rapidly as \(p_d\) decreases (Figure 11, pink bars). We see a similar trend in the experiment with \(p_d \in L\), where the participants similarly shifted their gaze to the temporal change with the highest probability of detection in the set \(L\) (\(p_d = 0.5\)) (Figure 11, blue bars). These results demonstrate that, indeed, the higher the probability predicted by our method, the more likely the patch is to attract the participant’s gaze.
To further investigate the difference between the experiments with low and high probabilities, Figure 11 provides the time that passed from the start of each trial until the participant’s gaze shifted to one of the patches with temporal changes. The medians of these times differ between the two experiments (\(p \lt 0.001\), Wilcoxon rank-sum test). The average time measured in the experiment with all probabilities is \(\mu _A = 1.8951\) s (\(CI_{95\%}: [1.5677, 2.2793]\)), while the average time for the low probabilities is \(\mu _L = 7.1956\) s (\(CI_{95\%}: [6.4563, 8.2947]\)) (Figure 11). This observation suggests that although the participants shift their gaze to the patch with the highest \(p_d\) shown in both experiments, there is a significant increase in the average response time, possibly due to the higher level of cognitive effort required to detect the temporal change when \(p_d\) is small. We postulate that the shorter time in the experiment with all probabilities results from the fact that there were clearly visible transitions that were immediately noticed by the subjects, whereas in the second experiment the visibility levels were much closer to the threshold, and the participants needed more time to localize the transitions. Consequently, besides showing the effectiveness of our optimization method, this experiment further validates our model for predicting the detection probability of temporal changes in the periphery.
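For reference, the reported comparison could be reproduced from the per-trial response times roughly as below; the variable names and the bootstrap confidence intervals are our assumptions about the analysis rather than code from the paper.

```python
import numpy as np
from scipy.stats import ranksums

def compare_response_times(times_all, times_low, n_boot=10000, seed=0):
    """Wilcoxon rank-sum test between the two experiments plus bootstrap 95% CIs.

    `times_all`, `times_low`: 1D arrays of per-trial response times in seconds."""
    stat, p_value = ranksums(times_all, times_low)
    rng = np.random.default_rng(seed)

    def boot_ci(x):
        means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(n_boot)]
        return np.percentile(means, [2.5, 97.5])

    return {"p": p_value,
            "mean_all": times_all.mean(), "ci_all": boot_ci(times_all),
            "mean_low": times_low.mean(), "ci_low": boot_ci(times_low)}
```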
6.2 Temporal Aliasing in Foveated Rendering
An exciting application of our model is foveated rendering, which reduces the shading rate, resolution, or bit depth to improve rendering times or to compress images and videos with minimal sacrifice of perceived quality [Browder and Chambers 1988; Glenn 1994; Tsumura et al. 1996; Kortum and Geisler 1996; Daly et al. 2001; Guenter et al. 2012]. We focus on foveated rendering with a lower shading rate in the periphery, which may lead to temporal aliasing. Temporal aliasing deteriorates visual quality if not properly treated [Patney et al. 2016]. While some work has already considered modeling the visibility of foveation in static images [Tursun et al. 2019], there is no technique capable of predicting the visibility of the temporal artifacts. Our method is particularly suitable for such applications: applied directly to foveated rendering content, it can already predict visible temporal changes.
In our experiment, we implemented our own foveated rendering testbed in the Unity game engine (HDRP–2020.3.11f1) [2021] with 3 eccentricity regions (i.e., fovea, near periphery, and far periphery), similar to Guenter et al. [2012] (Table 3). Then, we rendered 5 s long videos of the Amazon Bistro [Lumberyard 2017] and Crytek Sponza [McGuire 2017] models with a slow camera motion in the forward direction. In different rendering runs, we applied the following anti-aliasing methods in Unity to the near- and far-peripheral regions:
(1) Fast approximate anti-aliasing (FXAA) [Lottes 2009]
(2) Subpixel morphological anti-aliasing (SMAA) [Jimenez et al. 2012] (quality preset: high)
(3) Temporal anti-aliasing (TAA) [Korein and Badler 1983] (quality preset: high)
In addition to these anti-aliasing methods, we also rendered both models without applying any anti-aliasing (No AA) and computed the probability of temporal change detection from all videos using our method.
To measure the correlation between the probabilities computed by our method and the visibility of temporal artifacts in the output of the anti-aliasing methods, we conducted a 2AFC subjective experiment in which participants compared pairs of the rendered videos. The same group of participants as in the imperceptible-transitions experiment (Section 6.1) took part. The experiment consisted of 12 trials; in each trial, the participants watched a pair of anti-aliasing results from the same scene and chose the one with less flickering (Figure 12). The experiment was conducted on the same 55-inch LG OLED55CX, 120 Hz, 4K display that we used to calibrate our model, due to its large field of view (Section 4). The pairwise comparison results from this subjective experiment were converted into just-objectionable-difference (JOD) quality scores using Thurstonian scaling [Thurstone 1927; Perez-Ortiz and Mantiuk 2017]. The probability maps of temporal change detection from our method are pooled using Minkowski summation with exponent \(\beta = 3\) to obtain a scalar score [Graham et al. 1978; Rohaly et al. 1997; To et al. 2011]. The histogram of the probabilities computed for each anti-aliasing method and a plot of the JOD scores from the subjective experiment versus the pooled probabilities from our method are shown in Figure 13. We observe that FXAA and SMAA scored close to the rendering result with no anti-aliasing, whereas TAA turned out to be significantly superior at suppressing flickering in the periphery according to the subjective experiment. The average probability of temporal change detection computed by our method is also in agreement with the results of the subjective experiment (Pearson \(\rho = -0.903\), \(p = 0.002\), t-test). Upon visual inspection, we also observe that the computed probability maps overall show a higher probability of change detection for No AA, FXAA, and SMAA compared to TAA (please see the time-sliced images at the bottom row of Figure 13).
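The pooling and the correlation analysis could look as follows; the mean-based normalization inside the Minkowski summation and the variable names are our assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def minkowski_pool(prob_map, beta=3.0):
    """Pool a per-location detection-probability map into a scalar score."""
    p = np.asarray(prob_map, dtype=float).ravel()
    return (np.mean(p ** beta)) ** (1.0 / beta)

# Example: correlate pooled model scores with the JOD scores from the
# pairwise-comparison experiment (one entry per anti-aliasing condition).
# scores = [minkowski_pool(m) for m in probability_maps]
# rho, p_value = pearsonr(scores, jod_scores)
```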
A direct application of our method to natural videos would also detect the temporal changes that arise from motion in the scene. Under some circumstances, it may be desirable to evaluate the potential aliasing due to foveation alone. We therefore also show an application that decouples the temporal changes due to scene motion from those due to aliasing. To this end, we warp the subsequent frames using motion flow vectors, effectively removing any motion, before applying our model (Figure 14). As can be observed in the figure, when such compensation is not performed, the visibility of the aliasing is dominated by the motion: the sequences with and without anti-aliasing produce similar visibility maps (top row). When motion compensation is applied, only the effect of aliasing is detected by our method. Consequently, the prediction for the sequence with motion compensation and anti-aliasing does not include visible temporal changes.
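A minimal sketch of such motion compensation with OpenCV is shown below; it assumes a dense backward flow in pixel units (the exact motion-vector convention depends on the rendering engine) and warps the previous frame so that it aligns with the current one before the frame pair is passed to the visibility model.

```python
import cv2
import numpy as np

def motion_compensate(prev_frame, flow):
    """Warp `prev_frame` using a dense backward flow of shape (H, W, 2), given in
    pixels, so that it aligns with the current frame and scene motion is removed
    before the frame pair is evaluated by the visibility model."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_frame, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)
```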