Abstract
Visual scene perception is mediated by a set of cortical regions that respond preferentially to images of scenes, including the occipital place area (OPA) and parahippocampal place area (PPA). However, the differential contribution of OPA and PPA to scene perception remains an open research question. In this study, we take a deep neural network (DNN)-based computational approach to investigate the differences in OPA and PPA function. In a first step we search for a computational model that predicts fMRI responses to scenes in OPA and PPA well. We find that DNNs trained to predict scene components (e.g., wall, ceiling, floor) explain more variance uniquely in OPA and PPA than a DNN trained to predict scene category (e.g., bathroom, kitchen, office). This result is robust across several DNN architectures. On this basis, we then determine whether particular scene components predicted by DNNs differentially account for unique variance in OPA and PPA. We find that the variance in OPA responses uniquely explained by the navigation-related floor component is higher than the variance explained by the wall and ceiling components. In contrast, PPA responses are better explained by the combination of wall and floor, that is, scene components that together contain the structure and texture of the scene. This differential sensitivity to scene components suggests differential functions of OPA and PPA in scene processing. Moreover, our results further highlight the potential of the proposed computational approach as a general tool in the investigation of the neural basis of human scene perception.
Keywords: Occipital Place Area, Parahippocampal Place Area, Deep Neural Networks, Functional Magnetic Resonance Imaging, Navigational Affordance model, Representational Similarity Analysis, Scene parsing
1. Introduction
Visual scene understanding is a fundamental cognitive ability that enables humans to interact with the components and objects present within the scene. Within the blink of an eye [Potter, 1975, Thorpe et al., 1996, Li et al., 2007, Greene and Oliva, 2009] we know what type of scene we are in (e.g., kitchen or outdoors), as well as its spatial layout and the objects it contains.
Research on the neural basis of scene understanding has revealed a set of cortical regions with a preferential response to images of scenes over images of objects. These regions are the parahippocampal place area (PPA) [Epstein and Kanwisher, 1998], occipital place area (OPA) [Dilks et al., 2013, Hasson et al., 2003], and retrosplenial cortex (RSC) [O’Craven and Kanwisher, 2000]. To investigate the distinct function each of these place regions has, subsequent research has begun to tease apart their commonalities and differences in activation profile and representational content [Epstein and Kanwisher, 1998, Hasson et al., 2003, Dilks et al., 2013, Bonner and Epstein, 2017, Silson et al., 2015, O’Craven and Kanwisher, 2000]. However, a complete picture of how scene-selective regions together orchestrate visual scene understanding is still missing.
To gain further insights, a promising, but relatively less explored approach is computational modelling of brain activity. Recently, large advances have been made in modeling activity in visual cortex using deep neural networks (DNNs) trained on object categorization tasks [Krizhevsky et al., 2012] in both human and non-human primates [Yamins et al., 2014, Khaligh-Razavi and Kriegeskorte, 2014, Cichy et al., 2016]. Inspired by this success, researchers have also begun to use DNNs trained on scene categorization to investigate scene-selective cortex [Cichy et al., 2017, Bonner and Epstein, 2018, Groen et al., 2018].
In this process two issues have emerged that need to be addressed. First, while DNNs trained on categorization tasks currently do best in predicting activity in scene-selective cortical regions, they do not account for all explainable variance. One particularly promising direction is the exploration of models trained on tasks other than categorization that might more closely resemble the brain region's function, and thus predict brain activity better [Yamins et al., 2014, Cichy and Kaiser, 2019]. Second, it remains unclear what the nature of the representations in the DNNs is that gives them their predictive power. Thus, additional effort is needed to clarify what these representations are.
To address the above issues, we investigated neural activity in scene-selective cortex using DNNs trained on scene parsing instead of categorization. A scene parsing task requires the DNN to predict the location and category of each scene component in the image. While the scene-categorization task requires only recognizing the scene category, the scene parsing task requires deeper scene understanding involving categorization as well as a grasp of the spatial organization of components and objects within the scene. To support interaction with different objects and navigation within the scene, scene-selective brain regions should also encode the spatial organization of components within the scene. Therefore, we hypothesize that the scene parsing task is closer to the task the brain has to solve, and that a DNN trained on scene parsing will predict brain activity better than a DNN trained on scene categorization.
To evaluate our hypothesis, we compared the power of DNNs trained on scene parsing versus categorization to predict activity in scene-selective cortical regions. For this we used an existing fMRI dataset of brain responses elicited by viewing scene images [Bonner and Epstein, 2017] and applied representational similarity analysis (RSA) to compare brain responses with DNNs. We found that scene parsing DNNs uniquely explain significantly more variance in scene-selective brain responses than scene-classification DNNs.
We next investigated what representations in the DNNs trained on scene parsing gave the model its predictive power. For this we queried the DNN’s representations of different scene components, considering components that were present in all stimulus images: wall, floor, and ceiling. We showed that different scene components predict responses in OPA and PPA differently: floor explained more variance in OPA than wall and ceiling, while wall explained more variance in PPA than floor and ceiling. Importantly, results were consistent across three different DNN architectures, showing the generalizability of our claims across architectures.
In sum, our results reveal differential representational content in the scene-selective regions OPA and PPA, and highlight DNNs trained on scene parsing as a promising model class for modelling human visual cortex with readily interpretable output.
2. Materials and Methods
2.1. fMRI data
We used fMRI data from a previously published study by Bonner and Epstein [2017], where all experimental details can be found, as well as instructions on how to download the data. The fMRI data were collected from 16 participants on a Siemens 3.0T Prisma scanner with a 64-channel head coil. The participants were presented with images of indoor environments. The images were presented for 1.5 s on the screen followed by a 2.5 s interstimulus interval. The images presented in the experiment were from a stimulus set of 50 color images depicting indoor environments. During the fMRI scan, participants were asked to maintain fixation on a cross throughout and to press a button if the image presented to them was a bathroom. The task thus required participants to attend to and categorize each image. Voxel-wise (voxel size = 2 × 2 × 2 mm) responses to each image during each scan run were extracted using a standard linear model.
We here focus on two scene-selective regions of interest (ROIs): PPA and OPA. PPA and OPA were identified from separate functional localizer scans using a contrast of brain responses to scenes larger than to objects and additional anatomical constraints. For both ROIs and all the subjects, each voxel’s responses in a given ROI were z-scored across images in a given run and then averaged across runs. The responses to a particular image were further z-scored across voxels.
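The response normalization described above can be summarized in a few lines. The following is a minimal sketch in Python/NumPy, assuming the responses are stored as an array of shape (runs × images × voxels); it is illustrative only and not the original analysis code.

```python
import numpy as np

def preprocess_roi_responses(responses):
    """Normalize ROI responses as described above (illustrative sketch).

    responses: array of shape (n_runs, n_images, n_voxels) with voxel-wise
    responses to each image in each run. Returns (n_images, n_voxels).
    """
    # z-score each voxel across images within each run
    z = (responses - responses.mean(axis=1, keepdims=True)) / responses.std(axis=1, keepdims=True)
    # average across runs
    avg = z.mean(axis=0)
    # z-score each image's response pattern across voxels
    return (avg - avg.mean(axis=1, keepdims=True)) / avg.std(axis=1, keepdims=True)
```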
2.2. Behavioral data
We used scene-related behavioral data representing navigational affordances, assessed on the same stimulus set as used for recording the fMRI data described above [Bonner and Epstein, 2017]. To measure navigational affordances, a behavioral experiment was conducted in which 11 participants (different from the participants in the fMRI experiment) used a computer mouse to indicate the path they would take to walk through each indoor environment shown in the fMRI study. From these responses, probabilistic maps of paths were created for each image, followed by the construction of histograms of navigational probability in one-degree angular bins radiating from the bottom center of the image. These histograms represent a probabilistic map of potential navigation routes from the viewer's perspective. The resultant histogram is referred to as the Navigational Affordance Model (NAM).
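As an illustration of this construction, the sketch below bins a probabilistic path map into one-degree angular bins measured from the bottom center of the image. It is a simplified stand-in for the original procedure of Bonner and Epstein [2017]; the function name and the assumption that the path map is available as a 2D probability array are ours.

```python
import numpy as np

def navigational_affordance_histogram(path_map, n_bins=180):
    """Bin a probabilistic path map (height x width) into one-degree angular
    bins radiating from the bottom center of the image (illustrative sketch)."""
    h, w = path_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx = xs - (w - 1) / 2.0            # horizontal offset from the bottom center
    dy = (h - 1) - ys                  # vertical distance above the bottom row
    angles = np.degrees(np.arctan2(dy, dx))   # angles span 0-180 degrees
    hist, _ = np.histogram(angles, bins=n_bins, range=(0, 180), weights=path_map)
    return hist
```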
2.3. DNN Models
We selected DNNs optimized on two different scene-related tasks: scene classification and scene parsing. We describe both types of models in detail below.
Scene-classification models
For solving a scene-classification task, a DNN model is optimized to predict the probabilities of the input image belonging to a particular scene category. For comparison with neural and behavioral data, we considered DNNs pre-trained on the scene-classification task on the Places-365 dataset [Zhou et al., 2017]. Places-365 is a large-scale scene-classification dataset consisting of 1.8 million training images from 365 scene categories. We selected multiple scene-classification DNN architectures to investigate whether our results generalize across different architectures. For this purpose, we considered 3 standard architectures: Alexnet [Krizhevsky et al., 2012], Resnet-18 [He et al., 2016], and Resnet-50 [He et al., 2016], and downloaded pre-trained models from: https://github.com/CSAILVision/places365.
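For reference, a pre-trained scene-classification model can be loaded roughly as follows. This is a minimal sketch using PyTorch and torchvision; the checkpoint file name is an assumption based on the naming convention of the CSAILVision/places365 repository and should be checked against the repository's current release.

```python
import torch
import torchvision.models as models

# Load a Places-365 pre-trained ResNet-18 (illustrative sketch; the checkpoint
# name 'resnet18_places365.pth.tar' is assumed from the repository convention).
model = models.resnet18(num_classes=365)             # 365 scene categories
checkpoint = torch.load('resnet18_places365.pth.tar', map_location='cpu')
state_dict = {k.replace('module.', ''): v             # strip DataParallel prefix
              for k, v in checkpoint['state_dict'].items()}
model.load_state_dict(state_dict)
model.eval()
```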
Alexnet consists of 5 convolutional layers (conv1-conv5) followed by 3 fully connected layers (fc6, fc7, and fc8). Both Resnet-18 and Resnet-50 consist of a convolutional layer followed by four residual blocks (block1 - block4) each consisting of several convolutional layers with skip connections leading to a final classification layer (fc). Resnet-18 consists of 18 layers and Resnet-50 consists of 50 layers in total and they differ in the number of layers within each block.
Scene parsing models
We used scene parsing models trained on the ADE20k scene parsing dataset [Zhou et al., 2016]. The ADE20k dataset (publicly available at http://groups.csail.mit.edu/vision/datasets/ADE20K/) is a densely annotated dataset consisting of 25k images of complex everyday scenes with pixel-level annotations of objects (chair, bed, bag, lamp, etc.) and components (wall, floor, sky, ceiling, water, etc.). The images were annotated using the LabelMe interface [Russell et al., 2008] by a single expert human annotator.
For the first set of experiments, in which we compare the predictive power of scene parsing models to scene-classification models in explaining the neuronal responses, we designed scene parsing models such that their encoder architecture is taken from scene-classification models while their decoder architecture is task-specific. The encoder of the scene parsing models consists of the convolutional part (conv1-conv5 of Alexnet, and block1-block4 of Resnet18 and Resnet50) of the scene-classification models. The decoder of the scene parsing models is adapted to the scene parsing task following the architecture proposed by Zhao et al. [2017]. It consists of a Pyramid Pooling module with deep supervision [Zhao et al., 2017] (d1), followed by a layer (d2) that predicts one spatial map per scene component, representing the probability of the presence of that component at each spatial location. The encoder weights of the scene parsing models are initialized with the weights learned on the scene-classification task, and the decoder weights are initialized randomly. The scene parsing DNNs are then trained on the ADE20k training data using a per-pixel cross-entropy loss, which evaluates at each pixel whether the classifier assigns the highest probability to the correct component. This procedure ensures that any gain or drop in explaining neural/behavioral responses can only be due to the additional supervision on the scene parsing task.
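To make the training objective concrete, the sketch below shows a per-pixel cross-entropy loss on decoder outputs, as it could be set up in PyTorch. It is a simplified illustration under our own assumptions (random tensors, 150 ADE20k classes, an `ignore_index` for unlabeled pixels), not the training code used in the study, which follows Zhao et al. [2017].

```python
import torch
import torch.nn as nn

# Per-pixel cross-entropy: the decoder output (d2) has one score map per
# component; the target assigns an integer component label to every pixel.
criterion = nn.CrossEntropyLoss(ignore_index=-1)      # unlabeled pixels ignored

# Example with random data: 4 images, 150 ADE20k classes, 64 x 64 label maps.
logits = torch.randn(4, 150, 64, 64, requires_grad=True)   # (batch, classes, H, W)
target = torch.randint(0, 150, (4, 64, 64))                 # (batch, H, W)

loss = criterion(logits, target)   # averaged over all labeled pixels
loss.backward()
```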
The aforementioned scene parsing DNNs are well suited for a direct comparison with the scene-categorization DNNs, as they have the same encoder architecture and were initialized with weights learned on the scene-categorization task. However, they are not comparable to state-of-the-art models in terms of accuracy on the scene parsing task. Since our aim is to reveal differences in representations of the scene areas in the brain by comparing scene components, for the second set of experiments it is crucial to select components detected with DNNs from the literature that achieve the highest accuracy in scene parsing. For this reason, we selected 3 state-of-the-art models on the scene parsing task, namely Resnet101-PPM [Zhou et al., 2016], UperNet101 [Xiao et al., 2018], and HRNetv2 [Sun et al., 2019]. All 3 state-of-the-art models were trained on the ADE20k dataset. We selected multiple models to investigate whether the results we obtain are consistent across different models. Resnet101-PPM consists of a dilated version of the Resnet101 model (a deeper version of Resnet50 with a total of 101 layers) trained on Imagenet as the encoder and a Pyramid Pooling module with deep supervision [Zhao et al., 2017] as the decoder. Due to the small effective receptive field of their feature maps, scene-parsing DNNs can fail to correctly segment larger objects and components. The Pyramid Pooling module [Zhao et al., 2017] tackles this issue by fusing feature maps with different receptive field sizes, merging high spatial resolution information with low spatial resolution information for better local and global scene understanding. Upernet101 [Xiao et al., 2018] is based on the Feature Pyramid Network of Lin et al. [2017], which uses multi-level feature representations in a top-down architecture to fuse high-level semantic features with mid- and low-level features via lateral connections. Upernet101 also has a Pyramid Pooling module before the top-down architecture to overcome the small receptive-field issue. HRNetv2 [Sun et al., 2019] maintains high-resolution feature representations throughout the architecture and merges information from parallel high- and low-resolution convolutions, thereby overcoming the small receptive-field issue mentioned above. We downloaded all the above-mentioned models from https://github.com/CSAILVision/semantic-segmentation-pytorch.
To reveal performance differences between different models on the scene parsing task, we compared the performance of state-of-the-art models with the scene parsing models used for comparison with scene-classification DNNs (see above, Alexnet, Resnet-18, Resnet-50). For the comparison, we calculated the mean intersection over union (mIoU) score of detecting all components for all the images from the ADE20k validation dataset. The IoU score is calculated by dividing the intersection between a predicted and corresponding ground truth pixel-level segmentation mask by their union. IoU is a standard metric to evaluate the overlap of a predicted and corresponding pixel-level mask of a particular component. Mean IoU is calculated by taking the mean of IoU scores across all images in the validation dataset for all components.
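The IoU and mean IoU computation can be sketched as follows. This is an illustrative NumPy implementation under our own simplifying assumptions (binary masks per component, per-image averaging); the reported scores were computed with the evaluation code accompanying the models.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over union between predicted and ground-truth binary masks
    of a single component."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else np.nan

def mean_iou(pred_labels, gt_labels, component_ids):
    """Mean IoU over components for one image; pred_labels and gt_labels are
    integer label maps of identical shape."""
    scores = [iou(pred_labels == c, gt_labels == c) for c in component_ids]
    return np.nanmean(scores)
```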
As illustrated in Figure 1a, a scene parsing model decomposes an image into its constituent components. This decomposition allows investigating which scene components are more relevant for explaining the representations in scene-selective brain regions. We first identified which scene components are present in all images of the stimulus set used for obtaining the fMRI responses. To achieve this, we passed all 50 images of the fMRI stimulus set through the models and checked which components were present in each image. Since the DNN has been trained on an image dataset that is different from the set of stimuli used for the fMRI data, not all scene components predicted by the DNN appear in the stimulus set. In this particular set, we found that wall, floor, and ceiling were core scene components present in all images.
A scene parsing model outputs a spatial probability map for each component. To scale the spatial probability maps corresponding to different components in the same range, we normalized the spatial probability map for each component independently such that each pixel value lies in the range [0, 255]. We show the extracted normalized scene components corresponding to the wall, floor, and ceiling components for an example stimulus in Figure 1b.
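The normalization of the component maps is a simple min-max rescaling. A minimal sketch, assuming each component's spatial probability map is available as a 2D array:

```python
import numpy as np

def normalize_component_map(prob_map):
    """Rescale a component's spatial probability map to the range [0, 255]
    (illustrative sketch of the normalization described above)."""
    p_min, p_max = prob_map.min(), prob_map.max()
    if p_max == p_min:                        # constant map: avoid division by zero
        return np.zeros_like(prob_map, dtype=float)
    return (prob_map - p_min) / (p_max - p_min) * 255.0
```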
2.4. Representational Similarity Analysis (RSA)
We applied representational similarity analysis (RSA; Kriegeskorte et al., 2008) to compare DNN activations and scene components with neural and behavioral responses. RSA enables relating signals from different source spaces (such as, here, behavior, neural responses, and DNN activations) by abstracting signals from separate source spaces into a common similarity space. For this, in each source space condition-specific responses are compared to each other for dissimilarity (e.g., by calculating Euclidean distances between signals) and the values are aggregated in so-called representational dissimilarity matrices (RDMs), indexed in rows and columns by the conditions compared. RDMs thus summarize the representational geometry of the source space signals. Unlike the source space signals themselves, RDMs from different source spaces are directly comparable to each other for similarity and can thus relate signals from different spaces. We describe the construction of RDMs for the different modalities and the procedure by which they were compared in detail below.
fMRI ROI RDMs
First, for each ROI (OPA, and PPA), individual subject RDMs were constructed using Euclidean distances between the voxel response patterns for all pairwise comparisons of images. Then, subject-averaged RDMs were constructed by calculating the mean across all individual subject RDMs. We downloaded the subject averaged RDMs of OPA and PPA from the link (https://figshare.com/s/5ff0a04c2872e1e1f416) provided in Bonner and Epstein [2018].
Navigational affordance model (NAM) RDMs
NAM RDMs were constructed using Euclidean distances between the navigational affordance histograms for all pairwise comparisons of images. We downloaded the NAM RDM from (https://figshare.com/s/5ff0a04c2872e1e1f416).
DNN RDMs
For all the DNNs we investigated in this work, we constructed the RDM for a particular layer using 1-ρ, where ρ is the Pearson's correlation coefficient, as the distance between layer activations for all pairwise comparisons of images. For scene classification DNN RDMs, we created one RDM for each of the 5 convolutional layers (conv1-conv5) and for the 3 fully connected layers (fc6, fc7, and fc8) for Alexnet, and the last layer of each block (block1-block4) and the final classification layer (fc) of Resnet-18/Resnet-50 to compare with neural/behavioral RDMs. For scene parsing DNN RDMs, we created one RDM for each of the 5 convolutional layers (conv1-conv5) and for the 2 decoder layers (d1 and d2) for Alexnet, and the last layer of each block (block1-block4) and the 2 decoder layers (d1 and d2) of Resnet-18/Resnet-50 to compare with neural/behavioral RDMs.
Scene component RDMs
For each scene component investigated, we constructed an RDM using 1-ρ as the distance between the normalized spatial probability maps of that scene component, based on all pairwise comparisons of images.
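Both the layer RDMs and the component RDMs use the same correlation distance. A minimal sketch, assuming the per-image features (flattened layer activations or flattened normalized component maps) are stacked into one row per image:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def correlation_rdm(features):
    """Build an RDM using 1 - Pearson correlation as the distance.

    features: array of shape (n_images, n_features), one row per image,
    e.g. flattened layer activations or flattened component maps.
    Returns an (n_images, n_images) representational dissimilarity matrix."""
    return squareform(pdist(features, metric='correlation'))
```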
Comparing DNN and scene component RDMs with behavioral and neural RDMs
In this work, we pose two questions: first, whether scene parsing models explain scene-selective neural responses and navigational affordance behavioral responses better than scene-classification models, and second, whether the scene components detected by scene parsing models reveal differences in the representations of scene-selective ROIs.
To investigate the first question, we calculated the Spearman's correlation between the RDMs of different layers of a scene-classification DNN and a particular behavioral/neural RDM, and selected the layer RDM that showed the highest correlation with the behavioral/neural RDM. We used the selected layer RDM as the representative RDM for that architecture. We repeated the same procedure to select the representative RDM of a scene parsing model. As a baseline for comparison, we also considered a randomly initialized model and selected a representative RDM from it.
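The layer-selection step amounts to correlating each layer RDM with the target RDM and keeping the best one. A minimal sketch, assuming RDMs are square matrices and using the lower triangle for the correlation (the function names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

def rdm_correlation(rdm_a, rdm_b):
    """Spearman correlation between the lower triangles of two RDMs."""
    idx = np.tril_indices(rdm_a.shape[0], k=-1)
    return spearmanr(rdm_a[idx], rdm_b[idx]).correlation

def select_representative_rdm(layer_rdms, target_rdm):
    """Pick the layer RDM with the highest Spearman correlation to the
    neural/behavioral target RDM. layer_rdms: dict mapping layer name -> RDM."""
    scores = {name: rdm_correlation(rdm, target_rdm)
              for name, rdm in layer_rdms.items()}
    best = max(scores, key=scores.get)
    return best, layer_rdms[best], scores[best]
```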
We first found that deeper layers of both scene-classification and scene-parsing models showed a higher correlation with neural responses than earlier layers. A possible explanation for this trend is that deeper layers are more task-specific, while early layers learn low-level visual features largely irrespective of the task and therefore carry little task-specific information. The observation that the highest correlations with PPA and OPA were found in deeper layers of the networks further supports this idea. We report the layer selected as the representative RDM for each model in Table 1. To compare which model RDM (scene parsing/scene-classification/random) explains the behavioral/neural RDM better, we compared the correlation values of all three RDMs with the behavioral/neural RDM (illustrated in Figure 1c for scene classification vs. scene parsing).
Table 1. Correlation value and layer information of the layer that showed the highest correlation with a particular brain area or behavior for all the models considered (AlexNet, ResNet18 and ResNet50).
Task | Model | OPA | PPA | NAM
---|---|---|---|---
Scene classification | Alexnet | 0.395 (fc7) | 0.369 (fc7) | 0.122 (conv5)
Scene classification | Resnet18 | 0.391 (fc) | 0.397 (fc) | 0.114 (block4)
Scene classification | Resnet50 | 0.364 (fc) | 0.365 (fc) | 0.111 (block4)
Scene parsing | Alexnet | 0.393 (d2) | 0.393 (d1) | 0.162 (conv5)
Scene parsing | Resnet18 | 0.426 (d1) | 0.415 (d1) | 0.198 (block4)
Scene parsing | Resnet50 | 0.415 (d1) | 0.418 (d1) | 0.165 (d2)
Random | Alexnet | 0.169 (fc7) | 0.167 (fc6) | 0.015 (fc7)
Random | Resnet18 | 0.194 (fc) | 0.176 (fc) | 0.041 (block4)
Random | Resnet50 | 0.152 (fc) | 0.142 (fc) | 0.042 (block4)
To investigate whether the scene components detected by scene parsing models reveal differences in the representations of scene-selective ROIs, we computed the correlation between each scene component RDM and a neural RDM and compared which scene component better explains a particular ROI.
2.5. Variance Partitioning
While in its basic formulation RSA provides insights about the degree of association between a DNN RDM and a behavioral/neural RDM, it does not provide a full picture of how multiple DNN RDMs together explain the behavioral/neural RDM. Therefore, we applied a variance partitioning analysis that determines the unique and shared contribution of individual DNN RDMs in explaining the behavioral/neural RDM when considered in conjunction with the other DNN RDMs. Further, variance partitioning allows the selection of multiple layers from a single model as joint predictors of the neural or behavioral RDM.
We illustrate the variance partitioning analysis in Figure 1d. We assigned a behavioral/neural RDM as the dependent variable (referred to as predictand). We then assigned two model (DNN/scene component) RDMs as the independent variables (referred to as predictors). Then, we performed three multiple regression analyses: one with both independent variables as predictors, and two with individual independent variables as the predictors. Then, by comparing the explained variance (r2) of a model used alone with the explained variance when it was used with other models, the amount of unique and shared variance between different predictors can be inferred (Figure 1d). In the case of three independent variables, we performed seven multiple regression analyses: one with all 3 independent variables as predictors, three with different combinations of 2 independent variables as predictors, and three with individual independent variables as the predictors.
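For the two-predictor case, the unique and shared variance terms follow directly from the R² values of the three regressions. A minimal sketch, assuming each RDM has been vectorized (e.g., its lower triangle) before regression; the function names are ours:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r2(predictors, predictand):
    """R^2 of a multiple regression of the predictand RDM on predictor RDMs.
    Each argument is the vectorized lower triangle of an RDM."""
    X = np.column_stack(predictors)
    y = np.asarray(predictand)
    return LinearRegression().fit(X, y).score(X, y)

def variance_partitioning_two(pred_a, pred_b, target):
    """Unique and shared variance for two predictor RDMs (cf. Figure 1d)."""
    r2_ab = r2([pred_a, pred_b], target)   # both predictors together
    r2_a = r2([pred_a], target)            # predictor A alone
    r2_b = r2([pred_b], target)            # predictor B alone
    unique_a = r2_ab - r2_b                # variance explained only by A
    unique_b = r2_ab - r2_a                # variance explained only by B
    shared = r2_a + r2_b - r2_ab           # variance explained by both
    return unique_a, unique_b, shared
```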
To compare scene parsing and scene-classification models with a randomly initialized model as the baseline, the predictors were the respective DNN RDMs and the predictands were the behavioral and neural RDMs. We first performed the variance partitioning analysis using the representative RDMs selected for each model via RSA (Table 1). In a second analysis, we relaxed the criterion of representing a model by a single layer RDM and instead used multiple layer RDMs together to represent the model. We selected all the layer RDMs (Table 1) from each model (scene-classification/scene-parsing/random) and used them as predictors for variance partitioning.
To compare different scene components, the predictors were the respective scene component RDMs and predictands were the neural RDMs of scene-selective ROIs and behavioral RDM.
2.6. Statistical Testing
We applied nonparametric statistical tests to assess statistical significance, in a similar manner to a previous related study [Bonner and Epstein, 2018]. We assessed the significance of the correlation between neural/behavioral responses and a DNN through a permutation test, permuting the conditions randomly 5000 times in either the neural ROI RDM or the DNN RDM.
From the distribution obtained from these permutations, we calculated p-values as one-sided percentiles. We calculated the standard errors of these correlations by randomly resampling the conditions in the RDMs for 5000 iterations. We used resampling without replacement, subsampling 90% of the conditions (45 out of 50) in the RDMs. We used an equivalent procedure for testing the statistical significance of the differences in correlation and in unique variance between two models. The statistical outcomes were corrected for multiple comparisons using false discovery rate (FDR) correction with a threshold of 0.05.
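A minimal sketch of the permutation test follows, assuming square RDMs and using the lower triangle for the correlation; conditions are shuffled jointly across rows and columns of one RDM. It illustrates the logic rather than reproducing the original analysis code.

```python
import numpy as np
from scipy.stats import spearmanr

def permutation_pvalue(model_rdm, neural_rdm, n_perm=5000, seed=0):
    """One-sided permutation p-value for the Spearman correlation between a
    model RDM and a neural/behavioral RDM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = neural_rdm.shape[0]
    idx = np.tril_indices(n, k=-1)
    observed = spearmanr(model_rdm[idx], neural_rdm[idx]).correlation
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(n)
        shuffled = neural_rdm[np.ix_(perm, perm)]   # permute conditions
        null[i] = spearmanr(model_rdm[idx], shuffled[idx]).correlation
    # proportion of null correlations at least as large as the observed one
    return (np.sum(null >= observed) + 1) / (n_perm + 1)
```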
3. Results
3.1. Are scene parsing models suitable to account for scene-selective brain responses and scene-related behavior?
We investigated the potential of DNNs trained on scene parsing to predict scene-related human brain activity, focusing the analysis on the scene-selective regions OPA and PPA. To put the result into context, we compared the predictive power of DNNs trained on scene parsing to DNNs trained on scene classification, which are currently the default choice in investigating scene-related brain responses and behavior [Bonner and Epstein, 2018, Groen et al., 2018, Cichy et al., 2017], and against a randomly initialized DNN as baseline. To ensure that the results can be attributed to differences in the task rather than being specific to a particular network architecture, we investigated three different network architectures: Alexnet, Resnet18, and Resnet50.
We applied representational similarity analysis (RSA) to relate the DNN models (scene classification, scene parsing, and random) with the brain responses in OPA and PPA (Figure 2a). We found that DNNs trained on scene parsing significantly predicted brain activity in all investigated regions (p = 0.0001 for Alexnet, p = 0.0001 for Resnet18, and p = 0.0001 for Resnet50). This shows that they are suitable candidate models for the investigation of brain function. We further found that DNNs trained on scene parsing explain as much variance in scene-selective regions as DNNs trained on scene categorization, or more. We note that both scene parsing and scene classification DNNs explain significantly more variance in scene-selective regions than a randomly initialized DNN across different architectures (p < 0.001 for all comparisons).
If scene parsing models are suitable models for predicting responses in scene-selective brain regions, and these regions underlie scene understanding, the models should predict scene-related behavior, too. We considered navigational affordance behavior operationalized as the angular histogram of navigational trajectories that participants indicated for the stimulus set. Paralleling the results on brain function, the investigation of behavior showed that DNNs trained on scene parsing predicted behavior significantly (p = 0.0005 for Alexnet, p = 0.0005 for Resnet18, and p = 0.0005 for Resnet50), and also significantly better than DNNs trained on scene-classification (p = 0.01 for Alexnet, p = 0.0005 for Resnet18, and p = 0.03 for Resnet50). Similar to results on brain function, we note that both scene parsing and scene classification DNNs explain significantly higher variance in behavior than a randomly initialized DNN across different architectures.
While the RSA results above provided insights about the degree of association between a DNN RDM and a behavioral/neural RDM, they cannot tell how multiple DNN RDMs together predict the behavioral/neural RDM. For this more complete picture, we conducted variance partitioning to reveal the unique variance of the neural/behavioral RDMs explained by the scene-classification and scene parsing DNN RDMs (Figure 2b). We observe from Figure 2b that scene parsing DNNs explain more variance uniquely (OPA: p = 0.0003 for Alexnet, p = 0.0002 for Resnet18, and p = 0.0002 for Resnet50; PPA: p = 0.0002 for Alexnet, p = 0.0003 for Resnet18, and p = 0.0002 for Resnet50) than scene-classification DNNs for both scene-selective ROIs. We further observe from the Venn diagrams in Figure 2b that for scene-selective neural RDMs most of the explained variance is shared between scene-classification and scene parsing DNNs across all three architectures. These results suggest that scene parsing DNNs might be a better choice for investigating scene-selective neural responses than scene-classification DNNs.
We observe for behavior that the scene-classification DNNs explain nearly no unique variance, while on the other hand scene parsing DNNs explain significantly higher unique variance (p = 0.001 for Alexnet, p = 0.0002 for Resnet18, and p = 0.0005 for Resnet50) across all three architectures (see Figure 2b for unique variance and Venn Diagrams illustrating both unique and shared variances). The results suggest that since the scene parsing task takes into account the spatial arrangement of constituent components in the scene, a scene parsing DNN explains behavioral affordance assessments better than a scene classification DNN.
In both of the above analyses, we selected the RDM of the layer of a model that showed the highest RSA correlation with a neural/behavioral RDM as the representative RDM for the model, but this raises the question of how well a single layer represents the model as a whole. To answer this question, we selected multiple layer RDMs from each model (scene-classification/scene parsing/random) and applied variance partitioning to find out how much of the variance in the neural/behavioral RDM is explained uniquely by a given model represented by multiple layer RDMs. We report the results in Figure 2c and observe a similar trend as in Figure 2b, which used the layer RDMs with the maximum correlation. The results suggest that the layers that showed maximum correlation with a neural/behavioral RDM were deeper in the network and therefore carry more task-specific representations than earlier layers of the network. The variance explained by earlier layers contributes mostly to the shared variance of the neural/behavioral RDMs explained by all models together.
Together, these results establish DNNs trained on scene parsing tasks as a promising model class for investigating scene-selective cortical regions in the human brain and for navigational behavior related to the spatial organization of scene components.
3.2. State-of-the-art scene parsing models for investigating scene components represented in the human brain
Models trained on the scene parsing task offer the possibility to selectively investigate which of the scene components (such as wall, ceiling or floor) they encode. But, what is the most suitable scene parsing model to compare to the brain? In the model comparison above our choice was guided by making models as similar to each other as possible in complexity to rule out that observed differences in accounting for brain activity are simply due to differences in model complexity. However, for in-depth investigation of scene-selective areas using scene-components it is crucial to choose models that detect the scene-components with as high accuracy as possible. Therefore, we compared the performance of different scene parsing models qualitatively and quantitatively to select the most accurate ones to compare with the responses of scene-selective brain areas.
For the performance comparison on the scene parsing task, we chose the three models used above (Alexnet, Resnet18, and Resnet50), plus three state-of-the-art models for scene parsing: HRNetv2, Upernet101, and Resnet101-PPM. The state-of-the-art models achieve high performance by merging low-resolution feature maps with high-resolution feature maps to generate results at high spatial resolution. We illustrate their parsing performance qualitatively by examining their output on an example image (Figure 3). We observed that the scene parsing output generated by Resnet50 had smoother and less precise component boundaries, while Resnet101-PPM, Upernet101, and HRNetv2 detected components accurately with precise boundaries in their outputs.
To quantitatively compare model performance, we evaluated all models on the ADE20k validation dataset by calculating the mean intersection over union (mIoU) score for detecting all components across all images. We report the IoU scores of the individual components that were present in all images of the stimulus set: wall, ceiling, and floor. The results are reported in Table 2. They indicate that the state-of-the-art models beat the complexity-matched models by a margin of about 12% in overall performance. Therefore, for the in-depth investigation of representations in scene-selective brain areas we used the top 3 models, i.e., HRNetv2, Upernet101, and Resnet101-PPM.
Table 2. Scene parsing performance on the ADE20k validation set. The table shows the IoU score for detecting the selected components along with the overall mIoU for different scene parsing models, listed in decreasing order of performance.
Model | Wall | Ceiling | Floor | Overall mIoU
---|---|---|---|---
HRNetv2 | 0.7538 | 0.8278 | 0.7811 | 0.4320
Upernet101 | 0.7503 | 0.8265 | 0.7772 | 0.4276
Resnet101-PPM | 0.7453 | 0.8195 | 0.7659 | 0.4257
Resnet50 | 0.6422 | 0.7356 | 0.6642 | 0.3020
Resnet18 | 0.6223 | 0.6997 | 0.6627 | 0.2741
AlexNet | 0.5857 | 0.6810 | 0.6105 | 0.2306
3.3. Scene parsing networks reveal a differential contribution of wall, floor and ceiling components to representations in scene-selective regions
We investigated whether the scene components detected by a scene parsing DNN reveal a difference in the representational content of scene-selective ROIs. We focused on the three scene components - wall, floor, and ceiling - that were present in all the images of the stimulus set and compared them with scene-selective ROIs OPA and PPA and behavioral model NAM.
We first report the RSA results (Figure 4a) of comparing each scene component RDM with OPA, PPA, and NAM for the three state-of-the-art architectures HRNetv2, Upernet101, and Resnet101-PPM. We found that the correlation of the OPA RDM with the floor RDM was significantly higher than with the wall (p = 0.02 for HRNetv2, p = 0.05 for Upernet101, p = 0.0002 for Resnet101-PPM) and ceiling (p = 0.01 for HRNetv2, p = 0.01 for Upernet101, p = 0.0002 for Resnet101-PPM) RDMs, and that the correlation of the PPA RDM with the wall (p = 0.006 for HRNetv2, p = 0.001 for Upernet101, p = 0.01 for Resnet101-PPM) and floor (p = 0.006 for HRNetv2, p = 0.003 for Upernet101, p = 0.001 for Resnet101-PPM) RDMs was significantly higher than with the ceiling RDM. NAM, which represents the navigational paths in the scenes, showed the highest correlation with the floor RDM, which was significantly higher than the correlation with the wall (p = 0.0002 for HRNetv2, p = 0.0002 for Upernet101, p = 0.0002 for Resnet101-PPM) and ceiling (p = 0.0002 for HRNetv2, p = 0.0002 for Upernet101, p = 0.0002 for Resnet101-PPM) RDMs. The above results held consistently across all investigated models. Together, this suggests that OPA and PPA have differential representational content with respect to scene components.
To tease out how much variance in OPA and PPA is explained by individual scene components, we applied variance partitioning to find the unique and shared variance of the OPA and PPA RDMs explained by the different scene component RDMs. We report the variance partitioning results showing the unique variance explained by each component, along with Venn diagrams illustrating both unique and shared variances, in Figure 4b. We observed that in the case of OPA, the floor RDM uniquely explains significantly more variance of the OPA RDM than the wall (p = 0.0002 for HRNetv2, p = 0.002 for Upernet101, p = 0.0008 for Resnet101-PPM) and ceiling (p = 0.0002 for HRNetv2, p = 0.002 for Upernet101, p = 0.0008 for Resnet101-PPM) RDMs. For PPA, the wall RDM uniquely explains significantly more variance of the PPA RDM than the floor (p = 0.001 for HRNetv2, p = 0.006 for Upernet101, p = 0.015 for Resnet101-PPM) and ceiling (p = 0.0005 for HRNetv2, p = 0.002 for Upernet101, p = 0.007 for Resnet101-PPM) RDMs. And for NAM, the floor RDM explains significantly more variance than the wall (p = 0.0002 for HRNetv2, p = 0.0002 for Upernet101, p = 0.0002 for Resnet101-PPM) and ceiling (p = 0.0002 for HRNetv2, p = 0.05 for Upernet101, p = 0.0002 for Resnet101-PPM) RDMs. Consistent with the RSA results above, this result reinforces the differences between OPA and PPA in the representation of scene components.
4. Discussion
In this study, we investigated the potential of scene parsing DNNs in predicting neural responses in scene-selective brain regions. We found that scene parsing DNNs predicted responses in scene-selective ROIs OPA and PPA better than scene-classification DNNs. We further showed that scene components detected by scene parsing DNNs revealed differences in representational content of OPA and PPA.
Previous work using DNNs to predict neural responses has emphasized the importance of the task for which the DNNs were optimized [Yamins and DiCarlo, 2016, Khaligh-Razavi and Kriegeskorte, 2014, Richards et al., 2019]. We argue that the higher unique variance of scene-selective neural responses explained by scene parsing DNNs over scene-classification DNNs is due to such a difference in tasks. The scene classification task aims at identifying the category of the scene irrespective of the spatial organization of the different components and objects in the scene. In contrast, the scene parsing task requires pixelwise labeling of the whole image and thus a more comprehensive understanding of the scene in terms of how different objects and components are spatially organized within a given scene. The higher variance of scene-selective neural responses explained by the scene-parsing DNNs, which encode spatial structure, suggests that scene-selective neural responses also encode the spatial structure of the scene, entailing the position of different objects and components. This view is further supported by evidence from the neuroimaging literature [Kravitz et al., 2011, Park et al., 2011] showing that scene-selective regions represent the spatial layout of scenes. The information about the spatial structure of the scene might be required by the brain to plan interactions within the scene, such as navigating to a target, reaching for objects, or performing visual search.
Our in-depth analysis using scene components revealed differential representations in OPA and PPA. We observed that OPA had a significantly higher correlation with floor than with ceiling and wall. A possible explanation for this difference is OPA's involvement in detecting navigational affordances [Bonner and Epstein, 2017], for which the floor plays a major role; this could also explain the high sensitivity of OPA to stimulation in the lower visual field [Silson et al., 2015]. In contrast, we found that PPA shows a significantly higher correlation with wall as well as floor compared to ceiling, which could explain why PPA is sensitive to the upper visual field [Silson et al., 2015]. A plausible explanation is that detecting the wall is relevant for identifying the type of room, its texture [Henriksson et al., 2019, Park and Park, 2017], and landmarks [Troiani et al., 2012].
Previous work has already aimed at determining the nature of OPA representations by computational modelling [Bonner and Epstein, 2018] on the same fMRI dataset that was used in our study. For this, the authors determined for a DNN trained on scene categorization which individual DNN units most correlated with NAM and OPA and visualized those units using receptive field mapping and segmentation from Zhou et al. [2014]. The units extracted corresponded mostly to uninterrupted portions of floor and wall or the junctions between floor and wall. The results align with our findings that floor components explain OPA responses uniquely while the wall units could be attributed to shared variance explained by floor and wall components in our study. However, arguably segmentation maps extracted using the receptive field mapping method are less interpretable as they cannot be directly assigned to meaningful entities without additional human annotations [Zhou et al. 2014] or by comparing with ground-truth segmentation maps of meaningful entities [Bau et al. 2017]. Further, assigning a segmentation map of a unit obtained using receptive field mapping to a meaningful entity using ground-truth segmentation maps leads to less accurate segmentation maps [Bau et al. 2017] compared to the segmentation from a scene parsing DNN [Xiao et al. 2018] trained to segment components using ground truth segmentation maps. Thus, we believe our approach using scene-components generated by a scene parsing DNN to be particularly well suited to reveal the representational content of scene-selective brain regions.
In our analysis investigating navigational affordance behavioral responses, we found that scene-parsing DNNs explained significantly more variance of the behavioral RDM than the scene classification DNNs. A plausible explanation for this finding is that determining navigational affordances requires spatial understanding of the scene, including floor and obstacle detection. The scene parsing task explicitly requires such spatial understanding, whereas the scene classification task can be performed without explicitly detecting the spatial organization of objects and components in the scene. Further, the layers that show the highest correlation with NAM in scene classification models are late convolutional layers, which preserve spatial information of the image, suggesting that spatial information is required to explain NAM. In the comparison of NAM with scene components we found that NAM is best explained by the floor component, while the other components (wall, ceiling) explained no significant unique variance. This finding further reinforces our argument that scene-parsing DNNs explain NAM responses better because the task requires recovering the spatial layout of objects and components. While in this study we focused on showing the advantage of scene-parsing DNNs over scene-classification DNNs in explaining scene-selective neural and navigational affordance-related behavioral responses, our results do not rule out scene-classification DNNs as useful models to explain semantic behavioral responses related to scene category or semantic similarity.
A limitation of our study is that the differences revealed between OPA and PPA are based on the analysis of only 3 scene components. This is due to the limitations of the stimulus set, in which only 3 components were present in all 50 images. Future work should exploit the full richness of scene components provided by DNNs trained on scene parsing. For this, a stimulus set would have to be designed that contains many components in all of its images. Another possible direction would be to use stimuli with annotations and use these annotations directly to compare with the fMRI responses. The advantage of using a scene-parsing DNN over human annotations is that once the DNN is trained on scene parsing, the components can be extracted for a new set of stimuli at no additional cost, whereas annotations require human effort for every new stimulus set.
It is important to point out that our findings could be influenced by the task participants performed while inside the scanner. The participants were required to identify whether the presented image was a bathroom or not. Most fMRI studies do not examine the influence of the task, and opinion is divided on whether different tasks result in different representations. For instance, some studies [Duncan, 2010; Woolgar et al., 2011] suggest that the parietal and prefrontal cortex are involved in representing task context while scene processing is attributed to the occipitotemporal cortex. Recently, several studies [Harel et al., 2014; Erez and Duncan, 2015; Lowe et al., 2016; Bracci et al., 2017; Bugatus et al., 2017; Vaziri-Pashkam and Xu, 2017; Hebart et al., 2018] have shown evidence of task influence in the occipitotemporal cortex. Therefore, a promising direction for future research is to test whether our findings replicate in another fMRI study in which participants performed a different task.
To summarize, our findings provided evidence supporting the use of DNNs trained on the scene parsing task as a promising tool to predict and understand activity in the visual brain. We believe that this approach has the potential to be applied widely, providing interpretable results that give insights into how the human visual cortex represents the visual world.
Acknowledgements
We thank Agnessa Karapetian and Greta Häberle for their valuable comments on the manuscript. G.R. thanks the support of the Alfons and Gertrud Kassel Foundation. R.M.C. is supported by Deutsche Forschungsgemeinschaft (DFG) grants (CI241/1-1, CI241/3-1) and the European Research Council Starting Grant (ERC-2018-StG 803370).
Footnotes
Conflict of interest. All authors declare that they have no conflicts of interest.
References
- Bau David, Zhou Bolei, Khosla Aditya, Oliva Aude, Torralba Antonio. Network dissection: Quantifying interpretability of deep visual representations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 6541–6549.
- Bonner Michael F, Epstein Russell A. Coding of navigational affordances in the human visual system. Proceedings of the National Academy of Sciences. 2017;114(18):4793–4798. doi: 10.1073/pnas.1618228114.
- Bonner Michael F, Epstein Russell A. Computational mechanisms underlying cortical responses to the affordance properties of visual scenes. PLoS Computational Biology. 2018;14(4):e1006111. doi: 10.1371/journal.pcbi.1006111.
- Cichy Radoslaw Martin, Khosla Aditya, Pantazis Dimitrios, Torralba Antonio, Oliva Aude. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports. 2016;6:27755. doi: 10.1038/srep27755.
- Cichy Radoslaw Martin, Khosla Aditya, Pantazis Dimitrios, Oliva Aude. Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks. NeuroImage. 2017;153:346–358. doi: 10.1016/j.neuroimage.2016.03.063.
- Cichy Radoslaw Martin, Kaiser Daniel. Deep neural networks as scientific models. Trends in Cognitive Sciences. 2019;4:305–317. doi: 10.1016/j.tics.2019.01.009.
- Dilks Daniel D, Julian Joshua B, Paunov Alexander M, Kanwisher Nancy. The occipital place area is causally and selectively involved in scene perception. Journal of Neuroscience. 2013;33(4):1331–1336. doi: 10.1523/JNEUROSCI.4081-12.2013.
- Epstein Russell, Kanwisher Nancy. A cortical representation of the local visual environment. Nature. 1998;392(6676):598. doi: 10.1038/33402.
- Greene Michelle R, Oliva Aude. The briefest of glances: The time course of natural scene understanding. Psychological Science. 2009;20(4):464–472. doi: 10.1111/j.1467-9280.2009.02316.x.
- Groen Iris IA, Greene Michelle R, Baldassano Christopher, Li Fei-Fei, Beck Diane M, Baker Chris I. Distinct contributions of functional and deep neural network features to representational similarity of scenes in human brain and behavior. eLife. 2018;7:e32962. doi: 10.7554/eLife.32962.
- Hasson Uri, Harel Michal, Levy Ifat, Malach Rafael. Large-scale mirror-symmetry organization of human occipito-temporal object areas. Neuron. 2003;37(6):1027–1041. doi: 10.1016/s0896-6273(03)00144-2.
- He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 770–778.
- Hebart Martin N, Bankson Brett B, Harel Assaf, Baker Chris I, Cichy Radoslaw M. The representational dynamics of task and object processing in humans. eLife. 2018;7:e32816. doi: 10.7554/eLife.32816.
- Henriksson Linda, Mur Marieke, Kriegeskorte Nikolaus. Rapid invariant encoding of scene layout in human OPA. Neuron. 2019. doi: 10.1016/j.neuron.2019.04.014.
- Khaligh-Razavi Seyed-Mahdi, Kriegeskorte Nikolaus. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology. 2014;10(11):e1003915. doi: 10.1371/journal.pcbi.1003915.
- Kravitz Dwight J, Peng Cynthia S, Baker Chris I. Real-world scene representations in high-level visual cortex: it's the spaces more than the places. Journal of Neuroscience. 2011;31(20):7322–7333. doi: 10.1523/JNEUROSCI.4588-10.2011.
- Kriegeskorte Nikolaus, Mur Marieke, Bandettini Peter A. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience. 2008;2:4. doi: 10.3389/neuro.06.004.2008.
- Krizhevsky Alex, Sutskever Ilya, Hinton Geoffrey E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 2012:1097–1105.
- Li Fei-Fei, Iyer Asha, Koch Christof, Perona Pietro. What do we perceive in a glance of a real-world scene? Journal of Vision. 2007;7(1):10. doi: 10.1167/7.1.10.
- Lin Tsung-Yi, Dollar Piotr, Girshick Ross, He Kaiming, Hariharan Bharath, Belongie Serge. Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 2117–2125.
- O'Craven Kathleen M, Kanwisher Nancy. Mental imagery of faces and places activates corresponding stimulus-specific brain regions. Journal of Cognitive Neuroscience. 2000;12(6):1013–1023. doi: 10.1162/08989290051137549.
- Park Jeongho, Park Soojin. Conjoint representation of texture ensemble and location in the parahippocampal place area. Journal of Neurophysiology. 2017;117(4):1595–1607. doi: 10.1152/jn.00338.2016.
- Park Soojin, Brady Timothy F, Greene Michelle R, Oliva Aude. Disentangling scene content from spatial boundary: complementary roles for the parahippocampal place area and lateral occipital complex in representing real-world scenes. Journal of Neuroscience. 2011;31(4):1333–1340. doi: 10.1523/JNEUROSCI.3885-10.2011.
- Potter Mary C. Meaning in visual search. Science. 1975;187(4180):965–966. doi: 10.1126/science.1145183.
- Richards Blake A, Lillicrap Timothy P, Beaudoin Philippe, Bengio Yoshua, Bogacz Rafal, Christensen Amelia, Clopath Claudia, Costa Rui Ponte, de Berker Archy, Ganguli Surya, et al. A deep learning framework for neuroscience. Nature Neuroscience. 2019;22(11):1761–1770. doi: 10.1038/s41593-019-0520-2.
- Russell Bryan C, Torralba Antonio, Murphy Kevin P, Freeman William T. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision. 2008;77(1-3):157–173.
- Silson Edward Harry, Chan Annie Wai-Yiu, Reynolds Richard Craig, Kravitz Dwight Jacob, Baker Chris Ian. A retinotopic basis for the division of high-level scene processing between lateral and ventral human occipitotemporal cortex. Journal of Neuroscience. 2015;35(34):11921–11935. doi: 10.1523/JNEUROSCI.0137-15.2015.
- Sun Ke, Zhao Yang, Jiang Borui, Cheng Tianheng, Xiao Bin, Liu Dong, Mu Yadong, Wang Xinggang, Liu Wenyu, Wang Jingdong. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514. 2019.
- Thorpe Simon, Fize Denis, Marlot Catherine. Speed of processing in the human visual system. Nature. 1996;381(6582):520–522. doi: 10.1038/381520a0.
- Troiani Vanessa, Stigliani Anthony, Smith Mary E, Epstein Russell A. Multiple object properties drive scene-selective regions. Cerebral Cortex. 2012;24(4):883–897. doi: 10.1093/cercor/bhs364.
- Xiao Tete, Liu Yingcheng, Zhou Bolei, Jiang Yuning, Sun Jian. Unified perceptual parsing for scene understanding. Proceedings of the European Conference on Computer Vision (ECCV); 2018. pp. 418–434.
- Yamins Daniel LK, Hong Ha, Cadieu Charles F, Solomon Ethan A, Seibert Darren, DiCarlo James J. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences. 2014;111(23):8619–8624. doi: 10.1073/pnas.1403112111.
- Yamins Daniel LK, DiCarlo James J. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience. 2016;19(3):356. doi: 10.1038/nn.4244.
- Zhao Hengshuang, Shi Jianping, Qi Xiaojuan, Wang Xiaogang, Jia Jiaya. Pyramid scene parsing network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 2881–2890.
- Zhou Bolei, Khosla Aditya, Lapedriza Agata, Oliva Aude, Torralba Antonio. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856. 2014.
- Zhou Bolei, Lapedriza Agata, Khosla Aditya, Oliva Aude, Torralba Antonio. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. pp. 1452–1464.
- Zhou Bolei, Zhao Hang, Puig Xavier, Fidler Sanja, Barriuso Adela, Torralba Antonio. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision. 127(3):302–321.