1 Introduction
Despite their relative simplicity, Bayesian networks have found a prominent place in the probabilistic modeling literature due to their attractive properties with regard to inference and interpretation. Also referred to as belief networks or Bayes nets, they are a class of models that borrow formalism from graph theory in order to streamline the process of creating probability models enforcing a desired correlation structure between modeled variables. Formally, a Bayesian network is a model used to parameterize a joint probability distribution
P over
p random variables in terms of a directed graph
\(\mathcal {G}=(\mathcal {X},E)\). Each random variable is represented as a member of a set of nodes
\(\mathcal {X}=\lbrace X_1,\ldots ,X_p\rbrace\) and elements of
E are edges connecting pairs of nodes. Any two nodes lacking an edge correspond to a pair of random variables which are conditionally independent; these edges are dependency statements asserting that
\(X_i \not\perp X_j | \mathcal {X}-\lbrace X_i, X_j\rbrace ,\) where
\(\not\perp\) denotes the conditional independence relation given a set of conditioning variables. Then
\(Pa^\mathcal {G}(X_j)\), the parents of the
jth node with respect to
\(\mathcal {G}\), are defined as the subset of nodes
\(\mathcal {X}\) which are the starting points of an edge in
\(\mathcal {G}\) ending in
\(X_j\). In the nomenclature of the graphical modeling literature, it is possible to represent nearly all independencies in
P via connections between nodes in
\(\mathcal {G}\) [
93] save for a small set of pathological cases. Later on, we will discuss situations in which there we are interested in studying a multivariate spatial process. When this occurs, we use the superscript
i to index spatial location and the subscript
j to index the process or variable being considered. Thus,
\(X^{(i)}_j\) refers to the
jth variable observed at the
ith location in space. When dealing with graphical models over a temporal or spatial domain, it is common to use terminology referring to a network
template or
class which then is
instantiated multiple times to allow for distinct random variables representing the same conceptual process to be modeled at different points in time or space representing different conceptual objects which share commonality [
73]. For a single data point, the Bayesian network factorization property requires that
P can be written as
and thus allows a probability distribution over several random variables to be specified in terms of multiple lower-dimensional
conditional probability distributions (
CPDs). More generally, there exist
undirected graphical models which relax the requirement of directed edges. The range of models and algorithms contained within the scope of graphical modeling is very expansive, including factor analysis, filtering methods, regressions [
99], and even variants of deep neural networks used in machine learning [
66,
119]. Here, we clarify that the scope of this review article covers those models for which a freeform specification of conditional independencies between variables is possible. This includes, for instance, the commonly-used model formulation often treated as synonymous with “Bayesian network” which employs tabular CPDs for discrete data. It does not, however, include the class of
generalized linear models (
GLMs) as the conditional independence structure within this class is generally fixed to encode dependence only between covariates and response variables while omitting dependencies between covariates. For Gaussian Bayesian networks, these conditional independence statements correspond precisely to null entries in the inverse covariance or precision matrix.
Bayesian networks are closely related to structural equation models [
41] and are distinguished from other types of graphical models by the usage of directed edges instead of undirected, symmetric edges to make more specific statements about conditional independencies between variables resulting from the graphical structure. Whereas undirected graphical models such as the Ising/autologistic model [
14] and the Boltzmann machine [
56] do not make statements about the direction of edges, Bayesian networks may use this directionality to encode causal information [
110]. While it may not be the case that every Bayesian network analysis requires causal interpretation of the links, it is often true that these are used to represent a causal mechanisms [
64]. A sparser Bayesian network graph with fewer edges implies that there are fewer cross-variable correlations. Consequently, a sparsely connected Bayesian network composed of multiple smaller subgraphs implies a joint probability distribution that is more easily factorized into a small number of low dimensional CPDs. This modularity is greatly helpful in statistical inference as many algorithms such as the junction tree method [
70,
109] and Gibbs sampling [
58] are either easier to implement or computationally less expensive with additional conditional independence assumptions.
There is no restriction on the support of the random variables used, though discrete variables are often favored in some instances due to the existence of computationally efficient algorithms for exact inference in discrete Bayesian networks such as the
junction tree [
79] algorithm and
variable elimination. While the junction tree algorithm can be extended to the class of conditional linear Gaussian models for continuous data [
78], there appears to be no general-purpose and computationally efficient exact inference method for continuous Bayesian networks in a manner analogous to discrete Bayesian networks. In the discrete setting, perhaps the most commonly used parametrization of a Bayesian network CPD is a
tabular one in which every element of
\(Val(Pa^{\mathcal {G}}(X_j)))\) is associated with a distinct CPD for the values of
\(X_j\). Another benefit of Bayesian networks is that their creation usually involves a structural decomposition of a probability distribution over several dimensions into lower dimensional conditional distributions. This bears close parallels to methods naturally used to improve decision-making via the decomposition of problems into smaller pieces [
7]. Bayesian networks bear strong similarities to earlier attempts to introduce uncertainties into rule-based systems [
64].
A common workflow for modeling data with Bayesian networks involves the steps of identifying a suitable model structure, applying prior information, performing parameter estimation, interpreting the parameters, and/or generating predictions for missing data or altogether new data points. These steps are explained in more detail in the context of spatial modeling below. A summary of the notation commonly used in this work is provided at Table
1.
Selecting a Structure. At the beginning of the modeling process, the analyst must determine which of the variables
\(X_1,\ldots ,X_p\) are conditionally independent given all other variables. As the conditional independence relation is symmetric, there are
\(p(p-1)/2\) possible conditional independence statements and a graph structure
\(\mathcal {G}\) is created by selecting pairs of variables to be linked by a directed edge. If no sensible graph can be constructed a priori, or if it is desired that the graph itself be left as a free parameter, then
structure learning must be employed to identify a suitable graph on the basis of an optimality criterion such as the Akaike or Bayesian information criterion, or the Bayesian Dirichlet-equivalent uniform score [
55]. When working with spatial data, a complication that arises is that there does not appear to be a standard recipe for selecting a structure over both cross-variable and spatial modes of correlation. One may instantiate a Bayesian network for which every variable at every spatial location receives a node in the graph, but this implies an enormous space of potential graph structures. This can lead to a challenging graph selection problem even if only a single variable is used across a large spatial domain; we refer the reader to [
123] for an example of structure learning under these conditions. Additionally, the edges in
\(\mathcal {G}\) are directed, implying a parent-and-child ordering which may not conform with the commonly-held assumption in spatial statistics that spatial correlation is most easily captured via an undirected, unordered correlation model.
Choosing Conditional Probability Distributions. Once a suitable Bayes net graph has been identified, the next step is to construct functions mapping the outcomes of parent variables to probabilities of conditional outcomes of child variables dependent upon a CPD parameter vector
\(\mathbf {\theta }\). For data comprising discrete parent and discrete child variables, a common choice appears to be a tabular CPD for which every element in
\(Val(Pa^{\mathcal {G}}(X_j))\) has its own
\(\vert Val(X_j)\vert\)-dimensional multinomial probability distribution for the values assumed by
\(X_j\). With a tabular CPD, it is also possible to set or constrain [
102] some of the parameter values to reflect prior information. For modeling spatial data in general, it is often desirable to allow for correlation between spatially proximal random variables. Therefore, a central question in the application of Bayesian networks to spatial data is the development of methods to produce CPDs which retain desirable properties such as simplicity and interpretability, while also modeling spatial correlation.
Parameter Estimation. The selection of BN parameters
\(\mathbf {\theta }\) from a set of possible values
\(\Theta\) for the CPDs from the previous section with the aid of observed data
\(\mathbf {X}\) can be done via the calculation of an estimator as a solution to the following problems:
Maximum likelihood \(\mathop{\text{argmax}}\limits _{\mathbf {\theta } \in \Theta } p(\mathbf {X}\vert \mathbf {\theta })\)
Maximum a posteriori \(\mathop{\text{argmax}}\limits _{\mathbf {\theta } \in \Theta } p(\mathbf {X}\vert \mathbf {\theta })p(\mathbf {\theta })\)
Bayes estimator (mean square error) \(E[\mathbf {\theta }\vert \mathbf {X}],\) \(p(\mathbf {\theta }\vert \mathbf {X}) \propto p(\mathbf {X}\vert \mathbf {\theta })p(\mathbf {\theta })\)
with the posterior mean
\(E[\mathbf {\theta }\vert \mathbf {X}]\) yielding the solution to minimizing the mean squared difference between true and estimated values of
\(\mathbf {\theta }\), integrated over the prior distribution. Depending on the specific assumptions made for the Bayesian network, closed-form solutions may exist for any of the above problems. When there are missing data, expectation-maximization, and data augmentation algorithms [
129] are frequently used for the maximum likelihood and posterior mean estimation, respectively. In general, a variety of techniques suitable for general-purpose Bayesian modelings such as Gibbs sampling [
58] and variational Bayes [
15] are also applicable to posterior inference in Bayesian networks. A notable challenge with spatial data is that when the assumption of independent and identically distributed data is relaxed to allow for spatial correlation, closed-form solutions may no longer exist. Furthermore, a very large range of algorithms for estimating the parameters of directed or partially directed graphical models exist. A broad review of parameter estimation for partially directed graphical models and Bayesian networks can be found in [
120].
Interpretation. The values of the estimated parameters may convey information of interest regarding dependencies between variables and, therefore, are often the target of interpretation. For tabular CPDs, the conditional probability parameter estimates are interpretable as-is; parameter estimates yielding highly concentrated CPDs may convey additional conditional dependencies beyond those implied by the network’s graph structure.
Prediction. Much of the applied work covered later in this article emphasizes prediction as a key task for which Bayesian networks are employed. This may involve filling-in missing pieces of information for partially observed data, or generating predictions for new data. For geospatial analysts, it can be very convenient to have simple and efficient routines for generating predictions at every point in a large spatial domain, necessitating integration with a geographical information system (GIS) application.
Practitioners can benefit from spatial modeling by producing more reliable estimates of parameters obtained during inference as well as generating more accurate predictions. During parameter estimation, explicitly including spatial priors for parameters or error processes can help account for unobserved processes smoothly varying over space [
133]. This addition is known to reduce bias in estimates of model parameters in parametric models such as linear regressions [
80] and is likely to also help provide better estimates for parameters associated with Bayesian network models as well.
Related Work. For a more extensive coverage of the theory and application of Bayesian networks, we recommend the tutorial and overview papers by [
20] and [
112] as well as books and book chapters by [
109], [
74], and [
72]. Existing review articles for Bayesian networks applied to environmental risk assessment [
68], resource management [
10], ecosystem services [
76], and environmental modeling [
4] have substantial overlap with this work as spatial data is highly prominent in the ecological and environmental sciences. Additionally, we note that the material covered in this review shares much in common with existing commentary on Bayesian network-GIS integration [
67]; in this work, we do not focus on model-software coupling, instead of framing our review around the models and workflows used by researchers to match the advantages of Bayesian networks to the problem at hand involving spatial data. However, the scope of research covered by this review is not as large as the most general class of models described in part as Bayesian networks; we omit discussion of
dynamic Bayesian networks (
DBNs) as there exists a substantial research and pedagogical literature on spatiotemporal Bayesian modeling [
30,
140]. We note for researchers beginning to work with spatiotemporal data, it may not be entirely clear precisely where and why spatial and temporal data must be approached with different modeling strategies. This confusion may be further compounded by shared terminology for modeling techniques such as temporal autoregressions in the time series literature and conditional autoregressions for the spatial statistical literature. As an aside, we note that DBNs have been applied to data with spatial domains, albeit in only a single dimension [
97]. For readers unfamiliar with the nomenclature used for classifying graphical models, DBNs [
51,
96] subsumes many temporal probabilistic models including hidden Markov models and linear Gaussian models which are amenable to closed-form exact Bayesian inference using methods such as the Viterbi algorithm or Kalman filtering. These efficient algorithms are made possible by the fact that a causal directionality across time can naturally be inferred. Thus, such systems are easily represented as directed graphical models. However, existing probabilistic models of spatial systems such as conditional autoregressions and
Gaussian processes (
GPs)do not share the same property because, in general, it is not reasonable to assume that a total ordering exists on a set of spatial coordinates associated with observations. Put more plainly, there is no a priori reason to believe causality moves from east to west in the same way that it flows from past to present. This precludes naive usage of methods designed for temporal phenomena for spatial data and thus explains why the two subjects diverge into distinct subfields; the former assumes a natural directed graphical structure due to the asymmetry of time flowing from past to present, while the latter favors probability models capable of respecting the homogeneity and isotropy of space. Special cases may deviate from this rule; if there is known to be a strong directionality associated with certain spatial axes due to differences in elevation or consistent atmospheric patterns, it may make sense to assume some type of ordering amongst locations. Another exception to this rule is found in [
34] in which a spatial model is created by averaging over
all possible orderings of spatial locations is done to erase dependency upon a single ordering (Figure
3(C)).
We briefly describe some alternative approaches to probabilistic modeling of spatial data. As the usage of spatial Bayesian networks intersects substantially with spatial statistics, we find it helpful to recall that within the field of spatial statistics, spatial data usually fall into one of three categories [
29]:
areal data which is located on a lattice or network,
geostatistical data which exhibits cross-correlation dependent on interpoint distances, and
point pattern data in which the location of each point is itself a stochastic process. Examples of spatial statistical models designed for each type of data include autoregressions [
12], GPs/kriging [
90], and Poisson processes, respectively. Recently, analysis of richer functional spatial data such as movement trajectories has also become an object of study within spatial statistics [
57]. It is important to note that while many usages of spatial models deal with geographical or physical notions of space, spatial models are generally appropriate for data, which has coordinates in any space endowed with a distance metric. Some representative scenarios include data in non-Euclidean spaces [
31] with a treelike topology [
62,
137] or data associated with angular coordinates. GLMs with adjustments to include spatially correlated random effects [
14] or spatially correlated error variables are a mainstay in spatial econometrics [
5] and in spatial Bayesian modeling [
13]. The primary differentiating factor between spatial GLMs and Bayesian networks is that the former only models dependencies between the covariates and the response variable though their relative simplicity can be attractive for cases in which Bayes net features are unnecessary, while Bayesian networks typically model the joint correlation structure between all variables simultaneously. We note that geostatistical GP models are closely related to spatial GLMs as the prior distribution for a spatial component of a GLM can be represented using a GP defined on a spatial domain [
37]. When a large number of variables are observed simultaneously at many locations, it can be desirable to produce a lower-dimensional summary of the dataset in terms of spatially-smoothed principal components or factor loadings; spatial PCA [
35] and spatial factor analysis [
132] extend preexisting models to allow for spatial autocorrelation among factors. Spatial variants of machine learning models including neural networks [
125] and random forests [
50,
139] also exist; these models typically do not admit a simple interpretation of their parameters and also have little facility for handling missing data in comparison to Bayesian networks.
The objective of the remainder of this article is to help practitioners of probabilistic modeling for spatial data further understand existing research as it relates to the usage of Bayesian networks and identify promising future areas of research to help advance the field. To this end, we have structured this review article into sections separately discussing Bayes net models designed for spatial data, areas of application, and an analysis of promising research opportunities on the topic. A major challenge in using Bayesian networks absent strong prior knowledge is the identification of the network structure from data. This review will not include an in-depth discussion of structure learning despite its prominence [
28,
40,
46] in research on Bayesian networks. Since spatial data is naturally structured, albeit, in an undirected or symmetric manner, it does not appear to pose any special risks or opportunities for learning an unknown graphical structure from data. Additionally, we have been careful to screen out work that uses the descriptor
spatial without any substantial content relevant to spatial probability models or a spatial data workflow.
To aid the reader in understanding the myriad ways that researchers and practitioners have made use of Bayesian networks with spatial data, we constructed a pair of diagrams (Figures
2 and
3) with reference to a hypothetical scenario described in Figure
1, which captures the main dichotomy we find in our review; we can cleanly separate most work covered into (1) modifications of probability models to be faithful to spatial correlations exhibited in the data, and (2) modeling workflows using mostly off-the-shelf Bayesian network tools, with adjustments to the type of data or number of models employed to capture some aspects of the spatial nature of the problem.
3 Application Areas
As an overview article, the chief aim of this document is to help researchers identify connections and links to other disciplines which may use spatial Bayesian networks in a similar fashion. In this section, we discuss the various software tools and types of applications considered for these types of analyses.
An important factor in the widespread use of Bayesian networks in the applied sciences is the existence of several software packages focused on model construction, inference, and interpretation. Commercial software platforms which exist for modeling data with Bayesian networks include
Hugin [
88],
Analytica, and
Netica [
141]. Some of these platforms include direct integration with GISs software for analyzing spatial data. Proprietary software
Hugin has a plugin for QGIS,
Netica has its own within-platform GIS system, and
BayesiaLab [
27] has partial integration with Google Maps. Standalone software packages such as
bnspatial [
89] offer useful utilities for directly ingesting raster and shapefile data into a Bayesian network learning and predictions framework. For users interested in an open-source direct GIS-Bayesian network integration, [
128] describes the creation of an online web server for analyzing geospatial data with Bayesian networks. A commonality between all of these platforms is that, while they treat spatial variables as data that can be visually presented in geographic space, there is no change to the underlying probabilistic model. Therefore, standard approaches to parameter estimation such as the junction tree algorithm and Gibbs sampling are used. Frameworks not explicitly designed for geospatial data such as
bnlearn may be straightforward to integrate with open source GIS software though they do not appear to be able to model spatial autocorrelation. The Bayes Net Toolbox [
98] and at least one other Bayesian modeling framework [
86] used for MCMC for Bayesian networks have the ability to represent spatial processes [
131] within the same model as a Bayesian network, but we are unable to find any existing studies which use these software packages to implement a spatial Bayes net approach. Reference [
117] implements a software framework for combined Monte Carlo/gradient descent procedure for training a wide array of spatial graphical models encompassing spatial Bayesian networks as a special case. General-purpose Bayesian modeling frameworks such as Stan [
19] and PyMC3 [
121] also appear to have all the necessary software components to perform inference for both Bayesian networks and spatial models though there are not any published studies using them for Bayesian networks at this time.
Geographic data indexed with coordinates on the Earth’s surface are abundant in research applications of Bayesian networks for spatial data and are especially common in ecology and environmental science. Both fields are the focus of an especially rich literature in Bayesian modeling [
25] as this framework naturally accommodates inference for a range of flexible modeling forms via MCMC and pool information across multiple noisy sources. Reviews of the usage of Bayesian networks for environmental data are given in [
10,
136]. A salient point made by Uusitalo et al. [
136] is that for spatial or temporal data, using Bayesian networks often requires instantiating a new network or many new nodes for every point in space or time and that this task can be very tedious or may lead to an unwieldy graphical model [
92]. It has been observed [
10] that a commonly used approach is to simply ignore the temporal or spatial coordinates associated with each observation and to consider the dataset as comprising independently distributed data points. This is a frequently adopted independent mode of processing that offers no methodological difficulties in most Bayesian network frameworks. Another advantage of making this independence assumption is that obtaining parameter estimates and generating forecasts on new data is very straightforward via the use of variable elimination or the junction tree algorithm which rely critically on the assumption that variables have a relatively sparse dependency structure. The analyst may also opt to not discard spatial information entirely, but to use axes of spatial variation such as latitude, longitude, and altitude as separate variables on their own [
138] as depicted in Figure
2(A). Usage of Bayesian networks for prediction to generate maps of probabilities or risk is commonplace, having been done to produce raster imagery categorizing regions according to the probability of land use transition in Scotland [
1], deforestation risk in Swaziland [
38], and a per-segment basis for shoreline erosion risk in Vanuatu [
118] and persistence of trout along stream reaches in the western USA [
114]. Using a parallel analysis is also attractive for use with remote sensing; for example, [
69] employed a Bayesian network to estimate leaf area index from Landsat 7 imagery and [
108] performed land use classification for stormwater management. Bayesian networks used in a parallel analysis may also be embedded as part of a larger simulation approach to generate maps of risks or probabilities conditional on the outputs of a physical model [
95].
Interestingly, Bayesian networks can also be used as submodel components in larger, spatially explicit simulations of environmental systems (Figure
3(B)). Reference [
84] shows how to use a Bayesian network to parameterize the flow of information between fishing vessels in an
agent-based model (
ABM), which moves across a spatial grid and interacts with its fish population. Reference [
3] likewise uses a dynamic Bayesian network as part of a spatial simulation of a predator-prey model with rich ecological dynamics. Reference [
71] similarly modeled population change and movement with a Bayes net parameterization of an ABM. As Bayesian networks present desirable properties for risk analysis [
45] and decision-making under uncertainty [
103], they have found substantial use for the prediction of natural hazards and their associated risks. Within this field of study, Bayesian networks have been used to predict wildfire occurrence [
107] and resulting land cover impacts [
127], earthquake risk [
82], urban flood risk [
9], and risk of damage due to avalanches [
53]. The field of epidemiology likewise has produced several new studies generating spatially explicit predictive risk maps [
6,
54,
91] and identifying or predicting disease outbreaks [
11,
65].
All of the studies referenced so far in this section use a
geographic notion of space in which distances are on the order of meters or more and observations can be uniquely linked to locations on Earth’s surface. It is also possible to consider coordinates within non-geographic image data as a valid interpretation of the term “spatial data”. For example, neuroscientists constructed a Bayesian network model over a spatial domain over the surface of the human eye [
135]. Several tasks in computer vision such as modeling image perception [
122], image segmentation [
2,
16,
44], and image classification [
23,
106] can be accommodated within a Bayesian network model. However, over the past decade, research focus has shifted somewhat to more data-driven approaches using deep neural networks [
75,
115], which eschew incorporation of expert knowledge into models structure and instead favor discipline-agnostic best practices and heuristics. Possible directions of advance for further research into methods for analyzing spatial data with Bayesian networks are presented in the next section.