survey

Open access

A Review of Bayesian Networks for Spatial Data

Authors:

Christopher Krapu,

Robert Stewart,

Amy Rose

ACM Transactions on Spatial Algorithms and Systems, Volume 9, Issue 1

Article No.: 7, Pages 1 - 21

https://doi.org/10.1145/3516523

Published: 17 January 2023


Abstract

Bayesian networks are a popular class of multivariate probabilistic models as they allow prior beliefs about conditional dependencies between variables to be easily encoded into their model structure. Due to their widespread usage, they are often applied to spatial data for inferring properties of the systems under study and also generating predictions for how these systems may behave in the future. We review published research on methodologies for representing spatial data with Bayesian networks and also summarize the application areas for which Bayesian networks are employed in the modeling of spatial data. We find that a wide variety of perspectives are taken, including a GIS-centric focus on efficiently generating geospatial predictions, a statistical focus on rigorously constructing graphical models controlling for spatial correlation, as well as a range of problem-specific heuristics for mitigating the effects of spatial correlation and dependency arising in spatial data analysis. Special attention is also paid to potential future directions for the integration of Bayesian networks with spatial processes.

1 Introduction

Despite their relative simplicity, Bayesian networks have found a prominent place in the probabilistic modeling literature due to their attractive properties with regard to inference and interpretation. Also referred to as belief networks or Bayes nets, they are a class of models that borrow formalism from graph theory in order to streamline the process of creating probability models enforcing a desired correlation structure between modeled variables. Formally, a Bayesian network is a model used to parameterize a joint probability distribution P over p random variables in terms of a directed graph \(\mathcal {G}=(\mathcal {X},E)\). Each random variable is represented as a member of a set of nodes \(\mathcal {X}=\lbrace X_1,\ldots ,X_p\rbrace\) and elements of E are edges connecting pairs of nodes. Any two nodes lacking an edge correspond to a pair of random variables which are conditionally independent; the edges themselves are dependency statements asserting that \(X_i \not\perp X_j | \mathcal {X}-\lbrace X_i, X_j\rbrace ,\) where \(\perp\) denotes the conditional independence relation given a set of conditioning variables and \(\not\perp\) denotes its negation. Then \(Pa^\mathcal {G}(X_j)\), the parents of the jth node with respect to \(\mathcal {G}\), are defined as the subset of nodes \(\mathcal {X}\) which are the starting points of an edge in \(\mathcal {G}\) ending in \(X_j\). In the nomenclature of the graphical modeling literature, it is possible to represent nearly all independencies in P via connections between nodes in \(\mathcal {G}\) [93] save for a small set of pathological cases. Later on, we will discuss situations in which we are interested in studying a multivariate spatial process. When this occurs, we use the superscript i to index spatial location and the subscript j to index the process or variable being considered. Thus, \(X^{(i)}_j\) refers to the jth variable observed at the ith location in space.
When dealing with graphical models over a temporal or spatial domain, it is common to use terminology referring to a network template or class which then is instantiated multiple times to allow for distinct random variables representing the same conceptual process to be modeled at different points in time or space representing different conceptual objects which share commonality [73]. For a single data point, the Bayesian network factorization property requires that P can be written as

\begin{equation} P(X_1,\ldots ,X_p) = \prod _{j=1}^{p} P(X_j | Pa^\mathcal {G}(X_j)) , \tag{1} \end{equation}

and thus allows a probability distribution over several random variables to be specified in terms of multiple lower-dimensional conditional probability distributions (CPDs). More generally, there exist undirected graphical models which relax the requirement of directed edges. The range of models and algorithms contained within the scope of graphical modeling is very expansive, including factor analysis, filtering methods, regressions [99], and even variants of deep neural networks used in machine learning [66, 119]. Here, we clarify that the scope of this review article covers those models for which a freeform specification of conditional independencies between variables is possible. This includes, for instance, the commonly-used model formulation often treated as synonymous with “Bayesian network” which employs tabular CPDs for discrete data. It does not, however, include the class of generalized linear models (GLMs) as the conditional independence structure within this class is generally fixed to encode dependence only between covariates and response variables while omitting dependencies between covariates. For Gaussian Bayesian networks, these conditional independence statements correspond precisely to null entries in the inverse covariance or precision matrix.
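To make the factorization in Equation (1) concrete, the following is a minimal sketch for a three-node chain. The network (Vegetation → Rabbit → Fox), the variable names, and all probabilities are hypothetical values chosen purely for illustration; they are not taken from any study cited here.

```python
import itertools

# Hypothetical binary network: Vegetation -> Rabbit -> Fox (values 0/1).
parents = {"Vegetation": (), "Rabbit": ("Vegetation",), "Fox": ("Rabbit",)}

# Tabular CPDs: map a tuple of parent values to P(child = 1 | parents).
# All numbers are made up for illustration.
p_one = {
    "Vegetation": {(): 0.6},
    "Rabbit": {(0,): 0.2, (1,): 0.7},
    "Fox": {(0,): 0.1, (1,): 0.5},
}


def joint_probability(assignment):
    """Evaluate Eq. (1): the product over nodes of P(X_j | Pa(X_j))."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p1 = p_one[node][pa_values]
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob


# Sanity check: the factorized joint sums to one over all 2^3 configurations.
names = list(parents)
total = sum(
    joint_probability(dict(zip(names, values)))
    for values in itertools.product([0, 1], repeat=len(names))
)
```

The joint distribution over three binary variables would need seven free parameters in general; the chain above uses only five, which is the kind of saving the factorization is meant to provide.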

Bayesian networks are closely related to structural equation models [41] and are distinguished from other types of graphical models by the usage of directed edges instead of undirected, symmetric edges to make more specific statements about conditional independencies between variables resulting from the graphical structure. Whereas undirected graphical models such as the Ising/autologistic model [14] and the Boltzmann machine [56] do not make statements about the direction of edges, Bayesian networks may use this directionality to encode causal information [110]. While it may not be the case that every Bayesian network analysis requires causal interpretation of the links, it is often true that these are used to represent causal mechanisms [64]. A sparser Bayesian network graph with fewer edges implies that there are fewer cross-variable correlations. Consequently, a sparsely connected Bayesian network composed of multiple smaller subgraphs implies a joint probability distribution that is more easily factorized into a small number of low-dimensional CPDs. This modularity is greatly helpful in statistical inference as many algorithms such as the junction tree method [70, 109] and Gibbs sampling [58] are either easier to implement or computationally less expensive with additional conditional independence assumptions.
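The link noted earlier between missing edges and zero entries of the precision matrix in Gaussian Bayesian networks can be checked numerically on a small case. The sketch below assumes a linear Gaussian chain \(X_1 \rightarrow X_2 \rightarrow X_3\) with hypothetical edge weights and noise variances; for this chain, the absent edge between \(X_1\) and \(X_3\) shows up as a zero in the precision matrix even though the two variables remain marginally correlated.

```python
import numpy as np

# Hypothetical linear Gaussian chain X1 -> X2 -> X3.
# B[child, parent] holds the regression weight of child on parent.
B = np.array([
    [0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [0.0, -0.5, 0.0],
])
noise_var = np.array([1.0, 0.5, 2.0])  # made-up conditional variances

# With X = B X + e and e ~ N(0, diag(noise_var)), the precision matrix is
# (I - B)^T diag(1 / noise_var) (I - B).
identity = np.eye(3)
precision = (identity - B).T @ np.diag(1.0 / noise_var) @ (identity - B)
covariance = np.linalg.inv(precision)

# precision[0, 2] is zero (no edge between X1 and X3), while
# covariance[0, 2] is not (X1 still influences X3 through X2).
```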

There is no restriction on the support of the random variables used, though discrete variables are often favored due to the existence of computationally efficient algorithms for exact inference in discrete Bayesian networks, such as the junction tree algorithm [79] and variable elimination. While the junction tree algorithm can be extended to the class of conditional linear Gaussian models for continuous data [78], there appears to be no general-purpose and computationally efficient exact inference method for continuous Bayesian networks in a manner analogous to discrete Bayesian networks. In the discrete setting, perhaps the most commonly used parametrization of a Bayesian network CPD is a tabular one in which every element of \(Val(Pa^{\mathcal {G}}(X_j))\) is associated with a distinct CPD for the values of \(X_j\). Another benefit of Bayesian networks is that their creation usually involves a structural decomposition of a probability distribution over several dimensions into lower-dimensional conditional distributions. This bears close parallels to methods naturally used to improve decision-making via the decomposition of problems into smaller pieces [7]. Bayesian networks bear strong similarities to earlier attempts to introduce uncertainties into rule-based systems [64].

A common workflow for modeling data with Bayesian networks involves the steps of identifying a suitable model structure, applying prior information, performing parameter estimation, interpreting the parameters, and/or generating predictions for missing data or altogether new data points. These steps are explained in more detail in the context of spatial modeling below. A summary of the notation commonly used in this work is provided in Table 1.

Table 1. Notational Definitions

Symbol		Definition
\(\mathcal {G}\)		Directed acyclic graph \(\mathcal {G} = (\mathcal {X}, \mathcal {E})\) indicating dependencies between variables in \(\mathcal {X}\) with edges in \(\mathcal {E}\)
\(\mathcal {X}\)		Set of \(p\) random variables \(\lbrace X_1,\ldots ,X_p\rbrace\)
\(Pa^{\mathcal {G}}(\cdot)\)		Function mapping variables in \(\mathcal {X}\) to their parents as specified by DAG \(\mathcal {G}\)
\(\mathcal {S}\)		Set of spatial locations \(\lbrace s_1,\ldots ,s_n\rbrace\)
\(Val(X_j)\)		Sample space for random variable \(X_j\)
\(Val(Pa^{\mathcal {G}}(X_j))\)		Sample space comprising the Cartesian product of outcomes of variables in \(Pa^{\mathcal {G}}(X_j)\)
\(X(s_i)\) or \(X^{(i)}\)		Random process \(X\) instantiated at location \(s_i\)

Selecting a Structure. At the beginning of the modeling process, the analyst must determine which of the variables \(X_1,\ldots ,X_p\) are conditionally independent given all other variables. As the conditional independence relation is symmetric, there are \(p(p-1)/2\) possible conditional independence statements and a graph structure \(\mathcal {G}\) is created by selecting pairs of variables to be linked by a directed edge. If no sensible graph can be constructed a priori, or if it is desired that the graph itself be left as a free parameter, then structure learning must be employed to identify a suitable graph on the basis of an optimality criterion such as the Akaike or Bayesian information criterion, or the Bayesian Dirichlet-equivalent uniform score [55]. When working with spatial data, a complication that arises is that there does not appear to be a standard recipe for selecting a structure over both cross-variable and spatial modes of correlation. One may instantiate a Bayesian network for which every variable at every spatial location receives a node in the graph, but this implies an enormous space of potential graph structures. This can lead to a challenging graph selection problem even if only a single variable is used across a large spatial domain; we refer the reader to [123] for an example of structure learning under these conditions. Additionally, the edges in \(\mathcal {G}\) are directed, implying a parent-and-child ordering which may not conform with the commonly-held assumption in spatial statistics that spatial correlation is most easily captured via an undirected, unordered correlation model.
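As an illustration of score-based structure selection, the sketch below computes a Bayesian information criterion (BIC) score for a discrete network with tabular CPDs and compares two candidate graphs. The function name `bic_score`, the synthetic data, and the two-variable setup are assumptions made for this example; a real analysis would typically search over many structures using a dedicated library.

```python
from itertools import product

import numpy as np


def bic_score(data, cards, parents):
    """BIC of a discrete Bayesian network with tabular CPDs.

    data: (n_obs, p) integer array; cards: number of levels per variable;
    parents: dict mapping each column index to a tuple of parent columns.
    """
    n_obs = data.shape[0]
    score = 0.0
    for child, pa in parents.items():
        pa = list(pa)
        # Loop over every joint configuration of the parents.
        for pa_vals in product(*[range(cards[q]) for q in pa]):
            if pa:
                mask = np.all(data[:, pa] == pa_vals, axis=1)
            else:
                mask = np.ones(n_obs, dtype=bool)
            counts = np.bincount(data[mask, child], minlength=cards[child])
            nonzero = counts[counts > 0]
            # Maximized multinomial log-likelihood for this CPD row.
            score += np.sum(nonzero * np.log(nonzero / counts.sum()))
        n_rows = int(np.prod([cards[q] for q in pa]))
        # Penalty: (log n / 2) per free parameter of the tabular CPD.
        score -= 0.5 * np.log(n_obs) * n_rows * (cards[child] - 1)
    return score


# Synthetic data in which X1 copies X0 about 90% of the time.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=500)
flip = rng.random(500) < 0.1
x1 = np.where(flip, 1 - x0, x0)
data = np.column_stack([x0, x1])

score_with_edge = bic_score(data, [2, 2], {0: (), 1: (0,)})
score_without_edge = bic_score(data, [2, 2], {0: (), 1: ()})
```

With data this strongly dependent, the graph containing the edge should receive the higher score; for spatial problems the same scoring would have to be repeated over a far larger set of candidate graphs.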

Choosing Conditional Probability Distributions. Once a suitable Bayes net graph has been identified, the next step is to construct functions mapping the outcomes of parent variables to probabilities of conditional outcomes of child variables dependent upon a CPD parameter vector \(\mathbf {\theta }\). For data comprising discrete parent and discrete child variables, a common choice appears to be a tabular CPD for which every element in \(Val(Pa^{\mathcal {G}}(X_j))\) has its own \(\vert Val(X_j)\vert\)-dimensional multinomial probability distribution for the values assumed by \(X_j\). With a tabular CPD, it is also possible to set or constrain [102] some of the parameter values to reflect prior information. For modeling spatial data in general, it is often desirable to allow for correlation between spatially proximal random variables. Therefore, a central question in the application of Bayesian networks to spatial data is the development of methods to produce CPDs which retain desirable properties such as simplicity and interpretability, while also modeling spatial correlation.
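One reason spatial correlation is hard to capture with plain tabular CPDs is that the table grows with the product of the parent cardinalities. The short sketch below counts the free parameters of a tabular CPD; the helper name is hypothetical and the neighbor counts are only illustrative.

```python
from math import prod


def tabular_cpd_free_parameters(child_cardinality, parent_cardinalities):
    """Free parameters of a tabular CPD.

    There is one multinomial distribution per element of Val(Pa(X_j)),
    each with (|Val(X_j)| - 1) free probabilities.
    """
    n_rows = prod(parent_cardinalities)  # prod of an empty list is 1
    return n_rows * (child_cardinality - 1)


# A binary node whose parents are its k binary spatial neighbors.
sizes = {k: tabular_cpd_free_parameters(2, [2] * k) for k in (0, 4, 8)}
```

Under these assumptions, moving from four to eight binary neighbors raises the count from 16 to 256 parameters for a single node, which motivates CPDs with more structure than a full table.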

Parameter Estimation. The selection of BN parameters \(\mathbf {\theta }\) from a set of possible values \(\Theta\) for the CPDs from the previous section with the aid of observed data \(\mathbf {X}\) can be done via the calculation of an estimator as a solution to the following problems:

Maximum likelihood \(\mathop{\text{argmax}}\limits _{\mathbf {\theta } \in \Theta } p(\mathbf {X}\vert \mathbf {\theta })\)

Maximum a posteriori \(\mathop{\text{argmax}}\limits _{\mathbf {\theta } \in \Theta } p(\mathbf {X}\vert \mathbf {\theta })p(\mathbf {\theta })\)

Bayes estimator (mean square error) \(E[\mathbf {\theta }\vert \mathbf {X}],\) \(p(\mathbf {\theta }\vert \mathbf {X}) \propto p(\mathbf {X}\vert \mathbf {\theta })p(\mathbf {\theta })\)

with the posterior mean \(E[\mathbf {\theta }\vert \mathbf {X}]\) yielding the solution to minimizing the mean squared difference between true and estimated values of \(\mathbf {\theta }\), integrated over the prior distribution. Depending on the specific assumptions made for the Bayesian network, closed-form solutions may exist for any of the above problems. When there are missing data, expectation-maximization and data augmentation algorithms [129] are frequently used for maximum likelihood and posterior mean estimation, respectively. In general, a variety of techniques suitable for general-purpose Bayesian modeling, such as Gibbs sampling [58] and variational Bayes [15], are also applicable to posterior inference in Bayesian networks. A notable challenge with spatial data is that when the assumption of independent and identically distributed data is relaxed to allow for spatial correlation, closed-form solutions may no longer exist. Furthermore, a very large range of algorithms for estimating the parameters of directed or partially directed graphical models exists. A broad review of parameter estimation for partially directed graphical models and Bayesian networks can be found in [120].
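For a single row of a tabular CPD with complete, independent observations, all three estimators above have closed forms when a Dirichlet prior is assumed. The sketch below is one such case; the Dirichlet prior, the function name, and the example counts are assumptions for illustration only.

```python
import numpy as np


def estimate_cpd_row(counts, alpha):
    """Closed-form estimates for one multinomial CPD row.

    counts: observed outcome counts for a fixed parent configuration.
    alpha:  parameters of an assumed Dirichlet prior over the row.
    Returns (maximum likelihood, maximum a posteriori, posterior mean).
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    posterior = counts + alpha  # the Dirichlet prior is conjugate

    mle = counts / counts.sum()
    posterior_mean = posterior / posterior.sum()
    if np.any(posterior <= 1.0):
        raise ValueError("the Dirichlet mode formula needs all parameters > 1")
    map_estimate = (posterior - 1.0) / (posterior - 1.0).sum()
    return mle, map_estimate, posterior_mean


mle, map_estimate, posterior_mean = estimate_cpd_row([3, 1, 0], [2, 2, 2])
```

Note that the maximum likelihood estimate assigns zero probability to the unobserved third outcome, whereas both Bayesian estimates do not. Once spatial correlation links rows across sites, these simple count-based formulas no longer apply directly.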

Interpretation. The values of the estimated parameters may convey information of interest regarding dependencies between variables and, therefore, are often the target of interpretation. For tabular CPDs, the conditional probability parameter estimates are interpretable as-is; parameter estimates yielding highly concentrated CPDs may convey additional conditional dependencies beyond those implied by the network’s graph structure.

Prediction. Much of the applied work covered later in this article emphasizes prediction as a key task for which Bayesian networks are employed. This may involve filling in missing pieces of information for partially observed data, or generating predictions for new data. For geospatial analysts, it can be very convenient to have simple and efficient routines for generating predictions at every point in a large spatial domain, necessitating integration with a geographical information system (GIS) application.

Practitioners can benefit from spatial modeling by producing more reliable estimates of parameters obtained during inference as well as generating more accurate predictions. During parameter estimation, explicitly including spatial priors for parameters or error processes can help account for unobserved processes smoothly varying over space [133]. This addition is known to reduce bias in estimates of model parameters in parametric models such as linear regressions [80] and is likely to provide better estimates for parameters associated with Bayesian network models as well.

Related Work. For a more extensive coverage of the theory and application of Bayesian networks, we recommend the tutorial and overview papers by [20] and [112] as well as books and book chapters by [109], [74], and [72]. Existing review articles for Bayesian networks applied to environmental risk assessment [68], resource management [10], ecosystem services [76], and environmental modeling [4] have substantial overlap with this work as spatial data is highly prominent in the ecological and environmental sciences. Additionally, we note that the material covered in this review shares much in common with existing commentary on Bayesian network-GIS integration [67]; in this work, we do not focus on model-software coupling, instead framing our review around the models and workflows used by researchers to match the advantages of Bayesian networks to the problem at hand involving spatial data. However, the scope of research covered by this review is not as large as the most general class of models described in part as Bayesian networks; we omit discussion of dynamic Bayesian networks (DBNs) as there exists a substantial research and pedagogical literature on spatiotemporal Bayesian modeling [30, 140]. We note that for researchers beginning to work with spatiotemporal data, it may not be entirely clear precisely where and why spatial and temporal data must be approached with different modeling strategies. This confusion may be further compounded by shared terminology for modeling techniques such as temporal autoregressions in the time series literature and conditional autoregressions in the spatial statistical literature. As an aside, we note that DBNs have been applied to data with spatial domains, albeit in only a single dimension [97].
For readers unfamiliar with the nomenclature used for classifying graphical models, DBNs [51, 96] subsume many temporal probabilistic models including hidden Markov models and linear Gaussian models which are amenable to closed-form exact Bayesian inference using methods such as the Viterbi algorithm or Kalman filtering. These efficient algorithms are made possible by the fact that a causal directionality across time can naturally be inferred. Thus, such systems are easily represented as directed graphical models. However, existing probabilistic models of spatial systems such as conditional autoregressions and Gaussian processes (GPs) do not share the same property because, in general, it is not reasonable to assume that a total ordering exists on a set of spatial coordinates associated with observations. Put more plainly, there is no a priori reason to believe causality moves from east to west in the same way that it flows from past to present. This precludes naive usage of methods designed for temporal phenomena for spatial data and thus explains why the two subjects diverge into distinct subfields; the former assumes a natural directed graphical structure due to the asymmetry of time flowing from past to present, while the latter favors probability models capable of respecting the homogeneity and isotropy of space. Special cases may deviate from this rule; if there is known to be a strong directionality associated with certain spatial axes due to differences in elevation or consistent atmospheric patterns, it may make sense to assume some type of ordering amongst locations. Another exception to this rule is found in [34], in which a spatial model is created by averaging over all possible orderings of spatial locations to erase dependency upon a single ordering (Figure 3(C)).

We briefly describe some alternative approaches to probabilistic modeling of spatial data. As the usage of spatial Bayesian networks intersects substantially with spatial statistics, we find it helpful to recall that within the field of spatial statistics, spatial data usually fall into one of three categories [29]: areal data which is located on a lattice or network, geostatistical data which exhibits cross-correlation dependent on interpoint distances, and point pattern data in which the location of each point is itself a stochastic process. Examples of spatial statistical models designed for each type of data include autoregressions [12], GPs/kriging [90], and Poisson processes, respectively. Recently, analysis of richer functional spatial data such as movement trajectories has also become an object of study within spatial statistics [57]. It is important to note that while many usages of spatial models deal with geographical or physical notions of space, spatial models are generally appropriate for data that has coordinates in any space endowed with a distance metric. Some representative scenarios include data in non-Euclidean spaces [31] with a treelike topology [62, 137] or data associated with angular coordinates. GLMs with adjustments to include spatially correlated random effects [14] or spatially correlated error variables are a mainstay in spatial econometrics [5] and in spatial Bayesian modeling [13]. The primary differentiating factor between spatial GLMs and Bayesian networks is that the former model only the dependencies between the covariates and the response variable, while Bayesian networks typically model the joint correlation structure between all variables simultaneously; the relative simplicity of spatial GLMs can nonetheless be attractive for cases in which Bayes net features are unnecessary.
We note that geostatistical GP models are closely related to spatial GLMs as the prior distribution for a spatial component of a GLM can be represented using a GP defined on a spatial domain [37]. When a large number of variables are observed simultaneously at many locations, it can be desirable to produce a lower-dimensional summary of the dataset in terms of spatially-smoothed principal components or factor loadings; spatial PCA [35] and spatial factor analysis [132] extend preexisting models to allow for spatial autocorrelation among factors. Spatial variants of machine learning models including neural networks [125] and random forests [50, 139] also exist; these models typically do not admit a simple interpretation of their parameters and also have little facility for handling missing data in comparison to Bayesian networks.

The objective of the remainder of this article is to help practitioners of probabilistic modeling for spatial data further understand existing research as it relates to the usage of Bayesian networks and identify promising future areas of research to help advance the field. To this end, we have structured this review article into sections separately discussing Bayes net models designed for spatial data, areas of application, and an analysis of promising research opportunities on the topic. A major challenge in using Bayesian networks absent strong prior knowledge is the identification of the network structure from data. This review will not include an in-depth discussion of structure learning despite its prominence [28, 40, 46] in research on Bayesian networks. Since spatial data is naturally structured, albeit in an undirected or symmetric manner, it does not appear to pose any special risks or opportunities for learning an unknown graphical structure from data. Additionally, we have been careful to screen out work that uses the descriptor spatial without any substantial content relevant to spatial probability models or a spatial data workflow.

To aid the reader in understanding the myriad ways that researchers and practitioners have made use of Bayesian networks with spatial data, we constructed a pair of diagrams (Figures 2 and 3) with reference to a hypothetical scenario described in Figure 1, which captures the main dichotomy we find in our review; we can cleanly separate most work covered into (1) modifications of probability models to be faithful to spatial correlations exhibited in the data, and (2) modeling workflows using mostly off-the-shelf Bayesian network tools, with adjustments to the type of data or number of models employed to capture some aspects of the spatial nature of the problem.

Fig. 1.

2 Bayesian Network Models for Spatial Data

The diversity of scientific fields which have used Bayesian networks for applied work is matched by the range of methods employed to integrate spatial information into their models. Broadly speaking, these can be grouped into two classes: those that use pre-inference or post-hoc methods for including spatial data, and those that directly build it into the correlation structure of a probabilistic model and thus require methods for inference which are aware of this model structure. These two categories correspond to the spatial workflows (Figure 2) and model modifications (Figure 3). A major point in favor of using Bayesian networks for applied studies is that their directed acyclic graphical representation often bears close similarities to the mental picture created by non-expert users and is therefore relatively easy to understand. With this fact in mind, it is natural for spatial data analysts to attempt to retain as many attractive features of Bayesian networks as possible when working with spatial data, sometimes leading to ad hoc model or workflow modifications attempting to account for spatial autocorrelation without creating a new probability model that directly represents autocorrelation such as an autoregressive or GP model. This is not to imply that this class of users is not aware of the issues posed by such an approach; rather, they have often found methods for analyzing data that circumvent this issue of model design, or cases in which the shortfalls of a non-model approach are not especially damaging. We find that the above phenomenon is precisely the case in the analysis of interspecies ecological interaction networks as presented by [134]. In that work, the authors use a spatial average as a parent variable despite this inducing a cyclic dependency structure.
The authors note that in real-world applications, failure to account for spatial autocorrelation leads to suboptimal predictive accuracy likely due to a global model not making use of informative local data, especially if model predictions are not spatially consistent with the local neighborhood (Figure 4). However, even when modeled forecasts or predictions are well matched to prior beliefs about spatial consistency, it may be desirable to use spatial reasoning to simplify model structure and reduce the number of nodes within the network. In general, we can view this process as starting out with a full graph over one or more spatially indexed variables and making assumptions about edges that are allowed or disallowed on the basis of proximity [18, 32, 83, 122, 135]. This procedure can result in graphs with large numbers of nodes and edges [22], which are still amenable to rapid inference if the number of parents and children per node is relatively low (Figure 2(C)).
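The idea of pruning a full graph by proximity can be sketched as follows. The function name, the distance threshold, and the rule of orienting every edge from the lower to the higher site index are assumptions introduced here to keep the result acyclic; the works cited above may orient or select edges differently.

```python
import numpy as np


def proximity_edges(coords, radius):
    """Candidate directed edges between sites no farther apart than `radius`.

    Edges always point from the lower to the higher site index, an arbitrary
    ordering assumed here so that the resulting graph has no directed cycles.
    """
    coords = np.asarray(coords, dtype=float)
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n_sites = len(coords)
    return [
        (i, j)
        for i in range(n_sites)
        for j in range(i + 1, n_sites)
        if dists[i, j] <= radius
    ]


# Sites on a 3 x 3 unit grid, keeping only edges between adjacent cells.
grid = [(row, col) for row in range(3) for col in range(3)]
edges = proximity_edges(grid, radius=1.0)
n_pairs_full_graph = len(grid) * (len(grid) - 1) // 2
```

On this small grid the proximity rule keeps 12 of the 36 possible pairs, and no node has more than two parents, which is the property that keeps inference tractable as the grid grows.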

Fig. 2.

Fig. 3.

Fig. 4.

2.1 Spatial Information as Nodes

A simple option for including spatial information in the modeling process is to encode each data point’s spatial coordinates as variables. Reference [124] used latitude and longitude directly as parent variables. Another method for encoding spatial information about proximity to features of interest is to use a kernel smoother or similar operation to translate point-referenced data into a field (Figure 2(D)) with values defined at every point in space; Reference [81] used this strategy to allow geographic distance from seismic hazards to inform a Bayes net model of catastrophic risk due to natural disasters.
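A kernel smoother of the kind described above might look like the following sketch. The Gaussian kernel, the bandwidth, the bin edges, and the coordinates are all assumptions chosen for illustration and do not reproduce the procedure of [81].

```python
import numpy as np


def kernel_field(grid_points, feature_points, bandwidth):
    """Sum of Gaussian kernels centered on point features.

    Returns one value per grid point; larger values mean closer proximity
    to the features (for example, hazard locations).
    """
    grid_points = np.asarray(grid_points, dtype=float)
    feature_points = np.asarray(feature_points, dtype=float)
    diffs = grid_points[:, None, :] - feature_points[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / bandwidth ** 2).sum(axis=1)


# One hypothetical feature at the origin, evaluated at two locations.
field = kernel_field([[0.0, 0.0], [3.0, 0.0]], [[0.0, 0.0]], bandwidth=1.0)

# Discretize the field so it can serve as a parent node with a tabular CPD.
proximity_level = np.digitize(field, bins=[0.1, 0.5])  # 0 = far, 2 = near
```

The discretized field can then be attached to each data point as an ordinary covariate, which is why this approach requires no change to the Bayesian network software itself.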

2.2 Spatially-Averaged Priors

Multiple studies [43, 54, 134] within the ecology literature have dealt with the issue of spatial autocorrelation by beginning with a Bayesian network for data over multiple sites and adding nodes representing local or neighborhood averages which act as parents for the site-specific node value (Figure 2(B)). For example, [43] built a population model which used a distance-weighted local population average as the parent node for a site’s population. There are several advantages to doing this; it can be done as a preprocessing step and it is conceptually simple. However, we see a number of disadvantages incurred by this approach. From a theoretical standpoint, this implies that for two locations \(s_1\) and \(s_2\) and the local average random variable Z acting as a parent of both \(X(s_1)\) and \(X(s_2)\), \(X(s_1)\) must be calculated as a function of \(X(s_2)\) and simultaneously \(X(s_2)\) is a function of \(X(s_1)\). This symmetry between \(s_1\) and \(s_2\) must necessarily imply that if there is a directed edge from \(X(s_1)\) to \(X(s_2)\), then there must also be the same in reverse, creating a directed cycle of length 2 (Figure 5). This violates the core BN assumption that the graphical model does not contain cycles of any length. When this occurs, it is more appropriate to use inference schemes that work well for directed cyclic graphs, such as loopy belief propagation [59]. From a more practical standpoint, using a spatially smoothed parent node poses another obstacle in that it does not provide a fast, exact recipe for generating predictions at new sites which are lacking informative neighbors. In the example previously listed, suppose sites \(s_1\) and \(s_2\) are adjacent with unknown values of a single random variable X which must be predicted.
Since \(X(s_1)\) and \(X(s_2)\) are parents of each other, we are unable to use forward sampling, i.e., sampling the marginal distribution \(X(s_1)\) followed by \(X(s_2) \vert X(s_1)\), to generate draws from the joint distribution of \(X(s_1), X(s_2)\). Forward sampling is a computationally cheap and easy-to-apply Monte Carlo method for creating a predictive distribution for X at \(s_1\) and \(s_2\); without it, we would have to resort to Gibbs sampling or another iterative technique to draw predictions for X. Since this functionality is not present in many commonly used Bayesian network platforms, users are restricted to generating predictions on subsets of locations comprising sites which have at most one missing value within their local neighborhood.
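The dependence of forward sampling on an acyclic ordering can be seen in a short sketch. It relies on the standard-library `graphlib` module (Python 3.9 or later) to obtain a topological order; the node names and probabilities are hypothetical.

```python
import random
from graphlib import CycleError, TopologicalSorter


def forward_sample(parents, p_one, rng):
    """Draw one joint sample from a binary network by sampling parents first.

    parents: dict mapping node -> tuple of parent nodes.
    p_one:   dict mapping node -> {parent values: P(node = 1 | parents)}.
    Raises graphlib.CycleError when no parent-first ordering exists.
    """
    order = list(TopologicalSorter(parents).static_order())
    sample = {}
    for node in order:
        pa_values = tuple(sample[q] for q in parents[node])
        sample[node] = int(rng.random() < p_one[node][pa_values])
    return sample


# Acyclic case: X(s1) is sampled first, then X(s2) given X(s1).
dag = {"X_s1": (), "X_s2": ("X_s1",)}
cpds = {"X_s1": {(): 0.3}, "X_s2": {(0,): 0.1, (1,): 0.8}}
draw = forward_sample(dag, cpds, random.Random(0))

# Mutual parents, as induced by a shared local average: no valid ordering.
cyclic = {"X_s1": ("X_s2",), "X_s2": ("X_s1",)}
try:
    forward_sample(cyclic, {}, random.Random(0))
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

In the cyclic case there is no node whose parents are all already sampled, so the procedure cannot start; this is the point at which an iterative sampler such as Gibbs sampling becomes necessary.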

Fig. 5.

2.3 Regional Grouping

An alternative approach for spatial data which is very easy to implement is to take spatially grouped observations and construct individual Bayesian network models for each group (Figure 2(E)). This is the approach taken in [24] for modeling the economic value of trees in different zones demarcated by levels of air pollution and in [65] for detecting latent disease outbreaks. A major challenge for spatial modelers is to identify a partitioning of the spatial domain into a set of observational units which are sensible for modeling. It is known [63] that the modifiable areal unit problem of differing inferences depending upon the scale of aggregation can pose an obstacle to an objective, reproducible analysis. While this is often raised as an issue in more conventional spatial statistical analyses, Bayesian networks are also subject to this phenomenon [94]. A grouped data approach has the advantage that it will tend to respect local spatial structure better than a single global model and it can be done by simply repeating the analysis that would have otherwise been done on the entire dataset for multiple repetitions. The primary drawbacks are threefold. First, the grouping or partition must be known ahead of time, and second, even though individual observations may be spatially proximal, if they are divided into different groups, there will be no prior notion of spatial continuity encoded into their joint distribution. Finally, treating separate groups of data as independently arising from separate data-generating processes substantially increases the number of parameters and does not allow for the pooling of information across groups; this pooling can be especially useful when the number of observations per group is relatively small or there is a substantial amount of measurement noise. 
While it does not employ Bayesian networks in a spatial context, the multilevel Bayes net framework developed by [77] implements this grouping approach but also pools information from across groups in a manner akin to a multilevel regression. For example, in the context of the toy problem from Figure 1, one may wish to study several different habitat regions for which the exact nature of the interaction between fox and rabbit may be somewhat different, i.e., in a forest, the rabbit is less threatened by the fox than in a flat grassland or in a wetland. Then, it would be desirable to have several Bayesian network models with parameters linked by a hierarchical prior so that the per-region CPD parameters are centered around a common average. This offers multiple advantages: when data is relatively sparse for each of the individual regions, they may share information to reduce the variance of parameter estimates. Another advantage is that, relative to a single model for all regions’ data, it allows for variation across the per-region CPD parameters so that an analyst may attempt to infer something about the relative difficulty experienced by a fox in hunting a rabbit in each region. In a regression modeling framework, the analogy would be to have a hierarchical prior over the slopes and intercepts of a linear model instantiated at each region.
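The effect of pooling across regions can be illustrated with a deliberately simplified stand-in for a hierarchical prior. The sketch below gives each region a Dirichlet prior centered on the pooled outcome frequencies with a fixed, user-chosen strength; this is an assumption made for illustration and is not the estimation procedure of [77], which would also infer the degree of pooling from the data.

```python
import numpy as np


def partially_pooled_rows(region_counts, strength):
    """Per-region CPD rows shrunk toward the pooled outcome frequencies.

    region_counts: (n_regions, n_outcomes) array of observed counts.
    strength:      pseudo-count weight of the shared prior; 0 means no
                   pooling, and large values approach complete pooling.
    """
    counts = np.asarray(region_counts, dtype=float)
    pooled = counts.sum(axis=0) / counts.sum()
    posterior = counts + strength * pooled
    return posterior / posterior.sum(axis=1, keepdims=True)


# Hypothetical counts of (rabbit caught, rabbit escaped) in three regions;
# the second region has very little data.
counts = [[8, 2], [1, 1], [30, 70]]
unpooled = partially_pooled_rows(counts, strength=0.0)
shrunk = partially_pooled_rows(counts, strength=10.0)
```

The sparsely observed second region moves most strongly toward the common average, while the well-observed third region changes little, mirroring the variance reduction described above.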

While many applications of multilevel modeling only consider a two-level hierarchy of random variables clustered around a common average, there is no methodological obstacle to using further levels of aggregation or refinement. Tree-structured Bayesian networks [16, 44] make use of this fact to build up a spatial field as a recursive splitting (Figure 3(A)) of nodes at a coarsely resolved spatial scale into more and more finely resolved nodes. Extensions to this model [2] allow for structure learning of the depth and location of variable splittings.

2.4 Partially Directed Graphical Models

With the proliferation of approximate Bayesian parameter estimation methods such as Markov chain Monte Carlo (MCMC), nested Laplace approximations [116] and black-box variational inference [113] that can be applied to models across a wide range of distributional and structural assumptions, a natural avenue of research would be the synthesis of novel model forms which combine the best attributes of Bayesian networks and spatial statistical models. As the former consist of directed acyclic graphs and the latter often use an undirected (Figure 2(A)) representation, mixed graphical models with both directed and undirected connections are of special interest. These models are typically described as chain graphs or partially directed acyclic graphs (PDAGs), with the term chain referring to subsets of variables linked together with undirected edges. In a crude sense, we can think of the chains as forming the interstitial glue that binds together the directed Bayesian network components. If these chains happen to correspond to random variables with spatial correlation structure, then it is possible to construct a BN-like model with directed inter-variable connections and allow for inter-site correlation between observations of the same variable at different locations using isomorphic chain graphs [52]. Informally, these graphs are described as isomorphic because the per-site graphical structure is preserved across sites (Figure 2(B)), though local variations in their conditional probability functions are allowed. In situations involving spatial data, it may not always be immediately clear which level of the model should be endowed with a spatial correlation structure. Fortunately, related work [60, 61] shows how to select variables for spatial modeling based on whether they are exogenous or endogenous to the system of study.
The previous results were shown mostly in the context of chain graph models for real-valued data; Reference [21] extends this work to include joint modeling of both discrete and continuous random variables with varying types of correlation structure. The integration of Bayesian networks with undirected graphical models via chain graphs enables the synthesis of novel model forms which have much richer graphical structure than before. With the proliferation of large datasets obtained using spatially distributed sensing, there is substantial research interest in developing methods for analyzing spatial data with computing requirements that scale well [33, 126]. Integration of Bayesian networks with scalable geospatial models could be an attractive area of research for users desiring both interpretable graphical structure and spatial pooling of information.

The combination of directed and undirected model components naturally integrates the DAG-structured Bayesian network with an undirected spatial model. However, we note that there are advances in purely undirected graphical models that are similar in spirit to the types of analyses done with Bayesian networks, in the sense that multiple variables are modeled across space with a sparse cross-variable covariance structure. The literature on Gaussian models for real-valued data includes instances of researchers exploring covariance structure (Figure 3(D)) factorized as the Kronecker product of a spatial and a cross-variable covariance matrix [39, 49] or as a similarly factorized space-variable graph product [36].
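
A separable covariance of this kind can be sketched in a few lines. The example below is illustrative only and is not taken from [39, 49]: the site coordinates, the exponential kernel with its `length_scale`, and the entries of the cross-variable matrix `K_var` are assumed values.

```python
import numpy as np

# Hypothetical coordinates for n = 4 sites (illustrative values only).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
n = coords.shape[0]
length_scale = 1.5  # assumed range parameter of the exponential kernel

# Spatial covariance between sites: exponential kernel on Euclidean distance.
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
K_space = np.exp(-dists / length_scale)

# Cross-variable covariance for p = 2 variables (assumed values).
K_var = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

# Separable covariance of the stacked vector, ordered variable-major:
# variable 0 at sites 0..n-1, followed by variable 1 at sites 0..n-1.
K_full = np.kron(K_var, K_space)


def cov_entry(j, i, j2, i2):
    """Covariance between variable j at site i and variable j2 at site i2."""
    return K_full[j * n + i, j2 * n + i2]


print(K_full.shape)
print(cov_entry(0, 1, 1, 3), K_var[0, 1] * K_space[1, 3])
```

Every entry of the joint covariance is the product of one cross-variable term and one spatial term, so only the two small matrices need to be stored or parameterized. The price of this economy is the assumption that all variables share the same spatial correlation function.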

Spatial structural equation models with PDAG structures commonly used in the social sciences [85] are very similar to Bayesian networks [111] as both methods presuppose the existence of a graphical dependency structure, which constrains the possible covariance matrices relating to the system variables. The two differ primarily in their interpretation and usage; structural equation models are often mentioned in the context of latent variables which represent unobserved sources of variation. Furthermore, structural equation models are often used for continuous data in a linear Gaussian framework while Bayesian networks are commonly used with discrete data. However, it must be noted that there does not appear to be any fundamental reason for this discrepancy. Spatial methods have found application in structural equation modeling via spatially smoothed priors on latent variables [26, 105]. In each of these cases, the spatial prior is assumed to be a Markov random field. The next section highlights applications of spatial Bayesian network models to a range of application areas.

3 Application Areas

As an overview article, the chief aim of this document is to help researchers identify connections and links to other disciplines which may use spatial Bayesian networks in a similar fashion. In this section, we discuss the various software tools and types of applications considered for these types of analyses.

An important factor in the widespread use of Bayesian networks in the applied sciences is the existence of several software packages focused on model construction, inference, and interpretation. Commercial software platforms which exist for modeling data with Bayesian networks include Hugin [88], Analytica, and Netica [141]. Some of these platforms include direct integration with GIS software for analyzing spatial data. Proprietary software Hugin has a plugin for QGIS, Netica has its own built-in GIS, and BayesiaLab [27] has partial integration with Google Maps. Standalone software packages such as bnspatial [89] offer useful utilities for directly ingesting raster and shapefile data into a Bayesian network learning and prediction framework. For users interested in an open-source direct GIS-Bayesian network integration, [128] describes the creation of an online web server for analyzing geospatial data with Bayesian networks. A commonality between all of these platforms is that, while they treat spatial variables as data that can be visually presented in geographic space, there is no change to the underlying probabilistic model. Therefore, standard approaches to parameter estimation such as the junction tree algorithm and Gibbs sampling are used. Frameworks not explicitly designed for geospatial data such as bnlearn may be straightforward to integrate with open source GIS software though they do not appear to be able to model spatial autocorrelation. The Bayes Net Toolbox [98] and at least one other Bayesian modeling framework [86] used for MCMC for Bayesian networks have the ability to represent spatial processes [131] within the same model as a Bayesian network, but we are unable to find any existing studies which use these software packages to implement a spatial Bayes net approach.
Reference [117] implements a software framework for a combined Monte Carlo/gradient descent procedure for training a wide array of spatial graphical models encompassing spatial Bayesian networks as a special case. General-purpose Bayesian modeling frameworks such as Stan [19] and PyMC3 [121] also appear to have all the necessary software components to perform inference for both Bayesian networks and spatial models, though there are not any published studies using them for Bayesian networks at this time.

Geographic data indexed with coordinates on the Earth’s surface are abundant in research applications of Bayesian networks for spatial data and are especially common in ecology and environmental science. Both fields are the focus of an especially rich literature in Bayesian modeling [25] as this framework naturally accommodates inference for a range of flexible modeling forms via MCMC and pools information across multiple noisy sources. Reviews of the usage of Bayesian networks for environmental data are given in [10, 136]. A salient point made by Uusitalo et al. [136] is that for spatial or temporal data, using Bayesian networks often requires instantiating a new network or many new nodes for every point in space or time and that this task can be very tedious or may lead to an unwieldy graphical model [92]. It has been observed [10] that a commonly used approach is to simply ignore the temporal or spatial coordinates associated with each observation and to consider the dataset as comprising independently distributed data points. This frequently adopted mode of processing treats observations independently and poses no methodological difficulties in most Bayesian network frameworks. Another advantage of making this independence assumption is that obtaining parameter estimates and generating forecasts on new data is very straightforward via the use of variable elimination or the junction tree algorithm, which rely critically on the assumption that variables have a relatively sparse dependency structure. The analyst may also opt to not discard spatial information entirely, but to use axes of spatial variation such as latitude, longitude, and altitude as separate variables on their own [138] as depicted in Figure 2(A).
Usage of Bayesian networks for prediction to generate maps of probabilities or risk is commonplace. Such analyses have produced raster imagery categorizing regions according to the probability of land use transition in Scotland [1] and deforestation risk in Swaziland [38], as well as per-segment predictions of shoreline erosion risk in Vanuatu [118] and of the persistence of trout along stream reaches in the western USA [114]. Using a parallel analysis is also attractive for use with remote sensing; for example, [69] employed a Bayesian network to estimate leaf area index from Landsat 7 imagery and [108] performed land use classification for stormwater management. Bayesian networks used in a parallel analysis may also be embedded as part of a larger simulation approach to generate maps of risks or probabilities conditional on the outputs of a physical model [95].
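
To make the per-pixel workflow concrete, the following sketch evaluates a very small discrete network independently at every cell of a raster. It is a toy illustration and does not reproduce any of the cited studies: the two parent layers (`slope`, `cover`), the conditional probability table `cpt`, the prior `p_cover`, and the use of -1 as a missing-data flag are all hypothetical.

```python
import numpy as np

# Hypothetical categorical rasters on a 3 x 4 grid (illustrative values only).
# slope: 0 = gentle, 1 = steep.  cover: 0 = forest, 1 = cropland, 2 = bare soil,
# with -1 marking a pixel where land cover was not observed.
slope = np.array([[0, 0, 1, 1],
                  [0, 1, 1, 0],
                  [1, 1, 0, 0]])
cover = np.array([[0, 1, 2, 2],
                  [0, 0, 1, 2],
                  [1, 2, 2, -1]])

# Assumed CPT for P(Erosion = high | slope, cover), indexed as cpt[slope, cover].
cpt = np.array([[0.05, 0.15, 0.30],
                [0.20, 0.45, 0.80]])

# Assumed prior P(cover), used only where cover is missing.
p_cover = np.array([0.5, 0.3, 0.2])

# Marginal risk with cover unobserved: sum over c of P(high | slope, c) * P(c).
risk_unknown_cover = cpt @ p_cover  # shape (2,), indexed by slope class

# Each pixel is treated as an independent case, so evaluating the network
# over the whole raster reduces to vectorized table lookups.
cover_missing = cover < 0
risk_observed = cpt[slope, np.maximum(cover, 0)]  # placeholder index at missing pixels
risk_map = np.where(cover_missing, risk_unknown_cover[slope], risk_observed)

print(np.round(risk_map, 3))
```

Because no information flows between neighboring pixels, the resulting map inherits any spatial pattern solely from the input layers, which is precisely the independence assumption discussed above.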

Interestingly, Bayesian networks can also be used as submodel components in larger, spatially explicit simulations of environmental systems (Figure 3(B)). Reference [84] shows how to use a Bayesian network to parameterize the flow of information between fishing vessels in an agent-based model (ABM) in which the vessels move across a spatial grid and interact with its fish population. Reference [3] likewise uses a dynamic Bayesian network as part of a spatial simulation of a predator-prey model with rich ecological dynamics. Reference [71] similarly modeled population change and movement with a Bayes net parameterization of an ABM. As Bayesian networks present desirable properties for risk analysis [45] and decision-making under uncertainty [103], they have found substantial use for the prediction of natural hazards and their associated risks. Within this field of study, Bayesian networks have been used to predict wildfire occurrence [107] and resulting land cover impacts [127], earthquake risk [82], urban flood risk [9], and risk of damage due to avalanches [53]. The field of epidemiology likewise has produced several new studies generating spatially explicit predictive risk maps [6, 54, 91] and identifying or predicting disease outbreaks [11, 65].

All of the studies referenced so far in this section use a geographic notion of space in which distances are on the order of meters or more and observations can be uniquely linked to locations on Earth’s surface. It is also possible to consider coordinates within non-geographic image data as a valid interpretation of the term “spatial data”. For example, neuroscientists constructed a Bayesian network model over a spatial domain defined on the surface of the human eye [135]. Several tasks in computer vision such as modeling image perception [122], image segmentation [2, 16, 44], and image classification [23, 106] can be accommodated within a Bayesian network model. However, over the past decade, research focus has shifted somewhat to more data-driven approaches using deep neural networks [75, 115], which eschew incorporation of expert knowledge into model structure and instead favor discipline-agnostic best practices and heuristics. Possible directions of advance for further research into methods for analyzing spatial data with Bayesian networks are presented in the next section.

4 Potential Future Directions

Despite the natural framing of Bayesian networks as a subset of graphical models, research at the intersection between Bayesian networks and undirected models appears to be relatively limited in the literature. There are good potential reasons for this absence; many users find that their usage of Bayesian networks with spatial data suffices for their modeling needs. Additionally, the widespread usage of proprietary platforms for this type of data analysis may hinder integration with open-source modeling frameworks with broader functionality. One can imagine use cases in which a system can be partitioned into (1) a set of processes for which there is excellent prior knowledge and causal mechanisms are well identified, and (2) variables that are important for decision making but poorly understood or not of interest on a mechanistic level. In such a case, it may be desirable to combine the Bayesian network structure for the first class of variables with data-driven models for the second. Irvine et al. [61] noted that for variables which have few or no parents within the network model, spatial correlation between observations of their values across sites can be interpreted as exogenous spatial trends. These variables may be well suited to representation with a multivariate spatial process [47] that allows for correlation both in space and between different types of variables. This could be especially useful in using Bayesian networks to conduct inverse modeling in which child nodes are observed but the parent nodes are obscured for some spatial regions. The richness of the spatial processes considered can vary from relatively simple probabilistic constructions to much more flexible joint distributions.

An alternate scenario that often arises in regression analyses of spatial data is that the relation between predictor and response variables is itself a function of space [17, 48]. In such a situation, it is expedient to use models which share observational information to pool estimates around a common mean but also allow for local variation in the model coefficients. In the context of the example from Figure 1, the posterior conditional probability \(P(Rabbit \vert Fox)\) modeled across n distinct regions might have a distribution centered around 0.3, with a standard deviation of 0.1 to allow for some limited heterogeneity in cross-variable interactions in space. An analogous Bayesian network approach would be to place hierarchical priors [8] on the columns of conditional probability tables (for discrete variables) or coefficients (for continuous variables) which are then linked together with a prior that enforces spatial smoothness. On a related note, it may be possible to encounter a scenario in which a Bayesian network is desired for a relatively large area of analysis for which expert opinion may be elicited with confidence for only a subset of that area. Then, additional model alterations may be needed to make use of that information in an appropriate manner.
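
One possible way to write down such a prior for a single binary CPT entry is sketched below. This is an assumed construction rather than a form taken from [8]: the logit link, the proper conditional autoregressive (CAR) prior, and the symbols \(\mu\), \(\phi^{(i)}\), \(\tau\), \(\rho\), W, and D are all introduced here for illustration. The superscript i indexes the region, following the notation of Section 1.

\[
\theta^{(i)} = P^{(i)}(Rabbit = 1 \vert Fox = 1) = \operatorname{logit}^{-1}\left(\mu + \phi^{(i)}\right), \qquad i = 1, \ldots, n,
\]

\[
\boldsymbol{\phi} = \left(\phi^{(1)}, \ldots, \phi^{(n)}\right)^{\top} \sim \mathcal{N}\left(\mathbf{0},\; \left[\tau \left(D - \rho W\right)\right]^{-1}\right).
\]

Here W is a binary adjacency matrix over the n regions, D is the diagonal matrix whose entries count each region's neighbors (every region is assumed to have at least one), \(0 \le \rho < 1\) controls the strength of spatial smoothing, and \(\tau > 0\) is a precision parameter governing how far the regional values may depart from the common level. Setting \(\mu = \operatorname{logit}(0.3) \approx -0.85\) centers the regional probabilities near 0.3, and \(\rho = 0\) removes the spatial coupling between regions so that only the shared center \(\mu\) links them.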

While there appear to be no fundamental theoretical obstacles to integrating Bayesian networks into larger model forms, the details of implementation and inference can appear formidable. It is clear that to make these jumps, algorithms which are designed to work solely on relatively small directed acyclic graphs will have to be replaced by generic inferential methods that work on a wide variety of models. Gibbs sampling has served this role well and subsequent developments in MCMC such as Hamiltonian Monte Carlo [42, 100] require only calculation of a model log posterior density as well as its gradient with regard to model parameters. As a result, we have the ability to use MCMC [101] on a very broad range of models incorporating neural networks, geostatistical processes [130], or very large datasets [87]. Since Bayesian networks typically have a real-valued parameter space even if their data is discrete, gradient-based MCMC is likely suitable for parameter estimation in a range of novel, BN-like forms with continuous parameterizations. Additionally, recent work suggests that this sampling scheme can be extended to include discrete parameters as well [104]. We hope that advances in integrating more flexible and more highly parameterized machine learning models with spatial statistics [125, 139] can be matched by future developments in Bayesian networks.
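
The requirement that gradient-based MCMC needs only a log posterior density and its gradient can be illustrated for one CPT entry with a continuous (logit-scale) parameterization. The following is a minimal sketch under stated assumptions and not a production sampler: the counts `k` and `n`, the Normal prior with scale `prior_sd`, and the step size and trajectory length are hypothetical choices.

```python
import numpy as np

# Hypothetical counts for one CPT entry: k "rabbit present" outcomes
# among n observations with a fox present.
k, n = 12, 40
prior_sd = 1.5  # assumed Normal(0, prior_sd) prior on z = logit(theta)


def log_post(z):
    """Unnormalized log posterior of z = logit(theta) under a binomial likelihood."""
    # k*z - n*log(1 + exp(z)) is the binomial log likelihood written in terms of z.
    return k * z - n * np.logaddexp(0.0, z) - 0.5 * (z / prior_sd) ** 2


def grad_log_post(z):
    """Analytic gradient of log_post with respect to z."""
    theta = 1.0 / (1.0 + np.exp(-z))
    return k - n * theta - z / prior_sd ** 2


def leapfrog(z, r, step, n_steps):
    """Leapfrog integration of Hamiltonian dynamics (position z, momentum r)."""
    r = r + 0.5 * step * grad_log_post(z)
    for _ in range(n_steps - 1):
        z = z + step * r
        r = r + step * grad_log_post(z)
    z = z + step * r
    r = r + 0.5 * step * grad_log_post(z)
    return z, r


# Basic Hamiltonian Monte Carlo with a Metropolis correction.
rng = np.random.default_rng(0)
z = 0.0
samples = []
for _ in range(2000):
    r0 = rng.normal()
    z_new, r_new = leapfrog(z, r0, step=0.1, n_steps=5)
    # Accept or reject based on the change in total energy along the trajectory.
    log_accept = (log_post(z_new) - 0.5 * r_new ** 2) - (log_post(z) - 0.5 * r0 ** 2)
    if np.log(rng.uniform()) < log_accept:
        z = z_new
    samples.append(z)

theta_samples = 1.0 / (1.0 + np.exp(-np.array(samples)))
print(theta_samples.mean())
```

Nothing in the sampler depends on the model beyond the two functions `log_post` and `grad_log_post`, which is why the same machinery can in principle be reused when the CPT parameters are embedded in a larger model with spatial components. In practice the gradient would usually be obtained by automatic differentiation rather than derived by hand.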

5 Conclusion

The usage of data with spatial correlation provides an option for potentially more accurate predictions by accounting for the influence of proximal data points. Bayesian networks are widely used with spatial data, though the recognition of the special dependency structures accompanying these datasets is not always present. Multiple avenues exist for representing spatial dependencies which can broadly be characterized into either spatially explicit probability models or post-/pre-processing workflows to implement ad hoc spatial adjustments. Recent developments in inference algorithms for broad classes of probabilistic models make the integration of spatial and nonspatial modeling components more straightforward and offer a wealth of opportunities for developing models more closely tailored to the needs of spatial analysts. We hope that this article stimulates and encourages further studies examining the potential for enhancing the usefulness of Bayesian networks for structured data in a range of contexts.

Acknowledgments

We would like to thank Marie Urban of Oak Ridge National Laboratory for helpful discussions and guidance.

References

[1] Inge Aalders. 2008. Modeling land-use decision behavior with Bayesian belief networks. Ecology and Society 13, 1 (2008), Article 16.
