Article

Yet Another Discriminant Analysis (YADA): A Probabilistic Model for Machine Learning Applications

1 Sandia National Laboratories, Albuquerque, NM 87185, USA
2 Klipsch School of Electrical and Computer Engineering, New Mexico State University, Las Cruces, NM 88003, USA
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(21), 3392; https://doi.org/10.3390/math12213392
Submission received: 12 September 2024 / Revised: 17 October 2024 / Accepted: 25 October 2024 / Published: 30 October 2024
(This article belongs to the Special Issue Artificial Intelligence and Data Science)
Figure 1. How YADA compares with other discriminant methods.
Figure 2. Histograms of random samples of exponential and uniform random variables generated from samples of a standard Gaussian variable using Equation (1).
Figure 3. The connection between the feature space (left) and the space of multivariate Gaussian random variables (right) for the phoneme dataset [30], where $X_i$ and $G_i$, $i = 1, 2$, are defined by Equation (3).
Figure 4. Contours of the joint pdf of $(G_1, G_2)$ (left) and $(X_1, X_2)$ (right).
Figure 5. The K-L divergence between 10 YADA models trained on the MNIST dataset (left). The right panel illustrates the class separation score, a symmetric and normalized version of the K-L divergence, sorted in descending order.
Figure 6. Two-dimensional projections of the YADA classification regions $\mathcal{R}_i$, $i = 0, 1, 2$, for the Fisher iris dataset.
Figure 7. Marginal likelihoods of a test point from the cancer dataset with class label M.
Figure 8. Explanations for two MNIST test images. The white pixels are the features that contribute the most to the joint likelihood score and can serve as explanations.
Figure 9. Prediction confidence $\mathrm{conf}_i(\mathbf{x}^*)$ defined by Equation (21) for the Fisher iris data.
Figure 10. Prediction confidence for the MNIST dataset. The top row illustrates the 5 test images with the greatest confidence values. The bottom row illustrates the test images with the greatest and least confidence for both the ‘2’ class (left two images) and the ‘6’ class (right two images).
Figure 11. Scatter plots of the cancer dataset for 6 random pairings of the 30 available features. Each panel illustrates the 569 real training data (‘×’) and 500 synthetic data (‘•’) for both benign (‘B’) and malignant (‘M’) class labels.
Figure 12. Images of handwritten digits: MNIST training data (top row) and synthetic images produced by YADA (bottom row).
Figure 13. Synthetic data using high probability sampling for the Fisher iris data.
Figure A1. Example models for the left and right tails of the cdf.

Abstract

This paper presents a probabilistic model for various machine learning (ML) applications. While deep learning (DL) has produced state-of-the-art results in many domains, DL models are complex and over-parameterized, which leads to high uncertainty about what the model has learned, as well as its decision process. Further, DL models are not probabilistic, making reasoning about their output challenging. In contrast, the proposed model, referred to as Yet Another Discriminant Analysis (YADA), is less complex than other methods, is based on a mathematically rigorous foundation, and can be utilized for a wide variety of ML tasks including classification, explainability, and uncertainty quantification. YADA is competitive in most cases with many state-of-the-art DL models. Ideally, a probabilistic model would represent the full joint probability distribution of its features, but doing so is often computationally expensive and intractable. Hence, many probabilistic models assume that the features are either normally distributed, mutually independent, or both, which can severely limit their performance. YADA is an intermediate model that (1) captures the marginal distributions of each variable and the pairwise correlations between variables and (2) explicitly maps features to the space of multivariate Gaussian variables. Numerous mathematical properties of the YADA model can be derived, thereby improving the theoretical underpinnings of ML. The model can be statistically validated on new or held-out data using native properties of YADA. However, there remain engineering and practical challenges, which we enumerate, to making YADA more useful.

1. Introduction

Deep learning (DL) has achieved state-of-the-art results in several domains [1,2,3,4]. However, DL techniques produce very complex models due to the large number of parameters and nonlinear relationships that are represented, making them difficult to interpret [5], susceptible to adversarial manipulation [6,7], and overconfident in their predictions [8,9]. As DL is being considered in more high-consequence domains, several research fields are emerging, including explainability [10,11], ethical AI [12], and uncertainty quantification [13,14], to quantify the risks associated with using DL. Herein, we examine the application of a probabilistic model that is straightforward, can be used for a wide variety of ML tasks, including classification, explainability, and uncertainty quantification, and is competitive in most cases with many state-of-the-art DL models. The proposed probabilistic model represents both the marginal distributions and pairwise correlations from the dataset, and is an extension to similar models in engineering disciplines [15,16] and the Gaussian copula in finance [17,18,19], medicine [20,21], geomechanics [22], climate [22,23,24], and astronomy [25]. The use of copulae in machine learning has thus far been limited; for example, ML classification has been studied [26].
Often in ML applications, the input features are modeled (at least theoretically) as random variables. However, DL does not produce probabilistic models, making reasoning about their outputs and quantifying their uncertainty challenging [27]. Bayesian neural networks seek to build probabilistic models, but they are expensive to train and increase the overall complexity of the model [28]. In problems of practical interest, the features/variables are interdependent, often in very complex ways, and the full joint distribution function is required to correctly represent them. However, calculating and implementing a model for the full joint distribution from data is not feasible, even with a relatively small number of features. One solution that is often pursued is to assume that the features are mutually independent, so that the joint distribution is equal to the product of the marginal distributions, which can easily be estimated from data. This assumption is very limiting and can lead to poor results.
We present a model that captures the marginal distributions and pairwise correlations from the dataset, referred to as Yet Another Discriminant Analysis or YADA, which can be viewed as an intermediary step between modeling just the marginal distributions and modeling the full joint distribution. A list of benefits of YADA includes the following:
  • YADA is less complex than other methods and yet is based on a mathematically rigorous foundation. It is straightforward to derive, for example, joint distributions, joint likelihood functions, joint entropy, and K-L divergence, among others.
  • YADA can represent continuous or discrete-valued features with arbitrary marginal distributions, i.e., the normality assumption is not required.
  • Pairwise correlations between features are well represented, i.e., an assumption of mutual independence is not required.
  • Given an unlabeled test point, YADA can be used to assess the joint likelihood that the point came from any of the relevant classes. If the likelihood of the point is sufficiently small for all classes, this is an indication that this point is out of distribution and any predicted label should not be trusted. This can be used as a measure for confidence in the prediction that is lacking in standard ML.
  • YADA has the ability to utilize the empirical distribution combined with an extrapolation model for the left and right tails for more refined feature modeling.
  • The YADA model provides a mapping between the feature space and the space of multivariate normal (MVN) random variables, which can be very useful because many calculations are straightforward in the MVN space.
  • YADA provides ML explanations based on the marginal likelihood for each feature, and can be used to create realistic synthetic data to address, for example, class and/or feature imbalance.
YADA extends the Gaussian copula through its use for explainability, out-of-distribution detection, prediction confidence, refined feature modeling, and synthetic data generation. While there have been some limited studies of Gaussian copulae for ML applications, we believe this paper to be the first broad survey of the approach's use in ML, and we hope it will inspire future research in this area.
Despite these benefits, YADA has several limitations. First, it can be less accurate for classification than large, complex DL models. Second, YADA captures only marginal distributions and pairwise correlations among features. Higher-order information, such as the full joint distribution, may not be captured accurately. Further, YADA operates only on the known feature vector (as opposed to DL models, which can learn their own set of features through the layers of the network). We view YADA as a first step toward providing a theoretical model that would be suitable for high-consequence applications. Future work would address these shortcomings of YADA to improve its performance.
The outline of our paper is as follows. The YADA model is described in detail in Section 2, with subsections on definitions and model properties. YADA for machine learning applications is presented in Section 3, with sections on training, synthetic data generation, classification, and uncertainty estimation; some of these topics require additional details, which are presented in the attached Appendices. Some concluding remarks and thoughts about future work are provided in Section 4. Examples with commonly available datasets are provided throughout the discussion to further illustrate the approach.

2. The YADA Model

Let $\mathbf{x} = (x_1, \ldots, x_d)^T$ denote a vector of $d \ge 1$ features with the corresponding class label $y \in \{0, \ldots, \kappa - 1\}$, where $\kappa$ denotes the number of classes. Often, in machine learning applications, the features are modeled as random variables; we use the notation $\mathbf{X} = (X_1, \ldots, X_d)^T$ to represent a vector of random variables used to model the feature vector $\mathbf{x}$. In this section, we present a probabilistic model for $\mathbf{X}$, which we call Yet Another Discriminant Analysis (YADA), that can be used to represent correlated features with arbitrary marginal distributions.
The main premise behind the YADA model is to represent each feature as a function h of a standard normal or Gaussian random variable. This allows for the marginal distribution of each feature to be controlled by the functional form of h, while the correlation of the underlying d Gaussian random variables controls the correlation between features. Both continuous and discrete-valued features can be modeled in this way.
Hence, the YADA model captures the pairwise correlations and marginal distributions from the dataset of interest. Many probabilistic models assume that the features are either normally distributed, mutually independent, or both, which can limit their performance (see Figure 1). While YADA does not have these limitations, it is important to acknowledge that there is no guarantee that higher-order information, such as joint distributions of any two or more features, will be represented accurately.
The definition of the YADA model is presented in Section 2.1, followed by a derivation of some properties of the model in Section 2.2. Examples using standard machine learning datasets are included throughout for demonstration.

2.1. Definition

We define the YADA model in this section. We start with a single continuous feature to simplify the discussion, then generalize the model to the case of two or more features that can be continuous or discrete.

2.1.1. Single Feature

Consider first the case of a single feature $\mathbf{x} = x$, and let $X$ be a continuous random variable that represents a model for $x$. Further, let $G \sim N(0, 1)$ denote a standard normal or Gaussian random variable with zero mean, unit variance, and probability density function (pdf) $\phi(u) = (1/\sqrt{2\pi})\, e^{-u^2/2}$, $-\infty < u < \infty$. The proposed YADA model for feature $x$ is given by
$$X = h(G) = F^{-1}\big(\Phi(G)\big), \qquad (1)$$
where $F$ is an arbitrary cumulative distribution function (cdf) and $\Phi(z) = \int_{-\infty}^{z} \phi(u)\, du$ denotes the cdf of $G$. Because $F$ and $\Phi$ are both cdfs (monotonic functions), $h = F^{-1} \circ \Phi$ is invertible, that is,
$$G = h^{-1}(X) = \Phi^{-1}\big(F(X)\big). \qquad (2)$$
We can therefore interpret $G$ defined by Equation (2) as the Gaussian image of $X$.
In general, the functional form of $h$ is nonlinear so that $X$ defined by Equation (1) is a non-Gaussian random variable, and it is simple to show that $F$ is actually the cdf of $X$. This follows because
$$\Pr(X \le a) = \Pr\big(h(G) \le a\big) = \Pr\big(G \le \Phi^{-1}(F(a))\big) = \Phi\big(\Phi^{-1}(F(a))\big) = F(a).$$
For example, let $X$ be an exponential random variable defined by cdf $F(x) = 1 - e^{-\lambda x}$, $x \ge 0$, where $\lambda > 0$ is a parameter. By Equation (1), $X$ can be expressed as $h(G) = -(1/\lambda) \ln\big(1 - \Phi(G)\big)$ because $F^{-1}(y) = -(1/\lambda) \ln(1 - y)$ is the inverse cdf of $X$. Similarly, $X = h(G) = a + (b - a)\Phi(G)$, with $a < b$, is a uniform random variable on the interval $[a, b]$. These examples are illustrated in Figure 2. On the left is a histogram of 10,000 samples of a standard Gaussian random variable $G$. These samples can be mapped to samples of an exponential or uniform random variable using the mapping function $h$ described above. The model defined by Equation (1) is referred to as the Nataf transformation [29] or translation random variable [15].
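For illustration, the following Python sketch implements the mapping of Equation (1) using SciPy distributions; the exponential rate and uniform interval below are arbitrary choices for the marginal cdf $F$, not values tied to any dataset in this paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g = rng.standard_normal(10_000)                         # samples of G ~ N(0, 1)
u = stats.norm.cdf(g)                                   # Phi(G), uniformly distributed on (0, 1)

# X = F^{-1}(Phi(G)) for two illustrative choices of the marginal cdf F
x_exponential = stats.expon(scale=1.0).ppf(u)           # exponential marginal with rate lambda = 1
x_uniform = stats.uniform(loc=-1.0, scale=2.0).ppf(u)   # uniform marginal on [-1, 1]
```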

2.1.2. Two or More Features

Next, consider the more general case of $d > 1$ continuous features. The YADA model for the $i$th feature is given by
$$X_i = h_i(G_i) = F_i^{-1}\big(\Phi(G_i)\big), \quad i = 1, \ldots, d, \qquad (3)$$
where each $G_i \sim N(0, 1)$ is a normal or Gaussian random variable with zero mean and unit variance, and each $h_i$ is an invertible function. The Gaussian variables $\{G_1, \ldots, G_d\}$ are correlated and have joint pdf
$$\phi_d(u_1, \ldots, u_d; \mathbf{c}) = \phi_d(\mathbf{u}; \mathbf{c}) = (2\pi)^{-d/2} \det(\mathbf{c})^{-1/2} \exp\!\left(-\tfrac{1}{2}\, \mathbf{u}^T \mathbf{c}^{-1} \mathbf{u}\right), \qquad (4)$$
where $\mathbf{c} = \{E[G_i G_j]\}$ is a $d \times d$ correlation matrix with determinant $\det(\mathbf{c})$; Equation (4) is also referred to as the multivariate normal pdf. Note that $\mathbf{c}$ is symmetric and positive definite with ones on the diagonal. The model described by Equation (3) is referred to as a translation random vector [16] and Gaussian copula [17].
The inverse of Equation (3) is given by Equation (2) with the addition of subscript $i$ and provides the marginal distribution for $G_i$, the Gaussian image of $X_i$. However, when applying this inverse mapping, there is no guarantee that the collection $\{G_1, \ldots, G_d\}$ will be jointly Gaussian.
The connection between the feature space and the space of multivariate Gaussian random variables is an important property of the YADA model, and one that is exploited in the following sections for classification, synthetic data, and other applications. To further illustrate this connection, consider Figure 3, which illustrates two features $X_1$ and $X_2$ mapped to two correlated Gaussian variables $G_1$ and $G_2$; the marginal histograms are also shown. The data can be mapped back and forth between the two spaces using Equation (3) and its inverse. These data came from the phoneme dataset [30].
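As a sketch of this mapping in the multivariate setting, the helper below sends each column of a data matrix to its Gaussian image via an empirical marginal cdf, i.e., the inverse of Equation (3); the two-feature array `X` constructed at the end is a hypothetical stand-in for a dataset such as the phoneme data, not the data used in Figure 3.

```python
import numpy as np
from scipy import stats

def gaussian_image(X):
    """Map each feature column x_i to G_i = Phi^{-1}(F_i(x_i)) using an empirical cdf.

    Ranks are divided by (n + 1) so the result stays finite at the sample extremes.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    G = np.empty_like(X)
    for i in range(d):
        ranks = stats.rankdata(X[:, i])          # ranks 1, ..., n (ties averaged)
        G[:, i] = stats.norm.ppf(ranks / (n + 1.0))
    return G

# Example: two correlated, non-Gaussian features
rng = np.random.default_rng(1)
base = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=2_000)
X = np.column_stack([np.exp(base[:, 0]), base[:, 1] ** 2])
G = gaussian_image(X)                            # columns are approximately jointly Gaussian
```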
Throughout this section, we have assumed continuous-valued features/random variables. Discrete-valued features can be modeled as discrete random variables and, with some modifications, can be handled in the same manner; see Appendix A for additional details.

2.2. Properties

We next derive some properties of the YADA model, such as the correlation matrix and joint distribution functions of $(X_1, \ldots, X_d)$. As mentioned in the previous section, there is a direct connection between the features $\{X_i\}$ and their Gaussian images $\{G_i\}$; see Equation (3). It follows that many of the properties of $\{X_i\}$ can be expressed in terms of the well-known properties of $\{G_i\}$.

2.2.1. Marginal Distributions

The marginal cdf and pdf of $X_i$, the model for the $i$th feature, define the probability distribution of the $i$th feature independent of the others. They can be expressed in terms of $\Phi$ and $\phi$, the cdf and pdf of a standard univariate Gaussian random variable. This relationship is
$$F_i(x_i) = \Phi\big(w_i(x_i)\big) \qquad (5)$$
and
$$f_i(x_i) = \frac{d}{dx_i} F_i(x_i) = \frac{d}{dx_i} w_i(x_i)\, \phi\big(w_i(x_i)\big), \qquad (6)$$
where we have introduced the function $w_i(x) = h_i^{-1}(x)$ for notational convenience. Functions $F_i$ and $f_i$ defined by Equations (5) and (6) are the marginal cdf and pdf, respectively, of the non-Gaussian random variable $X_i$.

2.2.2. Joint Distribution

The marginal distributions are not sufficient to define a collection of two or more random variables; we need the joint distribution to do this. Next, let
$$F(x_1, \ldots, x_d) = \Pr(X_1 \le x_1, \ldots, X_d \le x_d)$$
denote the joint cdf of all $d$ features $X_1, \ldots, X_d$. It follows that the joint pdf of $X_1, \ldots, X_d$ can be expressed in terms of the $d$-variate normal pdf (Equation (4)), i.e.,
$$f(x_1, \ldots, x_d) = \frac{\partial^d}{\partial x_1 \cdots \partial x_d} F(x_1, \ldots, x_d) = \prod_{i=1}^{d} \frac{d}{dx_i} w_i(x_i)\; \phi_d\big(w_1(x_1), \ldots, w_d(x_d); \mathbf{c}\big) = \prod_{i=1}^{d} \frac{f_i(x_i)}{\phi(w_i(x_i))}\; \phi_d(\mathbf{w}; \mathbf{c}), \qquad (7)$$
where $\mathbf{w} = (w_1(x_1), \ldots, w_d(x_d))^T$ and the last equality follows from the results in Appendix B. Hence, the joint pdf of $X_1, \ldots, X_d$ is the product of their marginal pdfs and the joint pdf of $G_1, \ldots, G_d$, divided by the product of the standard normal marginal pdfs $\phi(w_i(x_i))$.
As mentioned in the introduction, the YADA model works by capturing the pairwise correlations and marginal distributions of the dataset of interest. However, the joint pdf imposed by Equation (7) may not be sufficient to represent the true joint distribution of the underlying data. This should be considered when building the YADA model and reflects a trade-off against more complex models.
To demonstrate the joint distribution of a YADA model, we consider a simple example. Let
$$X_1 = h_1(G_1) = 2 + e^{G_1}, \qquad X_2 = h_2(G_2) = 2 + e^{1/2 + G_2}$$
be the YADA model for $d = 2$ features, where $G_1$ and $G_2$ are correlated Gaussian variables with $E[G_1 G_2] = 0.2$. Features $X_1$ and $X_2$ are correlated and each has a log-normal marginal distribution. Contours of $f(x_1, x_2)$, the joint pdf of $(X_1, X_2)^T$, are illustrated in the right panel of Figure 4; the left panel illustrates contours of $\phi_2(u_1, u_2)$, the joint pdf of $(G_1, G_2)^T$, i.e., the Gaussian image of $(X_1, X_2)^T$. Again, the connection between these two functions is defined by Equation (7) with $d = 2$.

2.2.3. Pairwise Correlations

The correlation between two features $X_i$ and $X_j$ can be expressed in terms of $c_{ij} = E[G_i G_j]$, the correlation between their Gaussian images, i.e.,
$$E[X_i X_j] = E\big[h_i(G_i)\, h_j(G_j)\big] = \int_{\mathbb{R}^2} h_i(u)\, h_j(v)\; \phi_2\!\left(u, v; \begin{bmatrix} 1 & c_{ij} \\ c_{ij} & 1 \end{bmatrix}\right) du\, dv, \qquad (8)$$
where $E[A]$ denotes the expected value of random variable $A$, and $\phi_2$ is the bivariate normal pdf with zero mean defined by Equation (4) with $d = 2$. It is typically not possible to further simplify the relationship between $E[X_i X_j]$ and $E[G_i G_j]$.
We note that there is no guarantee that the resulting matrix $\{E[X_i X_j]\}$ constructed from these correlations will be positive semi-definite, as is required for it to be a valid correlation matrix. This issue has been studied extensively (see [16,31], Section 3.1) and some solutions have been suggested, e.g., [32]. However, this is not an issue in our case because, as will be described in Section 3.1, we compute the correlation matrix of the Gaussian image of the data, which is guaranteed to be positive semi-definite [33] (Chapter 7).
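Since Equation (8) generally has no closed form, one can evaluate it numerically. The sketch below estimates the expectation by Monte Carlo over the Gaussian images, writing the second image as a linear combination of the first and an independent standard normal; the exponential and uniform transforms are illustrative choices of $h_i$ and $h_j$, not quantities from the paper.

```python
import numpy as np
from scipy import stats

def feature_product_moment(h_i, h_j, c_ij, n=200_000, seed=0):
    """Estimate E[X_i X_j] = E[h_i(G_i) h_j(G_j)] when corr(G_i, G_j) = c_ij."""
    rng = np.random.default_rng(seed)
    g_i = rng.standard_normal(n)
    g_j = c_ij * g_i + np.sqrt(1.0 - c_ij**2) * rng.standard_normal(n)  # correlated pair
    return float(np.mean(h_i(g_i) * h_j(g_j)))

# Illustrative marginal transforms h(g) = F^{-1}(Phi(g))
h_exp = lambda g: stats.expon(scale=1.0).ppf(stats.norm.cdf(g))         # exponential marginal
h_unif = lambda g: stats.uniform(loc=0.0, scale=1.0).ppf(stats.norm.cdf(g))  # uniform marginal
print(feature_product_moment(h_exp, h_unif, c_ij=0.5))
```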

2.2.4. Likelihood Functions

The likelihood function can be used to measure how well the YADA model explains observed data; we can define a marginal likelihood function for data on a single feature, as well as a joint likelihood for all features in the dataset. The marginal log-likelihood function is the log of the marginal pdf defined by Equation (6), i.e.,
$$\ln\big(f_i(x)\big) = \ln\big|w_i'(x)\big| - \frac{1}{2}\left[\ln(2\pi) + w_i(x)^2\right]. \qquad (9)$$
The joint log-likelihood function follows from Equation (7) and can be expressed as
$$\ell(\mathbf{x}) = \ln\big(f(\mathbf{x})\big) = \sum_{i=1}^{d} \ln\!\left(\frac{f_i(x_i)}{\phi(w_i(x_i))}\right) + \ln\big(\phi_d(\mathbf{w}; \mathbf{c})\big) = \sum_{i=1}^{d} \ln\big(f_i(x_i)\big) - \sum_{i=1}^{d} \ln\big(\phi(w_i(x_i))\big) - \frac{1}{2}\left[\ln\det(\mathbf{c}) + \mathbf{w}^T \mathbf{c}^{-1} \mathbf{w} + d \ln(2\pi)\right]. \qquad (10)$$
Note that the first two terms in Equation (10) are the logs of the marginal likelihood functions for each feature $X_i$ and its Gaussian image, and the third term is the log-likelihood function of the $d$-variate normal distribution with zero mean.
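A minimal sketch of evaluating Equation (10) is given below; the argument names (`marginal_logpdfs`, `w_funcs`, `c`) are a hypothetical interface for a fitted model, assumed to supply the marginal log-pdfs, the Gaussian-image maps $w_i$, and the correlation matrix.

```python
import numpy as np
from scipy import stats

def joint_log_likelihood(x, marginal_logpdfs, w_funcs, c):
    """Evaluate Equation (10) at the feature vector x.

    marginal_logpdfs[i](x_i) returns ln f_i(x_i), w_funcs[i](x_i) returns w_i(x_i),
    and c is the d x d correlation matrix of the Gaussian images.
    """
    x = np.asarray(x, dtype=float)
    w = np.array([wf(xi) for wf, xi in zip(w_funcs, x)])
    d = w.size
    sum_marginals = sum(lp(xi) for lp, xi in zip(marginal_logpdfs, x))
    sum_phi = np.sum(stats.norm.logpdf(w))               # sum of ln phi(w_i(x_i))
    _, logdet = np.linalg.slogdet(c)
    quad = w @ np.linalg.solve(c, w)                     # w^T c^{-1} w
    return sum_marginals - sum_phi - 0.5 * (logdet + quad + d * np.log(2.0 * np.pi))
```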

2.2.5. Conditional Distributions

We can also derive the distribution of a conditional YADA model, which can be useful when the values of some of the features are known and fixed. Let the feature vector be partitioned as
$$(X_1, \ldots, X_d)^T = (Y_1, \ldots, Y_m, Z_{m+1}, \ldots, Z_d)^T = \begin{bmatrix} \mathbf{Y} \\ \mathbf{Z} \end{bmatrix},$$
where the features have been re-ordered such that any with known fixed values are collected into vector $\mathbf{Z}$, and the remaining $m \le d$ features are random variables collected into vector $\mathbf{Y}$. The joint pdf of $\mathbf{Y}$, given that $\mathbf{Z} = \mathbf{z}$ is known and fixed, can be obtained by exploiting the well-known property that conditional Gaussian variables also follow a Gaussian distribution [31] (Appendix C), that is,
$$f_{\mathbf{Y}|\mathbf{Z}}(\mathbf{y} \mid \mathbf{z}) = \frac{f(\mathbf{x})}{f_{\mathbf{Z}}(\mathbf{z})} = \prod_{i=1}^{m} \frac{f_i(y_i)}{\phi(w_i(y_i))}\; \phi_m\!\left(\mathbf{w}_{\mathbf{Y}};\; \mathbf{c}_{\mathbf{YZ}} \mathbf{c}_{\mathbf{ZZ}}^{-1} \mathbf{w}_{\mathbf{Z}},\; \mathbf{c}_{\mathbf{YY}} - \mathbf{c}_{\mathbf{YZ}} \mathbf{c}_{\mathbf{ZZ}}^{-1} \mathbf{c}_{\mathbf{YZ}}^T\right),$$
where
$$\mathbf{c} = \begin{bmatrix} \mathbf{c}_{\mathbf{YY}} & \mathbf{c}_{\mathbf{YZ}} \\ \mathbf{c}_{\mathbf{YZ}}^T & \mathbf{c}_{\mathbf{ZZ}} \end{bmatrix}, \quad \mathbf{w}_{\mathbf{Y}} = \big(w_1(y_1), \ldots, w_m(y_m)\big)^T, \quad \text{and} \quad \mathbf{w}_{\mathbf{Z}} = \big(w_{m+1}(z_{m+1}), \ldots, w_d(z_d)\big)^T,$$
and $\phi_m(\cdot\,;\, \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the $m$-variate normal pdf with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$.

2.2.6. Entropy and K-L Divergence

Some information-theoretic properties of the YADA model can also be established. The (differential) joint entropy of a YADA model is given by
$$-\int_{\mathbb{R}^d} f(\mathbf{x}) \log f(\mathbf{x})\, d\mathbf{x}, \qquad (12)$$
where $f$ is the joint pdf defined by Equation (7), and $\log f(\mathbf{x})$ can be obtained from Equation (10).
Suppose that we have two YADA models with joint pdfs $f^{(i)}$ and $f^{(j)}$; the relative entropy or Kullback-Leibler (K-L) divergence is a type of statistical distance between the two models. The K-L divergence of model $i$ from reference model $j$ is given by
$$d_{\mathrm{KL}}(i, j) = \int_{\mathbb{R}^d} f^{(i)}(\mathbf{x}) \log \frac{f^{(i)}(\mathbf{x})}{f^{(j)}(\mathbf{x})}\, d\mathbf{x}. \qquad (13)$$
Using Equations (7) and (10), we have
$$d_{\mathrm{KL}}(i, j) = \int_{\mathbb{R}^d} \prod_{k=1}^{d} \frac{f_k^{(i)}(x_k)}{\phi\big(w_k^{(i)}(x_k)\big)} \left[ \sum_{k=1}^{d} \left( \ln \frac{f_k^{(i)}(x_k)}{f_k^{(j)}(x_k)} + \ln \frac{\phi\big(w_k^{(j)}(x_k)\big)}{\phi\big(w_k^{(i)}(x_k)\big)} \right) - \frac{1}{2} \ln \frac{\det(\mathbf{c}^{(i)})}{\det(\mathbf{c}^{(j)})} - \frac{1}{2} \big(\mathbf{w}^{(i)}\big)^T \big(\mathbf{c}^{(i)}\big)^{-1} \mathbf{w}^{(i)} + \frac{1}{2} \big(\mathbf{w}^{(j)}\big)^T \big(\mathbf{c}^{(j)}\big)^{-1} \mathbf{w}^{(j)} \right] \phi_d\big(\mathbf{w}^{(i)}; \mathbf{c}^{(i)}\big)\, d\mathbf{x}, \qquad (14)$$
where $\mathbf{c}^{(i)}$ and $\mathbf{w}^{(i)}$ represent the covariance matrix $\mathbf{c}$ and vector $\mathbf{w}$ defined above for model $i$. The K-L divergence is not symmetric, meaning that $d_{\mathrm{KL}}(i, j) \ne d_{\mathrm{KL}}(j, i)$ in general.
Equation (14) may be difficult to compute in practice, but there is an alternative. Because of the connection between each feature and its Gaussian image, i.e., Equation (3), we can also compute the K-L divergence in the space of multivariate normal random variables as
$$d_{\mathrm{KL}}(i, j) = \frac{1}{2}\left[\operatorname{tr}\!\big(\big(\mathbf{c}^{(j)}\big)^{-1} \mathbf{c}^{(i)}\big) - \ln \frac{\det(\mathbf{c}^{(i)})}{\det(\mathbf{c}^{(j)})} - d\right], \qquad (15)$$
where $\operatorname{tr}(\mathbf{c})$ denotes the trace of matrix $\mathbf{c}$, and $d$ is the number of features.
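Equation (15) depends only on the Gaussian-image correlation matrices and is straightforward to evaluate, as in the sketch below (a minimal implementation, assuming both matrices are non-singular).

```python
import numpy as np

def kl_divergence_mvn(c_i, c_j):
    """Equation (15): d_KL(i, j) between zero-mean Gaussians with correlation matrices c_i, c_j."""
    d = c_i.shape[0]
    _, logdet_i = np.linalg.slogdet(c_i)
    _, logdet_j = np.linalg.slogdet(c_j)
    trace_term = np.trace(np.linalg.solve(c_j, c_i))     # tr((c^(j))^{-1} c^(i))
    return 0.5 * (trace_term - (logdet_i - logdet_j) - d)
```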

3. YADA for Machine Learning

In this section, we discuss using the YADA model in the context of machine learning applications. The topics of training a YADA model and utilizing it for classification are presented in Section 3.1 and Section 3.2, respectively. As described in Section 3.3, explanations for predicted labels can also be provided; a separate model for explainability is not needed. An approach for assessing confidence in ML model predictions with YADA is presented in Section 3.4, and Section 3.5 contains a discussion on creating synthetic data from a trained YADA model. Examples using standard machine learning datasets are included throughout for demonstration.

3.1. Training

Suppose we have training data $\{(\mathbf{x}, y)_j,\; j = 1, \ldots, n\}$, where each $\mathbf{x} = (x_1, \ldots, x_d)^T$ is a vector of $d$ features, and each $y \in \{0, \ldots, \kappa - 1\}$ is a class label. In this section, we describe how a YADA model can be trained with these data, that is, how the data can be used to learn the functions $h_1, \ldots, h_d$ and the correlation matrix $\mathbf{c}$ defined by Equations (3) and (4), respectively.
We first partition the training data per class, then train $\kappa$ YADA models, one for each class. Each model represents the probability distribution for a single class; we are not modeling one class versus all of the other classes. For each class, we follow five steps for training (a code sketch follows this list):
  • Estimate $F_i$, the marginal cdf for each of the $d$ features. This can be performed, for example, using the empirical cdf or kernel-based methods such as the kernel density estimator with a Gaussian kernel. The former is appropriate if it is especially important to maintain the support of each random variable observed in the data, while the latter is important if interpolation within and extrapolation beyond the support of the observed data are desired. Further, YADA has the ability to use the empirical cdf combined with an extrapolation model for the left and right tails; see Appendix C.
  • Estimate the corresponding inverse marginal cdf for each feature. If a kernel density estimator was used for the previous step, we can solve for the inverse using, for example, interpolating spline functions.
  • Compute the Gaussian image of each feature in the training set using Equation (2); this will produce $n$ samples of a random vector that we assume follows a $d$-variate normal distribution.
  • Compute the sample covariance matrix $\mathbf{c}$ of the data produced in the previous step.
  • Compute the inverse and log-determinant of $\mathbf{c}$. In practice, we compute the pseudo-inverse and pseudo-determinant to handle any numerical issues caused by collinearity in the data. For any features with zero variance, IID Gaussian noise with small variance can be added to the data.
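The following Python sketch strings these five steps together for the data of a single class; it uses an empirical-cdf choice for the marginals (a kernel density estimator could be substituted in steps 1-2), and the returned dictionary is a hypothetical container for the fitted quantities.

```python
import numpy as np
from scipy import stats

def train_yada_class(X, rng=None):
    """Train a YADA model on the (n, d) data matrix of a single class (steps 1-5 above)."""
    rng = np.random.default_rng(0) if rng is None else rng
    X = np.asarray(X, dtype=float).copy()
    n, d = X.shape
    # Guard for step 5: jitter any zero-variance feature with small IID Gaussian noise
    zero_var = X.std(axis=0) == 0.0
    X[:, zero_var] += 1e-6 * rng.standard_normal((n, int(zero_var.sum())))
    # Steps 1-3: empirical marginal cdfs and the Gaussian image of the training data
    G = np.empty_like(X)
    for i in range(d):
        G[:, i] = stats.norm.ppf(stats.rankdata(X[:, i]) / (n + 1.0))
    # Step 4: sample covariance matrix of the Gaussian image
    c = np.cov(G, rowvar=False)
    # Step 5: pseudo-inverse and pseudo-log-determinant to tolerate collinearity
    c_pinv = np.linalg.pinv(c)
    eigvals = np.linalg.eigvalsh(c)
    logdet = float(np.sum(np.log(eigvals[eigvals > 1e-12])))
    return {"sorted_marginals": np.sort(X, axis=0), "c": c, "c_pinv": c_pinv, "logdet": logdet}
```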
As mentioned, we first partition the training data per class, then train κ YADA models, one for each class. This approach is useful when applying YADA for classification, where it is necessary to evaluate the likelihood that a test point comes from the YADA model for each of the labels; this will be discussed in Section 3.2. As in other ML methods, the quality of the model depends on the quality and amount of the training data. If data of one class outnumber those of another class, YADA models for any underrepresented classes will have lower quality. However, the amount of data required by YADA is significantly less than that typically required for DL models.
It is possible to instead train a single YADA model to all of the data regardless of class label. As discussed in Section 3.5, this approach is useful for synthetic data generation when, in addition to producing synthetic data on features alone, we also want to produce synthetic data on the corresponding labels.
Once trained, comparing the YADA models can provide useful information about how well separated the training data are among the different classes. This measure can provide a baseline for trustworthiness; if the classes are close together and hence difficult to separate, we should perhaps be skeptical of any predictions by ML models trained on these data. To illustrate, we consider the Modified National Institute of Standards and Technology (MNIST) dataset of images of handwritten digits 0, 1, …, 9 [34]. The dataset contains 60,000 images for training and 10,000 images for testing, and they are approximately evenly distributed over the 10 classes. Each image contains $28 \times 28$ grayscale pixels which, when scaled, can be interpreted as $28^2 = 784$ continuous features that take values in the interval $[0, 1]$. We trained a YADA model for each class, and then, as described in Section 2.2.6, computed the K-L divergence amongst the 10 models. The results are illustrated in Figure 5. The left panel illustrates the K-L divergence between the 10 models as a matrix colored by the value of the divergence, with values ranging from 0 along the diagonal to a maximum of approximately 350; the divergence is greatest for the YADA models of class 0 and class 1. The right panel illustrates a “class separation score”, which is simply a symmetric version of the K-L divergence, i.e., $\big(d_{\mathrm{KL}}(i, j) + d_{\mathrm{KL}}(j, i)\big)/2$, with each term normalized by the entropy of the reference distribution (see Equation (12)). The class separation scores take values between zero and one and are illustrated as a bar plot, sorted in decreasing order. The greatest separation is between classes 0 and 1; the least separation is between classes 7 and 9.

3.2. Classification

Suppose we have followed the steps outlined in Section 3.1, resulting in $\kappa$ trained YADA models, one for each class label. We distinguish YADA models for different classes by using a superscript to denote the class label for all relevant model properties. For example, $\mathbf{c}^{(i)}$ will denote the sample covariance obtained in training step 4 and $f^{(i)}$ represents the joint pdf defined by Equation (7), both for the YADA model for class $i$. Let $\mathbf{x}$ denote a feature vector with an unknown label. In this section, we present ways to use the collection of YADA models to predict the label for $\mathbf{x}$.

3.2.1. Likelihood

The probability that $\mathbf{x}$ comes from class $i$ is given by a version of Bayes' formula, i.e.,
$$\Pr(\text{class } i \mid \mathbf{x}) = \frac{\Pr(\mathbf{x} \mid \text{class } i)\, \Pr(\text{class } i)}{\sum_{j=0}^{\kappa-1} \Pr(\mathbf{x} \mid \text{class } j)\, \Pr(\text{class } j)} = \frac{f^{(i)}(\mathbf{x})\, \pi^{(i)}}{\sum_{j=0}^{\kappa-1} f^{(j)}(\mathbf{x})\, \pi^{(j)}}, \qquad (16)$$
where $\pi^{(i)} = \Pr(\text{class } i)$ is the prior probability that an unlabeled test point belongs to class $i$, and $f^{(i)}(\mathbf{x})$ defined by Equation (7) can be interpreted as the joint likelihood that $\mathbf{x}$ comes from class $i$. Typically, the prior probabilities are assumed to be the relative frequency of each class in the training dataset; non-informative priors, where each class is assumed equally likely, can also be used. To address any numerical issues, for calculations, we instead compute the log of Equation (16), i.e.,
$$\ln \Pr(\text{class } i \mid \mathbf{x}) \propto \ell^{(i)}(\mathbf{x}) + \ln\big(\pi^{(i)}\big),$$
where $\ell^{(i)} = \ln(f^{(i)})$ is given by Equation (10), and $a \propto b$ means that $a$ is linearly proportional to $b$.
To demonstrate classification, we trained three YADA models on the Fisher iris data, one per class label, then evaluated Equation (16) throughout the feature space to determine the classification regions for each label, that is, the regions
$$\mathcal{R}_i = \big\{\mathbf{x} : \Pr(\text{class } i \mid \mathbf{x}) > \Pr(\text{class } j \mid \mathbf{x}),\; \forall\, j \ne i\big\}.$$
Hence, region $\mathcal{R}_i$ is the set of all points in feature space that YADA will predict as coming from class $i$. Figure 6 illustrates 2D projections of regions $\mathcal{R}_0$ (blue), $\mathcal{R}_1$ (orange), and $\mathcal{R}_2$ (green). The training data are also shown for reference. Note that in the plot on the right, the white area represents a part of the feature space where the likelihood of all three models is extremely small, and YADA cannot provide a prediction for test points that fall in this area.
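A minimal sketch of likelihood-based prediction is shown below; `models` is assumed to be a list of per-class callables returning the joint log-likelihood $\ell^{(i)}(\mathbf{x})$ (for example, the `joint_log_likelihood` helper sketched earlier with the class-$i$ parameters bound in), and the posterior of Equation (16) is formed stably in log space.

```python
import numpy as np

def predict_class(x, models, log_priors):
    """Return (predicted label, posterior over classes) from per-class log-likelihoods."""
    scores = np.array([m(x) + lp for m, lp in zip(models, log_priors)])  # l^(i)(x) + ln pi^(i)
    posterior = np.exp(scores - np.max(scores))       # subtract max for numerical stability
    posterior /= posterior.sum()                      # normalized version of Equation (16)
    return int(np.argmax(scores)), posterior
```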

3.2.2. Mahalanobis Distance

One alternative approach for classification is to utilize the Mahalanobis distance, a measure of the distance of a point to a distribution. Given a test point $\mathbf{x}$, we can calculate its Gaussian image $\mathbf{g}^{(i)}$ assuming it originates from class $i$; $\mathbf{g}^{(i)}$ is a vector of $d$ coordinates, with each coordinate obtained by using Equation (2), i.e.,
$$g_j^{(i)} = \Phi^{-1}\big(F_j^{(i)}(x_j)\big),$$
where $F_j^{(i)}$ is the cdf for feature $j$ assuming class $i$.
For classification, we assign $\mathbf{x}$ to class $i$ if it is closest to the YADA model for class $i$, that is, if $m_i(\mathbf{x}) \le m_j(\mathbf{x})$, $j = 0, \ldots, \kappa - 1$, where
$$m_i(\mathbf{x}) = \left[\big(\mathbf{g}^{(i)}\big)^T \big(\mathbf{c}^{(i)}\big)^{-1} \mathbf{g}^{(i)}\right]^{1/2} \qquad (18)$$
is the Mahalanobis distance of $\mathbf{x}$ from the YADA model for class $i$.
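A short sketch of this rule follows, assuming the per-class Gaussian images and (pseudo-)inverse correlation matrices have already been computed.

```python
import numpy as np

def mahalanobis_distance(g, c_inv):
    """Equation (18): distance of the Gaussian image g^(i) from the class-i YADA model."""
    g = np.asarray(g, dtype=float)
    return float(np.sqrt(g @ c_inv @ g))

def predict_by_distance(gaussian_images, c_inverses):
    """Assign the class whose model is closest; inputs are per-class g^(i) and (c^(i))^{-1}."""
    dists = [mahalanobis_distance(g, ci) for g, ci in zip(gaussian_images, c_inverses)]
    return int(np.argmin(dists)), dists
```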

3.2.3. Ensemble Method with Dimension Reduction

A second alternative to the likelihood approach is to classify using an ensemble of reduced-order YADA models. This approach can be beneficial if the computational cost of using all features is too great, or when there exists some collinearity between the features that results in a poorly conditioned covariance matrix. Let $\mathcal{I} \subset \{1, \ldots, d\} = \mathcal{I}_d$ be a subset of features, with $|\mathcal{I}| = \tilde{d} \le d$. The $\tilde{d}$ features can be selected at random or by some feature importance metric. From Equation (7), the joint pdf of the reduced-order feature vector is obtained by integrating out the dependence on those features not contained in $\mathcal{I}$, i.e.,
$$\int_{\mathbb{R}^{d - \tilde{d}}} f(\mathbf{x}) \prod_{i \in \mathcal{I}_d \setminus \mathcal{I}} dx_i = \prod_{i \in \mathcal{I}} \frac{f_i(x_i)}{\phi(w_i(x_i))}\; \phi_{\tilde{d}}\big(\tilde{\mathbf{w}}; \tilde{\mathbf{c}}\big), \qquad (19)$$
where $\tilde{\mathbf{w}}$ is a vector with $\tilde{d}$ coordinates $w_i(x_i)$, $i \in \mathcal{I}$, and $\tilde{\mathbf{c}}$ is a $\tilde{d} \times \tilde{d}$ covariance matrix obtained by retaining the rows and columns from the original covariance matrix $\mathbf{c}$ defined by Equation (7) with indices contained in $\mathcal{I}$.
For the ensemble classification of an unlabeled test point $\mathbf{x}$, we follow three steps:
  • Choose $\tilde{d} \le d$ features at random from the original set to make up the reduced feature set, $\mathcal{I}$;
  • Remove the corresponding $d - \tilde{d}$ features from the test point $\mathbf{x}$ that do not belong to $\mathcal{I}$;
  • Predict the label for the reduced-order version of $\mathbf{x}$ using the methods defined in Section 3.2.1 or Section 3.2.2 with the reduced-order YADA model with joint pdf defined by Equation (19).
By repeating the above steps numerous times, we obtain an ensemble of label predictions, and we can interpret the empirical distribution over the class labels as the probabilities that $\mathbf{x}$ belongs to each class.

3.3. Explanations

Recall Equation (16), where the numerator term $f^{(i)}(\mathbf{x})\, \pi^{(i)}$ can be interpreted as the posterior joint likelihood that test point $\mathbf{x}$ comes from class $i$. The posterior marginal likelihoods are given by
$$f_j^{(i)}(x_j)\, \pi^{(i)}, \quad j = 1, \ldots, d, \qquad (20)$$
where $f_j^{(i)}$ is the marginal pdf of feature $j$ defined by Equation (6), assuming class $i$, and $x_j$ is the value of the $j$th feature at the test point. The values defined by Equation (20) approximately explain how each feature contributes to the posterior probability that $\mathbf{x}$ comes from class $i$. This is only an approximation; the exact relationship between the marginal and joint likelihoods is given by Equation (7), and is more complex.
One approach to explainability is to simply plot the scores defined by Equation (20). To illustrate, we consider a dataset containing measurements of 570 human cancer cells [35]. Thirty cell features are recorded, such as cell area and perimeter; each measurement has a label of either benign (B) or malignant (M). We train a YADA model for each class, and then use the marginal likelihoods defined by Equation (6) as a means of explaining class predictions. Figure 7 illustrates the marginal likelihoods of a test point with true label M; the marginal likelihoods that the point comes from each of the two classes are illustrated for each feature. The likelihood that the point comes from class M is greater for nearly all 30 features, but most significantly for features 14 (smoothness_se) and 19 (fractal_dimension_se).
A second approach to explainability using YADA, particularly useful when the number of features d is large, is to examine a subset of the d posterior marginal likelihoods, for example, a subset of the 10 largest posterior marginal likelihoods, or those that are larger than the average of all d posterior marginal likelihoods, because the members of this subset contribute more to the posterior probability. To illustrate, recall the MNIST dataset [34] introduced previously. We trained a YADA model for each class, then predicted labels for each image in the test set. The first two test images are illustrated in Figure 8. In each image, 14 pixels are highlighted (white); these are the features that contribute the most to the joint likelihood score. They can be interpreted as explaining which pixels were most important to the classifier. We observe that pixels near the outline of a digit are important, but pixels with zero values are also indicative.
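A small sketch of this second approach is given below; it ranks features by their marginal log-likelihood under the predicted class and returns the indices of the largest contributors (the class prior in Equation (20) is constant across features within a class, so it does not affect the ranking). The default of 14 matches the number of highlighted pixels in Figure 8 and is otherwise arbitrary.

```python
import numpy as np

def top_k_explanations(x, marginal_logpdfs, k=14):
    """Indices of the k features with the largest ln f_j^(i)(x_j) under the predicted class."""
    scores = np.array([lp(xj) for lp, xj in zip(marginal_logpdfs, x)])
    return np.argsort(scores)[::-1][:k]
```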

3.4. Prediction Confidence

Let $\mathbf{x}$ denote an unlabeled test point with predicted label $y$; this prediction can originate from YADA and the methods from Section 3.2, or from a completely separate ML model such as a trained neural network. There are a variety of sources of uncertainty that impact the outputs of ML models [36,37]. High uncertainty leads to low confidence in the predicted label. However, ML (and particularly DL) models are generally overconfident in their predictions [9,38]. We can use YADA to provide an independent confidence score for any test point.
In this section, we present a method to assess the confidence in the predicted label $y$ as a function of the distance of the test point to the training data. The confidence values can be used to (1) quantify the uncertainty of a prediction from an ML model on inference data for tasks such as out-of-distribution (OOD) detection (when inference data differ from the data used for training) [9]; (2) identify outliers and anomalies in training data that could have adverse effects when used for training [39]; and (3) detect concept drift, when the target data have changed over time and invalidate the ML model [40]. We show that YADA is effective at measuring confidence for detecting OOD data; we utilize the Mahalanobis distance of $\mathbf{x}$ from a trained YADA model, as defined by Equation (18). Two approaches are considered.

3.4.1. Empirical Approach

One approach is to compute the Mahalanobis distance of $\mathbf{x}$ from the YADA model trained on the data with predicted class label $y$, then compare this distance with the population of distances of the training data from that same model. For example, if $y = i$ is the predicted class label, we would compare $m_i(\mathbf{x})$ defined by Equation (18) with the population of distances $m_i(\{\mathbf{x}^{(i)}\})$ of all training data for class $i$. Intuitively, if $m_i(\mathbf{x})$ is close to the mean of the population, we can have higher confidence in the predicted label $y$ than if $m_i(\mathbf{x})$ is in the tail of the population.

3.4.2. Approach Based on Theory

Let $\mathbf{G} = (G_1, \ldots, G_d)^T$ be a Gaussian random vector with mean vector $\boldsymbol{\mu}$ and covariance matrix $\mathbf{c}$. It is well known that the random variable $M^2 = (\mathbf{G} - \boldsymbol{\mu})^T \mathbf{c}^{-1} (\mathbf{G} - \boldsymbol{\mu})$ follows a chi-squared distribution with $d$ degrees-of-freedom. For applications, we instead have samples of vector $\mathbf{G}$, which (approximately) follows a $d$-variate normal distribution, and we can compute the sample mean vector $\hat{\boldsymbol{\mu}}$ and sample covariance matrix $\hat{\mathbf{c}}$. This result provides a theoretical means of determining the likelihood that the Mahalanobis distance $m_i(\mathbf{x})$ came from the theoretical distribution of the Mahalanobis distance for class $i$.
Using this approach, we can derive the following as a measure of the confidence that test point $\mathbf{x}$ comes from class $i$:
$$\mathrm{conf}_i(\mathbf{x}) \propto \exp\!\left(-\frac{1}{2}\, \frac{m_i(\mathbf{x})^2}{d + 1}\right), \qquad (21)$$
where $d$ is the number of features in the model; this expression is based on the fact that the square of the Mahalanobis distance of a multivariate normal random vector follows the chi-squared distribution with $d$ degrees-of-freedom.
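Two simple ways to turn the distance into a confidence score are sketched below. The first follows the exponential form of Equation (21) as written above; the second uses the chi-squared tail probability of the squared distance, which is a closely related alternative of our own choosing rather than the formula prescribed by the paper.

```python
import numpy as np
from scipy import stats

def confidence_exponential(m, d):
    """Confidence proportional to exp(-m^2 / (2 (d + 1))), cf. Equation (21)."""
    return float(np.exp(-0.5 * m**2 / (d + 1)))

def confidence_chi2_tail(m, d):
    """P(M^2 >= m^2) under the chi-squared(d) law of the squared Mahalanobis distance."""
    return float(stats.chi2(df=d).sf(m**2))
```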

3.4.3. Applications

To illustrate these concepts for prediction confidence, we considered three different applications. First, we trained three YADA models on the Fisher iris dataset, one model per class label. We then treated each of the 150 data points as test points and classified them using the likelihood approach. The confidence in these predictions as computed by Equation (21) is illustrated in Figure 9.
The data are ordered so that points 1–50 are from class Iris-setosa, points 51–100 are from Iris-versicolor, and points 101–150 are from class Iris-virginica. For points 1–50, the prediction confidence that these points come from the Iris-setosa class is very high, and the confidence for the other two classes is near zero. For points 51–100, the prediction confidence is, in general, the greatest for the (correct) Iris-versicolor class, but the confidence that these points come from Iris-virginica is not zero, so there is some uncertainty in the predicted class labels for these points. Similar results can be observed for points 101–150.
A second example is based on the MNIST dataset of images of handwritten digits 0, 1, …, 9 [34]. We trained a YADA model for each class, then predicted labels for each image in the test set; we also computed confidence values for each predicted label. The five test images with the greatest confidence values are illustrated by the top row in Figure 10. The bottom row illustrates the test images of a ‘2’ and ‘6’ that had the greatest (left) and least (right) confidence values.
For a third example, we utilize a DL model (ResNet50 [41]) pre-trained on the CIFAR-10 dataset and compare how well YADA detects OOD data. CIFAR-10 [42] is a 10-class dataset for general object detection containing 50,000 training samples. For this study, data from CIFAR-10 are considered to be in-distribution and any other data are OOD. We then consider the following benchmark object recognition datasets in an OOD study: (1) CIFAR-100 [42], which is similar to CIFAR-10 except that it has 100 classes; (2) Tiny ImageNet (TIN) [43], an object detection dataset that is a subsample of the ImageNet database of over 14 million images (with any intersection of images from CIFAR-10 removed); (3) MNIST [44]; (4) SVHN [45], a dataset of mostly house numbers representing digits in natural images; (5) Texture [46], a collection of textural images in the wild; and (6) Places365 [47], a scene recognition dataset. We apply the pre-trained DL model and the experiment framework from OpenOOD [48] to compare against other state-of-the-art OOD detection methods. In particular, we use the values from the last layer in the neural network as input to the OOD detection methods.
We limited our comparison to state-of-the-art OOD detection methods. Other probabilistic methods could have been used but would require higher computational overhead; rather than increase that overhead, we chose to focus on the selected methods. Other DL models could have been used and may provide some variation in results. However, the provided example highlights the ability of YADA to quantify the uncertainty of OOD data with performance similar to other state-of-the-art techniques.
We compute the Mahalanobis distance as our confidence measure and use it for OOD detection with YADA, comparing against the following: traditional Mahalanobis distance (MDist) [49], Virtual Logit Matching (ViM) [50], and Deep Nearest Neighbor (DNN) [51]. YADA models trained using both the empirical distribution and a kernel density estimator are considered. The results are summarized in Table 1; the values correspond to the area under the ROC curve. In all cases except for the Texture dataset, YADA outperforms the traditional Mahalanobis distance (MDist), suggesting that transforming the data to the MVN space (meeting the assumptions of the Mahalanobis distance) is beneficial. YADA is competitive with the other OOD methods considered.

3.5. Synthetic Data Generation

Synthetic data have a variety of uses, such as creating additional training or testing data, or compensating for label and/or feature imbalance. In particular, YADA can be used to produce synthetic but realistic images for image-related applications. Because YADA is a probabilistic model, it is straightforward to generate synthetic data. There are several ways to do this, as described in the following sections.
Recall from Section 3.1 that we can take two approaches to train a YADA model. First, we can partition the data according to class label, which produces one trained YADA model for each class. To produce synthetic data for this case, we apply the procedures described below for each YADA model separately. Second, we train a single YADA model on all the available data regardless of class. In this approach, we simply treat the label as an additional “feature” and build and train a single model on d + 1 features.

3.5.1. Random Sampling

The first approach to creating synthetic data is by simple random sampling, where samples are drawn independently at random from the trained joint distribution of a YADA model, i.e., Equation (7). This is the most straightforward and probably the most common way to produce synthetic data. An algorithm to create random samples of a YADA model requires two steps:
  • Create independent samples of a $d$-variate Gaussian vector with zero mean and covariance matrix $\mathbf{c}$;
  • Map the samples of $\{G_i\}$ to samples of $\{X_i\}$ using Equation (3).
There are numerous methods to execute step (1) provided by various statistical packages.
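A minimal sketch of these two steps is shown below; `inverse_cdfs[i](u)` is a hypothetical interface evaluating $F_i^{-1}(u)$ for a fitted model (for example, interpolated from the empirical cdf or a kernel density estimate).

```python
import numpy as np
from scipy import stats

def sample_yada(n, c, inverse_cdfs, rng=None):
    """Random sampling: draw n synthetic points from a trained YADA model."""
    rng = np.random.default_rng() if rng is None else rng
    d = c.shape[0]
    G = rng.multivariate_normal(np.zeros(d), c, size=n)   # step 1: correlated Gaussian samples
    U = stats.norm.cdf(G)                                 # Phi(G_i)
    # step 2: apply Equation (3) columnwise, X_i = F_i^{-1}(Phi(G_i))
    return np.column_stack([inv(U[:, i]) for i, inv in enumerate(inverse_cdfs)])
```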

3.5.2. Conditional Sampling

There may be scenarios where we want to create synthetic data with one or more of the features held constant at a fixed value. For example, we may want to produce synthetic images but know that pixels along the boundary should always be of one color; a conditional sampling approach can be used to achieve this.
Suppose the vector $\mathbf{X}$ of $d$ features is re-ordered so that all $d - m$ features with known fixed values are collected into vector $\mathbf{Z}$ and the remaining $m$ features are collected into vector $\mathbf{Y}$, i.e., $\mathbf{X} = (\mathbf{Y}, \mathbf{Z})$. An algorithm to create samples of $\mathbf{Y} \mid \mathbf{Z} = \mathbf{z}$ requires six steps:
  • Re-arrange the columns and rows of the covariance matrix $\mathbf{c}$ to be consistent with the re-ordered feature vector;
  • Partition $\mathbf{c}$ as described in Section 2.2.5 to create sub-matrices $\mathbf{c}_{\mathbf{YY}}$, $\mathbf{c}_{\mathbf{YZ}}$, and $\mathbf{c}_{\mathbf{ZZ}}$;
  • Compute $\mathbf{w}_{\mathbf{Z}}$, the Gaussian image of $\mathbf{z}$, using Equation (2);
  • Create samples of an $m$-variate Gaussian vector with mean $\mathbf{c}_{\mathbf{YZ}} \mathbf{c}_{\mathbf{ZZ}}^{-1} \mathbf{w}_{\mathbf{Z}}$ and covariance matrix $\mathbf{c}_{\mathbf{YY}} - \mathbf{c}_{\mathbf{YZ}} \mathbf{c}_{\mathbf{ZZ}}^{-1} \mathbf{c}_{\mathbf{YZ}}^T$;
  • Map these samples to samples of $(Y_1, \ldots, Y_m)^T$ using Equation (3);
  • Assemble the full feature vector by appending the fixed values $\mathbf{z}$, and then undo the re-ordering step.
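A sketch of the central steps (4-5) is given below; the re-ordering, partitioning, and re-assembly steps are assumed to have been done, and `inverse_cdfs_y[i](u)` is a hypothetical interface evaluating $F_i^{-1}(u)$ for the $i$th free feature.

```python
import numpy as np
from scipy import stats

def sample_conditional(n, c_yy, c_yz, c_zz, w_z, inverse_cdfs_y, rng=None):
    """Draw n samples of Y | Z = z given the partitioned covariance and the Gaussian image w_z."""
    rng = np.random.default_rng() if rng is None else rng
    mean = c_yz @ np.linalg.solve(c_zz, w_z)                  # c_YZ c_ZZ^{-1} w_Z
    cov = c_yy - c_yz @ np.linalg.solve(c_zz, c_yz.T)         # c_YY - c_YZ c_ZZ^{-1} c_YZ^T
    G = rng.multivariate_normal(mean, cov, size=n)            # step 4
    U = stats.norm.cdf(G)
    return np.column_stack([inv(U[:, i]) for i, inv in enumerate(inverse_cdfs_y)])  # step 5
```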

3.5.3. High Probability Sampling

For some applications, it may be of interest to create synthetic data from samples that occur with high probability, i.e., samples “near” the mean of the joint distribution instead of out in the tails. Let $p \in [0, 1]$ be the target probability value. The idea is to modify the algorithm for random sampling presented above to include an intermediate step, where we compute the Mahalanobis distance of each sample from the origin and keep only those samples with squared distances less than $r^2 = (\chi_d^2)^{-1}(p)$, where $\chi_d^2$ denotes the cdf of the chi-squared distribution with $d$ degrees-of-freedom. Hence, if $p$ is close to zero, many samples are rejected and only those very close to the origin are retained. Likewise, all samples are retained when $p = 1$, i.e., random sampling.
An algorithm to create p-probability samples requires five steps:
  • Create one sample of a $d$-variate Gaussian vector with zero mean and covariance matrix $\mathbf{c}$;
  • Compute $m^2$, the square of the Mahalanobis distance of this sample to the origin;
  • Retain the sample if $m^2 \le r^2$; otherwise, reject it;
  • Repeat steps 1–3 until the desired number of samples have been retained;
  • Map the retained samples of $\{G_i\}$ to samples of $\{X_i\}$ using Equation (3).
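A vectorized sketch of this accept/reject scheme follows; it proposes Gaussian draws in batches rather than one at a time, which is an implementation convenience rather than part of the algorithm above, and `inverse_cdfs` is the same hypothetical marginal-inverse interface used earlier.

```python
import numpy as np
from scipy import stats

def sample_high_probability(n, c, inverse_cdfs, p=0.5, rng=None):
    """Keep only Gaussian draws whose squared Mahalanobis distance is below the p-quantile
    of the chi-squared(d) distribution, then map them to feature space via Equation (3)."""
    rng = np.random.default_rng() if rng is None else rng
    d = c.shape[0]
    r2 = stats.chi2(df=d).ppf(p)                              # r^2 = (chi^2_d)^{-1}(p)
    c_inv = np.linalg.pinv(c)
    kept = []
    while len(kept) < n:
        g = rng.multivariate_normal(np.zeros(d), c, size=4 * n)      # propose a batch
        m2 = np.einsum("ij,jk,ik->i", g, c_inv, g)                   # squared distances
        kept.extend(g[m2 <= r2][: n - len(kept)])                    # accept/reject step
    G = np.array(kept)
    U = stats.norm.cdf(G)
    return np.column_stack([inv(U[:, i]) for i, inv in enumerate(inverse_cdfs)])
```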

3.5.4. Applications

In this section, we illustrate the different sampling methods for several example applications. First, consider again the cancer cell data [35] introduced in Section 3.3. Here, we trained a YADA model for class B and another for class M, then used them to create 500 synthetic data points; random sampling was used. Figure 11 illustrates scatter plots for six random pairings of the thirty available features. Each panel illustrates both the 569 training data (’x’) and 500 synthetic data (’o’) for both benign (B) and malignant (M) class labels.
As a second example, recall the MNIST dataset of images of handwritten digits 0, 1, …, 9 [34]. We trained a YADA model for each class, then used them to create synthetic images of handwritten digits. Within the dataset, there are numerous features (pixels) that are blank for all images considered (e.g., the four corners of an image). Conditional sampling was therefore utilized to ensure that the synthetic images also had these same blank pixels. Figure 12 illustrates some of the results; original training images are shown in the top row, and synthetic images of the same class are shown in the bottom row. These results demonstrate that capturing the marginal distributions and pairwise correlations of the training data is sufficient to produce adequate synthetic images of handwritten digits.
A third example illustrates high probability sampling; Figure 13 illustrates 500 synthetic data from YADA models trained to the Fisher iris dataset. From left to right, the scatter plots show data for two of the features assuming probability values $p = 1$, $p = 0.5$, $p = 0.1$, and $p = 0.005$. As $p$ approaches zero, the samples concentrate increasingly close to the sample mean, denoted by a black ‘x’; the case where $p = 1$ is identical to simple random sampling.
As a final example, we consider the bone marrow transplant dataset [52] containing 36 features from 187 pediatric patients, including age, body mass, dosage, and recovery time. An MLP classifier was trained on these data to predict patient survival status; the mean accuracy of the classifier for a 5-fold cross validation analysis was 70%. We also trained a YADA model on these data and applied random sampling to produce an additional 250 synthetic training data. Once re-trained, the mean accuracy of the classifier improved to 83%.

4. Conclusions

While deep learning (DL) methods have enjoyed much success in recent years, they remain overly complex, over-parameterized, overconfident in predicting unknown labels, and difficult to explain and/or interpret. This motivated our work on a probabilistic model for ML applications that we refer to as Yet Another Discriminant Analysis (YADA). YADA is a general-purpose probabilistic model that can be used for a wide variety of ML tasks that DL models do not perform well on, including (1) providing explanations; (2) quantifying the uncertainty in a prediction; and (3) out-of-distribution detection. Further, YADA has few parameters and provides a mechanism to predict labels with explanations while also providing an associated confidence score from a single model (typically, predictions, explanations, and confidence are not all provided by a single model). YADA can represent continuous or discrete-valued features with arbitrary marginal distributions that exhibit nonzero correlations. It is based on a mathematically rigorous foundation so that joint distributions, joint likelihood functions, joint entropy, K-L divergence, and other properties are readily derived. In addition, realistic synthetic data can be generated using YADA in a variety of ways. The YADA algorithm is based on the translation random vector or Gaussian copula, which has seen widespread use in finance, engineering, climate science, and other disciplines, but its use for ML applications remains limited. As mentioned, YADA can be less accurate for classification tasks than many DL models and, for this reason, we do not make any direct comparisons on classification accuracy.
We view YADA as a first step in developing probabilistic models based on a mathematically rigorous foundation, and see several areas for improvement and future work. Currently, YADA operates on raw features, whereas DL algorithms are able to extract features through the layers of a neural network. Integrating feature learning could help improve the performance of YADA. Also, YADA is limited by modeling only correlations between pairs of features; modeling more complex feature interactions would increase its fidelity. We also note that DL has benefited from significant engineering research to improve the hardware and software stacks that make training and inference more efficient. Similar gains would be helpful for scaling YADA, which is currently limited by the cost of inverting the covariance matrix.
Our intention is that this paper provides a foundation for further improvement in mathematically based probabilistic models that are simple enough to understand, rather than starting from overly complex models and trying to extract meaning from them. This paper serves as an alternative approach, starting from simpler models that are largely overlooked by the ML community.

Author Contributions

R.V.F.J. contributed to conceptualization, methodology, software, validation, formal analysis, and all writing. M.R.S. contributed to conceptualization, methodology, software, validation, writing—review and editing, project administration, and funding acquisition. J.B.I. contributed to software and validation. E.J.W. contributed to software and validation. All authors have read and agreed to the published version of the manuscript.

Funding

This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Data Availability Statement

We utilized datasets that are readily available to the public, including (1) the Fisher iris dataset [53] designed to quantify the morphological variation in iris flowers of 3 related species; (2) the Palmer Archipelago penguin dataset [54], consisting of 5 different measurements of 344 penguins of 3 different species; (3) a dataset containing 30 measurements of 570 human cancer cells [35] to determine whether the cells in the dataset are benign or malignant; (4) a linguistics dataset of phonemes [30], which is the smallest unit of speech distinguishing 1 word (or word element) from another; and (5) the bone marrow transplant dataset [52] containing 36 features from 187 pediatric patients, including age, body mass, dosage, and recovery time. Several image datasets were also studied, including (6) The Modified National Institute of Standards and Technology (MNIST) image dataset of handwritten digits [34]; (7) CIFAR-10 [42], a 10-class dataset for general object detection containing 50,000 training samples; (8) CIFAR-100 [42], which is similar to CIFAR-10 except that it has 100 classes; (9) Tiny ImageNet (TIN) [43], an object detection dataset that is a subset of ImageNet; (10) SVHN [45], a dataset of mostly house numbers representing digits in natural images; (11) Texture [46], which contains a collection of textural images in the wild; and (12) Places365 [47], a general scene recognition dataset.

Acknowledgments

The authors would like to acknowledge Esha Datta, Eva Domschot, and Veronika Neeley at Sandia National Laboratories for many helpful technical discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Discrete-Valued Random Variables with YADA

Let $X$ be a discrete random variable that takes values $x_1 < \cdots < x_r$ with probabilities $p_1, \ldots, p_r$, where $\sum_{i=1}^{r} p_i = 1$. In this section, we show that $X$ can be expressed in terms of $G \sim N(0,1)$, a standard Gaussian random variable, as was done in Sections 2.1 and 2.2 under the assumption that $X$ is continuous. In particular, we develop results analogous to Equations (1), (2), and (6) below.
The cdf and probability mass function (pmf) of $X$ are given by
$$F(x) = \sum_{i=1}^{r} p_i \, \mathbf{1}(x \ge x_i) \quad \text{and} \quad f(x) = \sum_{i=1}^{r} p_i \, \mathbf{1}(x = x_i), \qquad \text{(A1)}$$
respectively, where $\mathbf{1}(A)$ is the indicator function, equal to one if event $A$ is true and equal to zero otherwise; $f(x)$, defined by Equation (A1), is a version of Equation (6) for discrete variables. It follows that we can express $X$ as
$$X = h(G) = \sum_{i=1}^{r} x_i \, \mathbf{1}(G \in \beta_i), \qquad \text{(A2)}$$
where the $\beta_i = (a_{i-1}, a_i]$ are sets on the real line, with the convention $\beta_1 = (-\infty, a_1]$ and $\beta_r = (a_{r-1}, \infty)$. We can think of $\{\beta_i\}$ and $\{a_i\}$ as bins and bin boundaries, respectively, with boundaries defined by
$$a_i = \Phi^{-1}(p_1 + \cdots + p_i), \quad i = 1, \ldots, r-1.$$
As proof that Equation (A2) is a valid representation for $X$, note that
$$\Pr(X = x_i) = \Pr(h(G) = x_i) = \Pr(G \in \beta_i) = \Phi(a_i) - \Phi(a_{i-1}) = (p_1 + \cdots + p_i) - (p_1 + \cdots + p_{i-1}) = p_i, \quad i = 1, \ldots, r-1.$$
For the special case of $i = r$, $\Pr(X = x_r) = \Pr(G \in \beta_r) = 1 - \sum_{i=1}^{r-1} p_i = p_r$.
The mapping defined by Equation (A2) is a quantization of the Gaussian distribution and represents a version of Equation (1) for arbitrary discrete variables. For example, take $r = 2$ with $x_1 = 0$, $x_2 = 1$, $p_1 = q$, and $p_2 = 1 - q$, where $q \in (0,1)$. Then, by Equation (A2), $X = \mathbf{1}(G > a)$ with $a = \Phi^{-1}(q)$ is a Bernoulli random variable with $\Pr(X = 0) = q$ and $\Pr(X = 1) = 1 - q$.
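To make the quantization of Equation (A2) concrete, the following minimal Python sketch builds the bin boundaries $a_i$ from cumulative probabilities and maps standard Gaussian draws to discrete values. The specific values $x_i$, probabilities $p_i$, and sample size are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.0, 1.0, 2.5])        # discrete values x_1 < x_2 < x_3 (illustrative)
p = np.array([0.2, 0.5, 0.3])        # probabilities p_i, summing to one (illustrative)

# Bin boundaries a_i = Phi^{-1}(p_1 + ... + p_i), i = 1, ..., r - 1.
a = norm.ppf(np.cumsum(p)[:-1])

# Draw standard Gaussian samples and quantize them into the bins beta_i,
# which is the mapping X = h(G) of Equation (A2).
g = np.random.default_rng(0).standard_normal(100_000)
idx = np.searchsorted(a, g)          # index i such that g lands in beta_i
samples = x[idx]

# The empirical frequencies should be close to p = [0.2, 0.5, 0.3].
print([round(float(np.mean(samples == xi)), 3) for xi in x])
```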
We interpret the inverse of Equation (A2) as
$$G = w(X) = \sum_{i=1}^{r} \tilde{G}_i \, \mathbf{1}(X = x_i),$$
where $\tilde{G}_i$ is a truncated normal random variable with support $\beta_i$, that is,
$$\Pr(\tilde{G}_i \le z) = \frac{\Phi(z) - \Phi(a_{i-1})}{p_i}, \quad z \in \beta_i = (a_{i-1}, a_i].$$
This is a version of Equation (2) for the case where X is a discrete random variable.
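As a companion to the sketch above, the snippet below draws from the truncated normals $\tilde{G}_i$ with SciPy to realize the inverse mapping $G = w(X)$; again, the values and probabilities are assumptions made only for this example.

```python
import numpy as np
from scipy.stats import norm, truncnorm

x = np.array([0.0, 1.0, 2.5])            # discrete values x_i (illustrative)
p = np.array([0.2, 0.5, 0.3])            # probabilities p_i (illustrative)
a = norm.ppf(np.cumsum(p)[:-1])          # finite bin boundaries a_1, ..., a_{r-1}
lo = np.concatenate(([-np.inf], a))      # left edges a_{i-1} of the bins beta_i
hi = np.concatenate((a, [np.inf]))       # right edges a_i of the bins beta_i

rng = np.random.default_rng(1)

def w(x_obs):
    """Map an observed discrete value x_i to a draw from the truncated normal on beta_i."""
    i = int(np.searchsorted(x, x_obs))                     # index of x_i
    return truncnorm.rvs(lo[i], hi[i], random_state=rng)   # support (a_{i-1}, a_i]

print(w(1.0))   # one Gaussian sample consistent with observing X = 1.0
```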

Appendix B. Derivative of w(x)

Recall that $w(x) = h^{-1}(x)$ is the inverse mapping introduced by Equation (2) for the case of continuous-valued features. The joint pdf of the feature vector defined by Equation (7) depends on $w'(x)$, the derivative of this function, which we derive below.
First, note that $\Phi : \mathbb{R} \to (0,1)$ and $\Phi^{-1} : (0,1) \to \mathbb{R}$, the cdf of the $N(0,1)$ random variable and its inverse, are strictly increasing, continuous, bijective functions. If we let $z = \Phi^{-1}(y)$, then $y = \Phi(z)$ and
$$\frac{dy}{dz} = \Phi'(z) = \phi(z),$$
where $\phi(z)$ denotes the pdf of the $N(0,1)$ random variable. Then,
$$\frac{d}{dy}\,\Phi^{-1}(y) = \frac{dz}{dy} = \frac{1}{\phi(z)} = \frac{1}{\phi\!\left(\Phi^{-1}(y)\right)}.$$
It follows (by the chain rule) that
$$w'(x) = \left(\Phi^{-1} \circ F\right)'(x) = \left((\Phi^{-1})' \circ F\right)(x) \cdot F'(x) = \frac{f(x)}{\phi\!\left(\Phi^{-1}(F(x))\right)} = \frac{f(x)}{\phi\!\left(w(x)\right)},$$
where $f$ and $F$ are the pdf and cdf of the random variable $X$ defined by Equation (1). Further, $w'(x) \ge 0$ because both $f$ and $\phi$ are pdfs and, therefore, non-negative.
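The identity above is easy to check numerically. The minimal sketch below compares $w'(x) = f(x)/\phi(w(x))$ with a central finite difference; the exponential choice of $F$ and the evaluation point are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import norm, expon

def w(x):
    """w(x) = Phi^{-1}(F(x)) for an exponential feature distribution F (illustrative)."""
    return norm.ppf(expon.cdf(x))

def w_prime(x):
    """Closed-form derivative from Appendix B: w'(x) = f(x) / phi(w(x))."""
    return expon.pdf(x) / norm.pdf(w(x))

x0, h = 1.3, 1e-5
finite_diff = (w(x0 + h) - w(x0 - h)) / (2 * h)   # central finite-difference estimate
print(w_prime(x0), finite_diff)                    # the two values should agree closely
```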

Appendix C. Tail Modeling with YADA

Suppose we have samples $x_1, \ldots, x_n$ of random variable $X$, and assume that the samples have been sorted in ascending order. The empirical cdf built from these data is equal to zero for all $x < x_1$ and equal to one for all $x \ge x_n$. It is sometimes useful to have models that allow us to extrapolate outside of the interval defined by the observed data, i.e., for $x < x_1$ and/or $x > x_n$: the left and right tails of the distribution of $X$.
We introduce a method, referred to as tail modeling, to extrapolate outside of the observed data. The idea is simply to choose a distribution with a known functional form for the left and/or right tails, and calibrate these distributions using the available data to enforce continuity at x = x 1 and x = x n .
We consider three different tail models, selected because they offer different rates of decay in the tails; other tail models can of course be added if necessary. The tail models considered include the following:
  • Gaussian/normal tails, with pdf and cdf:
    $$f_G(x; \mu, \sigma) = \frac{1}{\sigma}\,\phi\!\left(\frac{x-\mu}{\sigma}\right) \quad \text{and} \quad F_G(x; \mu, \sigma) = \Phi\!\left(\frac{x-\mu}{\sigma}\right),$$
    where $\mu$ and $\sigma > 0$ are parameters, and
    $$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \quad \text{and} \quad \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\, du$$
    denote the pdf and cdf, respectively, of the standard normal random variable. We note that the tails of the normal pdf decay to zero as $e^{-x^2}$ for $x \to \pm\infty$.
  • Exponential tails, with pdf and cdf:
    $$f_E(x; \mu, b) = \frac{1}{2b}\, e^{-|x-\mu|/b} \quad \text{and} \quad F_E(x; \mu, b) = \frac{1}{2} + \frac{1}{2}\,\operatorname{sgn}(x-\mu)\left(1 - e^{-|x-\mu|/b}\right),$$
    where $\mu$ and $b > 0$ are parameters. The tails of this pdf decay to zero as $e^{-|x|}$ for $x \to \pm\infty$, which is slower than the decay rate of the Gaussian pdf.
  • Log-normal tails, with pdf and cdf:
    $$f_L(x; \mu, \sigma, c) = \frac{1}{2\sigma\,|x-c|}\,\phi\!\left(\frac{\ln(|x-c|) - \mu}{\sigma}\right) \quad \text{and} \quad F_L(x; \mu, \sigma, c) = \frac{1}{2}\left[1 + \operatorname{sgn}(x-c)\,\Phi\!\left(\frac{\ln(|x-c|) - \mu}{\sigma}\right)\right],$$
    where $\mu$, $c$, and $\sigma > 0$ are parameters. The tails of this pdf decay at a rate in between those of the Gaussian and exponential pdfs.
Given the data $x_1, \ldots, x_n$, we implement these tail models as follows:
  • Use the empirical cdf for $x_1 \le x < x_n$;
  • Compute $\hat{\mu}$, the sample mean of the data;
  • For any value $x < x_1$ (the left tail), set the pdf and cdf to $f_1$ and $F_1$ as specified in Table A1;
  • For any value $x \ge x_n$ (the right tail), set the pdf and cdf to $f_n$ and $F_n$ as specified in Table A1.
Note that we can choose to model only the left tail, only the right tail, or both. Further, the functional forms of the left and right tail models need not be identical.
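As an illustration of the calibration specified in Table A1, the sketch below fits a Gaussian right-tail model to a small sample by enforcing continuity at $x_n$. The synthetic data, the choice of a Gaussian tail, and modeling only the right tail are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
data = np.sort(rng.standard_normal(20))   # n = 20 ordered samples (illustrative)
n = len(data)
mu_hat = data.mean()                      # sample mean, used as the tail location
x_n = data[-1]                            # largest observation

# Enforce continuity at x_n: F_G(x_n; mu_hat, sigma_n) = 1 - 1/(2n),
# giving sigma_n = (x_n - mu_hat) / Phi^{-1}(1 - 1/(2n)) as listed in Table A1.
sigma_n = (x_n - mu_hat) / norm.ppf(1.0 - 1.0 / (2 * n))

def cdf_right_tail(x):
    """Extrapolated cdf for x >= x_n using the calibrated Gaussian tail model."""
    return norm.cdf(x, loc=mu_hat, scale=sigma_n)

print(cdf_right_tail(x_n), cdf_right_tail(x_n + 1.0))
```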
To illustrate the tail modeling approach, suppose we have $n = 20$ random samples of $X$. The cdf and left tail models are illustrated in the left panel of Figure A1; a log scale is applied to increase the visibility of the tail. The complementary cdf and right tail models are shown in the right panel of Figure A1. The blue line indicates the empirical cdf, which drops immediately to zero for $x < x_1 \approx -3.1$ and for $x > x_n \approx 1.2$. The three tail models allow the cdf to be extrapolated outside of the range $[x_1, x_n]$. The exponential model (green) exhibits the slowest decay to zero, followed by the log-normal (red) and Gaussian (orange) models.
Table A1. The left and right tail models for the pdf and cdf.

Type | $f_1(x)$, $x < x_1$ | $f_n(x)$, $x > x_n$ | $F_1(x)$, $x < x_1$ | $F_n(x)$, $x > x_n$ | Parameters
Gaussian | $f_G(x; \hat{\mu}, \sigma_1)$ | $f_G(x; \hat{\mu}, \sigma_n)$ | $F_G(x; \hat{\mu}, \sigma_1)$ | $F_G(x; \hat{\mu}, \sigma_n)$ | $\sigma_1 = \dfrac{x_1 - \hat{\mu}}{\Phi^{-1}(1/n)}$, $\sigma_n = \dfrac{x_n - \hat{\mu}}{\Phi^{-1}\!\left(1 - \tfrac{1}{2n}\right)}$
Exponential | $f_E(x; \hat{\mu}, b_1)$ | $f_E(x; \hat{\mu}, b_n)$ | $F_E(x; \hat{\mu}, b_1)$ | $F_E(x; \hat{\mu}, b_n)$ | $b_1 = \dfrac{x_1 - \hat{\mu}}{\ln(2/n)}$, $b_n = \dfrac{x_n - \hat{\mu}}{\ln(n)}$
Log-normal | $f_L(x; \hat{\mu}, b_1)$ | $f_L(x; \hat{\mu}, b_n)$ | $F_L(x; \hat{\mu}, b_1)$ | $F_L(x; \hat{\mu}, b_n)$ | $b_1 = \dfrac{x_1 - \hat{\mu}}{\ln(2/n)}$, $b_n = \dfrac{x_n - \hat{\mu}}{\ln(n)}$
Figure A1. Example models for the left and right tails of the cdf.

References

  1. Tang, H.; Houthooft, R.; Foote, D.; Stooke, A.; Xi Chen, O.; Duan, Y.; Schulman, J.; DeTurck, F.; Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; pp. 2750–2759. [Google Scholar]
  2. Yamada, I.; Asai, A.; Shindo, H.; Takeda, H.; Matsumoto, Y. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. arXiv 2020, arXiv:2010.01057. [Google Scholar] [CrossRef]
  3. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar] [CrossRef]
  4. Dumitru, R.G.; Peteleaza, D.; Craciun, C. Using DUCK-Net for polyp image segmentation. Sci. Rep. 2023, 13, 9803. [Google Scholar] [CrossRef] [PubMed]
  5. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  6. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv 2015, arXiv:1412.6572. [Google Scholar] [CrossRef]
  7. Ilyas, A.; Santurkar, S.; Tsipras, D.; Engstrom, L.; Tran, B.; Madry, A. Adversarial examples are not bugs, they are features. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NIPS’19), Vancouver, BC, Canada, 8–14 December 2019; pp. 125–136. [Google Scholar]
  8. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, Sydney, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
  9. Hendrycks, D.; Mazeika, M.; Dietterich, T. Deep Anomaly Detection with Outlier Exposure. arXiv 2019, arXiv:1812.04606. [Google Scholar] [CrossRef]
  10. Ribeiro, M.T.; Singh, S.; Guestrin, C. Model-Agnostic Interpretability of Machine Learning. arXiv 2016, arXiv:1606.05386. [Google Scholar] [CrossRef]
  11. Lundberg, S.M.; Erion, G.G.; Lee, S.I. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv 2019, arXiv:1802.03888. [Google Scholar] [CrossRef]
  12. Jobin, A.; Ienca, M.; Vayena, E. The global landscape of AI ethics guidelines. Nat. Mach. Intell. 2019, 1, 389–399. [Google Scholar] [CrossRef]
  13. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning. PMLR, New York, NY, USA, 20–22 June 2016; pp. 1050–1059. [Google Scholar]
  14. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; pp. 6405–6416. [Google Scholar]
  15. Grigoriu, M. Crossings of Non-Gaussian Translation Processes. J. Eng. Mech. 1984, 110, 610–620. [Google Scholar] [CrossRef]
  16. Arwade, S.R. Translation vectors with non-identically distributed components. Probabilistic Eng. Mech. 2005, 20, 158–167. [Google Scholar] [CrossRef]
  17. Elidan, G. Copulas in Machine Learning. In Proceedings of the Copulae in Mathematical and Quantitative Finance, Cracow, Poland, 10–11 July 2013; Jaworski, P., Durante, F., Härdle, W.K., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 39–60. [Google Scholar]
  18. Kosowski, R.L.; Neftci, S.N. Principles of Financial Engineering, 3rd ed.; Academic Press: New York, NY, USA, 2015. [Google Scholar]
  19. Targino, R.S.; Peters, G.W.; Shevchenko, P.V. Sequential Monte Carlo Samplers for capital allocation under copula-dependent risk models. Insur. Math. Econ. 2015, 61, 206–226. [Google Scholar] [CrossRef]
  20. Lapuyade-Lahorgue, J.; Xue, J.H.; Ruan, S. Segmenting Multi-Source Images Using Hidden Markov Fields With Copula-Based Multivariate Statistical Distributions. IEEE Trans. Image Process. 2017, 26, 3187–3195. [Google Scholar] [CrossRef] [PubMed]
  21. Qian, D.; Wang, B.; Qing, X.; Zhang, T.; Zhang, Y.; Wang, X.; Nakamura, M. Drowsiness Detection by Bayesian-Copula Discriminant Classifier Based on EEG Signals During Daytime Short Nap. IEEE Trans. Biomed. Eng. 2017, 64, 743–754. [Google Scholar] [CrossRef]
  22. Meyer, D.; Nagler, T.; Hogan, R.J. Copula-based synthetic data augmentation for machine-learning emulators. Geosci. Model Dev. 2021, 14, 5205–5215. [Google Scholar] [CrossRef]
  23. Schölzel, C.; Friederichs, P. Multivariate non-normally distributed random variables in climate research - Introduction to the copula approach. Nonlinear Process. Geophys. 2008, 15, 761–772. [Google Scholar] [CrossRef]
  24. Field, R.V., Jr.; Constantine, P.; Boslough, M. Statistical surrogate models for prediction of high-consequence climate change. Int. J. Uncertain. Quantif. 2013, 3, 341–355. [Google Scholar] [CrossRef]
  25. Yuan, Z.; Wang, J.; Worrall, D.M.; Zhang, B.B.; Mao, J. Determining the Core Radio Luminosity Function of Radio AGNs via Copula. Astrophys. J. Suppl. Ser. 2018, 239, 33. [Google Scholar] [CrossRef]
  26. Carrillo, J.A.; Nieto, M.; Velez, J.F.; Velez, D. A New Machine Learning Forecasting Algorithm Based on Bivariate Copula Functions. Forecasting 2021, 3, 355–376. [Google Scholar] [CrossRef]
  27. Gawlikowski, J.; Tassi, C.R.N.; Ali, M.; Lee, J.; Humt, M.; Feng, J.; Kruspe, A.; Triebel, R.; Jung, P.; Roscher, R.; et al. A survey of uncertainty in deep neural networks. Artif. Intell. Rev. 2023, 56, 1513–1589. [Google Scholar] [CrossRef]
  28. Ganian, R.; Korchemna, V. The complexity of Bayesian network learning: Revisiting the superstructure. Adv. Neural Inf. Process. Syst. 2021, 34, 430–442. [Google Scholar]
  29. Nataf, A. Détermination des distributions dont les marges sont données. Comptes Rendus de l’Académie des Sciences 1962, 225, 42–43. [Google Scholar]
  30. The Phoneme Dataset. Available online: https://www.kaggle.com/datasets/timrie/phoneme (accessed on 19 September 2023).
  31. Grigoriu, M. Applied Non-Gaussian Processes; P T R Prentice-Hall: Englewood Cliffs, NJ, USA, 1995. [Google Scholar]
  32. Field, R.V., Jr.; Grigoriu, M. A method for the efficient construction and sampling of vector-valued translation random fields. Probabilistic Eng. Mech. 2012, 29, 79–91. [Google Scholar] [CrossRef]
  33. Papoulis, A.; Pillai, S.U. Probability, Random Variables, and Stochastic Processes, 4th ed.; McGraw-Hill, Inc.: New York, NY, USA, 2002. [Google Scholar]
  34. The MNIST Dataset. Available online: https://www.kaggle.com/datasets/hojjatk/mnist-dataset (accessed on 29 October 2024).
  35. The Cancer Dataset. Available online: https://www.kaggle.com/datasets/erdemtaha/cancer-data (accessed on 19 September 2023).
  36. Hüllermeier, E.; Waegeman, W. Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods. Mach. Learn. 2021, 110, 457–506. [Google Scholar] [CrossRef]
  37. Stracuzzi, D.J.; Chen, M.G.; Darling, M.C.; Peterson, M.G.; Vollmer, C. Uncertainty Quantification for Machine Learning; Technical Report SAND2017-6776; Sandia National Laboratories: Albuquerque, NM, USA, 2017. [Google Scholar]
  38. Wei, H.; Xie, R.; Cheng, H.; Feng, L.; An, B.; Li, Y. Mitigating neural network overconfidence with logit normalization. In Proceedings of the International Conference on Machine Learning. PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 23631–23644. [Google Scholar]
  39. Smith, M.R.; Martinez, T. Improving classification accuracy by identifying and removing instances that should be misclassified. In Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2690–2697. [Google Scholar]
  40. Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; Zhang, G. Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. 2018, 31, 2346–2363. [Google Scholar] [CrossRef]
  41. Targ, S.; Almeida, D.; Lyman, K. Resnet in Resnet: Generalizing Residual Architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar] [CrossRef]
  42. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report 0; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  43. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS’12)—Volume 1, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  44. Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag. 2012, 29, 141–142. [Google Scholar] [CrossRef]
  45. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading digits in natural images with unsupervised feature learning. In Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 12–17 December 2011; p. 4. [Google Scholar]
  46. Kylberg, G. The Kylberg Texture Dataset v. 1.0. External Report (Blue Series) 35; Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University: Uppsala, Sweden, 2014. [Google Scholar]
  47. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1452–1464. [Google Scholar] [CrossRef]
  48. Yang, J.; Wang, P.; Zou, D.; Zhou, Z.; Ding, K.; Peng, W.; Wang, H.; Chen, G.; Li, B.; Sun, Y.; et al. Openood: Benchmarking generalized out-of-distribution detection. Adv. Neural Inf. Process. Syst. 2022, 35, 32598–32611. [Google Scholar]
  49. Lee, K.; Lee, K.; Lee, H.; Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18), Montreal, QC, Canada, 3–8 December 2018; pp. 7167–7177. [Google Scholar]
  50. Wang, H.; Li, Z.; Feng, L.; Zhang, W. Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4921–4930. [Google Scholar]
  51. Sun, Y.; Ming, Y.; Zhu, X.; Li, Y. Out-of-distribution detection with deep nearest neighbors. In Proceedings of the International Conference on Machine Learning. PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 20827–20840. [Google Scholar]
  52. Sikora, M.; Wróbel, L.; Gudyś, A. Bone Marrow Transplant: Children. UCI Machine Learning Repository. Clin. Transplant. 2020, 3, 12–18. [Google Scholar] [CrossRef]
  53. Fisher, R.A. Iris; UCI Machine Learning Repository: Los Angeles, CA, USA, 1988. [Google Scholar] [CrossRef]
  54. Horst, A.M.; Hill, A.P.; Gorman, K.B. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data; R Package Version 0.1.0; Zenodo: Geneva, Switzerland, 2020. [Google Scholar] [CrossRef]
Figure 8. Explanations for two MNIST test images. The white pixels are the features that contribute the most to the joint likelihood score and can serve as explanations.
Figure 9. Prediction confidence $\mathrm{conf}_i(x)$ defined by Equation (21) for the Fisher iris data.
Figure 10. Prediction confidence for the MNIST dataset. The top row illustrates the 5 test images with the greatest confidence values. The bottom row illustrates the test images with the greatest and least confidence for both the ‘2’ class (left two images) and the ‘6’ class (right two images).
Figure 11. Scatter plots of the cancer dataset for 6 random pairings of the 30 available features. Each panel illustrates the 569 real training data (‘×’) and 500 synthetic data (‘•’) for both benign (‘B’) and malignant (‘M’) class labels.
Figure 12. Images of handwritten digits: MNIST training data (top row) and synthetic images produced by YADA (bottom row).
Figure 13. Synthetic data using high-probability sampling for the Fisher iris data.
Table 1. AUC values for detecting OOD data. CIFAR-10 is considered in-distribution against several OOD datasets. Across all datasets, YADA is competitive with state-of-the-art approaches.

OOD Dataset | MDist | ViM | DNN | YADA with Empirical cdf | YADA with KDE
CIFAR-100 | 85.95 | 87.45 | 89.75 | 87.35 | 88.31
TIN | 87.63 | 89.68 | 91.71 | 89.32 | 87.89
MNIST | 88.61 | 94.27 | 94.41 | 91.97 | 96.62
SVHN | 91.83 | 94.48 | 93.01 | 90.96 | 92.72
Texture | 93.78 | 94.77 | 93.02 | 91.63 | 86.86
Places365 | 86.63 | 89.19 | 92.10 | 90.02 | 88.76
