1 Introduction

Multivariate distribution forecasting, also termed probabilistic forecasting, is an important tool for risk management in financial and insurance systems. Point forecasts summarise each univariate distribution by a single value, typically the mean or a quantile,Footnote 1 but accurate probabilistic forecasts of the entire multivariate distribution are necessary to fully understand the risk and uncertainty in the system. Only then can one determine which model is most appropriate for decision making.

Alongside the profusion of models for generating distribution forecasts, a prolific strand of theoretical research now focusses on developing methods for evaluating these forecasts. However, we shall argue that this strand of the literature requires further development. In particular, we need new methods for evaluating the performance metrics, not the models themselves. That is, we shall examine the metrics that are commonly applied to determine the accuracy of distribution forecasts in multivariate systems, and investigate the circumstances under which one metric is better than another.

A typical exercise in forecast evaluation is to select a set of performance metrics and then consider how competing models perform according to these metrics. However, this approach has many pitfalls: the results are often very sample specific, and there is the additional and widespread problem of data snooping bias—see Anghel (2021) for a review. Our paper turns the problem around: rather than implementing a type of ‘horse race’ between competing models based on a sample of historical data, we perform an assessment of the performance metrics themselves. We only employ historical data to calibrate realistic parameters for a true data generation process. Once the model parameters are calibrated, we simulate data using the true data generation process, and then calibrate the parameters of several ‘wrong’ models to these simulated data. Then, armed with a set of forecasts from the true model and the wrong models, we evaluate the ability of different performance metrics to discriminate between the true model and the wrong models. This approach is not entirely new to forecasting performance evaluation—see Pinson and Tastu (2013) and Scheuerer and Hamill (2015). However, most previous results only use simple (Gaussian or Poisson) data generation processes as the true model. We review these in more detail below.

In recent years it has become standard to quantify the accuracy of predictions using a proper scoring rule (see, for example, Gneiting and Raftery 2007). A wide variety of scoring rules has been proposed in the literature: Bao et al. (2007) advocate using the Kullback–Leibler information criterion which is derived from the logarithmic score; Gneiting and Raftery (2007) advocate using the continuous ranked probability score (CRPS) and Gneiting and Ranjan (2011) extend this to adopt the weighting approach of Amisano and Giacomini (2007) so that evaluation can be focused on a specific area of the distribution, such as a tail or the centre; Gneiting and Raftery (2007) generalise the CRPS to the energy score for multivariate distributions; Scheuerer and Hamill (2015) use the concept of variograms from geostatistics to derive another proper multivariate score; Hyvärinen (2005) and Parry et al. (2012) consider local scoring rules, which are based on the density function or its derivatives; and Diks et al. (2011) and Diks et al. (2014) generalise the log score to conditional likelihood and censored likelihood, which can allocate more weight to specific regions.

Proper scoring rules have also found diverse applications in different disciplines. Amongst many others: Hora and Kardeş (2015) use the quadratic score as the cost function to optimize the weights in forecast combination; Alexander et al. (2019) use the CRPS to evaluate the accuracy of risk models; Shang et al. (2019) use interval scores to evaluate forecasts of a volatility index; and Mahbobi et al. (2021) use the Brier score and cross entropy loss in a study of credit risk assessment.

Given the large number of scoring rules, conventional wisdom dictates applying one suited to the application at hand. While it is generally agreed that only proper scoring rules quantify the accuracy of probabilistic forecasts adequately (Winkler 1996; Gneiting and Ranjan 2011), the question of which of the proper scores to use remains largely open (Gneiting and Raftery 2007). This problem is especially relevant for multivariate evaluation, since the rankings of univariate scoring rules mostly coincide, which reduces the risk of conflicting conclusions (see, for example, Staël von Holstein 1970; Winkler 1971; Bickel 2007).

Some previous studies provide an analytic assessment of proper scoring rules for a specific forecasting problem, but these are restricted to a univariate setting in which some strong assumptions are made (see, for example, Machete 2013; Buja et al. 2005; Merkle and Steyvers 2013; Johnstone et al. 2011). Previous studies comparing multivariate proper scoring rules are limited to very simple simulation settings. In their evaluation of the energy score, Pinson and Tastu (2013) restrict themselves to a bivariate Gaussian as the true distribution. They introduce a useful discrimination heuristic which measures the average relative distance of the sub-optimal scores (i.e. those obtained from mis-specified models) from the score obtained from the true distribution. They conclude that the energy score is able to discriminate errors in mean but lacks sensitivity to errors in variance and especially to errors in correlation.Footnote 2 Scheuerer and Hamill (2015) compare three different variogram scores with the energy score and the score proposed by Dawid and Sebastiani (1999). Their system has 5 or 15 dimensions but again they only consider a very simple true model, i.e. a Gaussian or Poisson distribution. A simulation study assesses the ability of each score to identify the correct model through a simple heuristic, i.e. the score sample mean. Overall, only the variogram score with \(p=0.5\) is devoid of ranking issues in their study. Also, they confirm the finding of Pinson and Tastu (2013) that the energy score lacks sensitivity to misspecification in the dependency structure. Ziel and Berk (2019) compare the sensitivity of the energy score, variogram score and the score proposed by Dawid and Sebastiani (1999). They first consider a simulation study focusing on multivariate Gaussian distributions of unit variances and different correlations. Then they consider a time series empirical study where autoregressive (AR) models with different lags are applied to airline passengers. They conclude that the energy score performs the best in their simulation study, which seems to contradict the conclusion from Pinson and Tastu (2013). Diks and Fang (2020) use conditional likelihood and censored likelihood to compare multivariate and univariate approaches for forecasting portfolio returns. Their empirical study considers the univariate and multivariate generalized autoregressive conditional heteroscedasticity (GARCH) models with elliptical distributional assumptions applied to financial data. Their focus is not to compare scoring rules, but to illustrate that multivariate models do not necessarily lead to better forecasting accuracy for portfolio returns when compared with a univariate approach.

We make two major contributions to this literature. The first is to propose new metrics for discriminating between correctly specified and misspecified models using proper multivariate scoring rules. The second is to provide the first empirical study on real financial data of the relative performance of multivariate scoring rules. Most papers in the forecasting literature in finance and economics include a short empirical study, but extensive applications of proper scoring rules, even to univariate distribution forecasts of financial returns, are hard to find. We fill this gap in the literature by conducting an extensive comparison between different rules across a diverse selection of eight static and dynamic models for multivariate distributions. We do this in a forecasting setting using three large sets of daily financial data spanning 1991 to 2018: commodity prices, exchange rates and interest rates. Each data set has eight dimensions. We apply our new metrics to draw conclusions about the ability of different proper scoring rules to detect model misspecification. We do this by calibrating each model to one of the datasets and then simulating day-ahead values using the calibrated model. The exercise is repeated quarterly across many years and, every time, results are generated with each of eight different models as the true data generation process.

In the following: Sect. 2 describes the scoring rules; Sect. 3 describes the models for multivariate distribution predictions; Sect. 4 describes the design of the experiment; Sect. 5 presents our results and Sect. 6 concludes.

2 Review of scoring rules

Forecasting accuracy evaluations rely on loss measures to quantify the performance of a distribution forecast. As Diebold and Mariano (1995) mention, this loss generally depends on the underlying economic structures associated with the forecast. Scoring rules offer a promising measure by condensing the accuracy of a distribution forecast into a single, penalty-oriented value while retaining attractive statistical properties.

Let \({\mathcal {F}}\) be the convex class of distributions on \((\Omega ,{\mathcal {A}})\). Let y be an observation of the random variable Y and let F be a forecast of the distribution of Y. A scoring rule is a function

$$\begin{aligned} S(F,y):{\mathcal {F}}\times \Omega \longrightarrow {\mathbb {R}}\cup \{-\infty ,\infty \} \end{aligned}$$

A scoring rule S is said to be proper if, for any forecast distribution F and the true distribution G, the following holds:

$$\begin{aligned} {\mathbb {E}}_{G}S(G,\cdot )\le {\mathbb {E}}_{G}S(F,\cdot ). \end{aligned}$$
(1)

Further, a scoring rule is strictly proper if equality in Eq. (1) holds only when \(F=G\), i.e. the true distribution G is the unique minimiser.

Propriety of a scoring rule is important because the ideal forecast is preferred irrespective of the cost-loss structure (Diebold et al. 1998; Granger and Pesaran 2000). A proper scoring rule is designed so that a forecaster who believes the future distribution to be G has no incentive to predict any distribution \(F\ne G\) (Gneiting et al. 2007). The term was coined by Winkler (1977, 1996), who shows that proper scoring rules test for both calibration and sharpness of a distribution forecast simultaneously. The use of non-proper scoring rules is generally not recommended since they can lead to incorrect inferences (Gneiting and Ranjan 2011).

The most well-known score is undoubtedly the log score, defined as follows,

$$\begin{aligned} \text{ LogS }(F,y)=-\log (f(y)), \end{aligned}$$

where f is the probability density function (PDF) of the distribution F. The log score has been widely applied in both model evaluation and estimation. However, it becomes infinite whenever the predictive density vanishes at the realization, i.e. when \(f(y)=0\). To overcome this weakness, the quadratic and pseudo-spherical scores have been considered as alternative PDF-based scores

$$\begin{aligned} \text{ QS }(F,y)&=\Vert f\Vert _{2}^{2}-2f(y),\\ \text{ PseudoS }(F,y)&=-f(y)^{\alpha -1}/\Vert f\Vert _{\alpha }^{\alpha -1}, \end{aligned}$$

where \(\Vert f\Vert _{\alpha }{:}{=}\left( \int f(y)^{\alpha }\,\mu (\text{ d }y)\right) ^{1/\alpha }.\) The spherical score is a special case of the pseudo-spherical score with \(\alpha =2\). All three scores are strictly proper. However, these scores have been criticised for crediting forecasts only for a high probability at the realized value, and not for high probabilities of values near it (Gneiting and Raftery 2007). In addition, the PDF-based scores rely on predictive densities which might not be available, especially with ensemble forecasts. In view of these limitations, efforts have been devoted to developing alternative scores that do not require predictive densities. In the rest of this section, we briefly review three such scores: the CRPS, the energy score and the variogram score.
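To make the PDF-based scores concrete, the following sketch evaluates the log, quadratic and spherical (pseudo-spherical with \(\alpha =2\)) scores for a Gaussian predictive density, all written in negatively oriented form so that lower values are better, consistent with Eq. (1). It relies on the closed form \(\Vert f\Vert _{2}^{2}=1/(2\sigma \sqrt{\pi })\) for a Gaussian density; the parameter values and function names are illustrative and not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

# Negatively oriented PDF-based scores for a Gaussian forecast N(mu, sigma^2),
# evaluated at a realization y. For a Gaussian density, ||f||_2^2 = 1/(2*sigma*sqrt(pi)).

def log_score(mu, sigma, y):
    return -norm.logpdf(y, loc=mu, scale=sigma)

def quadratic_score(mu, sigma, y):
    l2_sq = 1.0 / (2.0 * sigma * np.sqrt(np.pi))        # ||f||_2^2
    return l2_sq - 2.0 * norm.pdf(y, loc=mu, scale=sigma)

def spherical_score(mu, sigma, y):
    # pseudo-spherical score with alpha = 2, negatively oriented
    l2 = np.sqrt(1.0 / (2.0 * sigma * np.sqrt(np.pi)))  # ||f||_2
    return -norm.pdf(y, loc=mu, scale=sigma) / l2

print(log_score(0.0, 1.0, 1.2), quadratic_score(0.0, 1.0, 1.2), spherical_score(0.0, 1.0, 1.2))
```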

2.1 Continuous ranked probability score

The CRPS, introduced by Matheson and Winkler (1976) and augmented by Gneiting and Ranjan (2011), has been widely used to properly compare distribution forecasts with a potential focus on certain regions of interest. The CRPS is defined as

$$\begin{aligned} \text{ CRPS}_{\nu }(F,y)&=\int _{0}^{1}\text{ QS}_{\alpha }(F^{-1}(\alpha ),y)\nu (\alpha )\,\text{ d }\alpha ,\nonumber \\ \text{ QS}_{\alpha }(F^{-1}(\alpha ),y)&=2(\mathbbm {1}\{y\le F^{-1}(\alpha )\}-\alpha )(F^{-1}(\alpha )-y) \end{aligned}$$
(2)

where \(F^{-1}(\alpha )=\inf \{x\,|\,F(x)\ge \alpha \}\) denotes the \(\alpha \) quantile of F, \(\nu :[0,1]\rightarrow {\mathbb {R}}_{\ge 0}\) is a quantile weight function and \(\text{ QS}_{\alpha }\) is known as the quantile score.

Apart from this quantile score representation, the CRPS can also be expressed using the Brier probability score through

$$\begin{aligned} \text{ CRPS}_{u}(F,y)&=\int _{-\infty }^{\infty }\text{ PS }(F(z),\mathbbm {1}\{y\le z\})u(z)\,\text{ d }z,\nonumber \\ \text{ PS }(F(z),\mathbbm {1}\{y\le z\})&=\left( F(z)-\mathbbm {1}\{y\le z\}\right) ^{2} \end{aligned}$$
(3)

with threshold weight function \(u:{\mathbb {R}}\rightarrow {\mathbb {R}}_{\ge 0}\) as shown by Laio and Tamea (2007). Given a realization y, the integral of Eq. (3) splits into two easily interpretable parts which get penalized by the score as visualized in Fig. 1. We remark that Eqs. (2) and (3) are generally not equivalent.

Additionally, Gneiting and Raftery (2007) derive the kernel score representation

$$\begin{aligned} \text{ CRPS}_{u}(F,y)&={\mathbb {E}}_{F}\left| Y-y\right| -\frac{1}{2}{\mathbb {E}}_{F}\left| Y-Y'\right| , \end{aligned}$$
(4)

where Y and \(Y'\) are independent random variables with sampling distribution F. This concise expression serves as a foundation for the generalization of CRPS to the multivariate energy score discussed in the next subsection.
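As a brief illustration of Eq. (4), the sketch below estimates the unweighted CRPS by Monte Carlo from draws of F and checks it against the known closed form for a standard normal forecast, \(\text{ CRPS }(\Phi ,y)=y(2\Phi (y)-1)+2\varphi (y)-1/\sqrt{\pi }\). The sample size, seed and function names are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def crps_kernel_mc(forecast_draws, y):
    """Monte Carlo estimate of the unweighted CRPS via the kernel
    representation E_F|Y - y| - 0.5 * E_F|Y - Y'|."""
    x = np.asarray(forecast_draws, dtype=float)
    half = x.size // 2
    y1, y2 = x[:half], x[half:2 * half]   # two disjoint halves stand in for Y and Y'
    return np.mean(np.abs(x - y)) - 0.5 * np.mean(np.abs(y1 - y2))

def crps_std_normal(y):
    # Closed form for a standard normal forecast, used only as a sanity check
    return y * (2.0 * norm.cdf(y) - 1.0) + 2.0 * norm.pdf(y) - 1.0 / np.sqrt(np.pi)

rng = np.random.default_rng(0)
draws = rng.standard_normal(100_000)
print(crps_kernel_mc(draws, 1.2), crps_std_normal(1.2))   # the two values should agree closely
```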

Fig. 1

CRPS schematic. We use \(F=\Phi \), the standard normal distribution function, and \(y=1.2\) to illustrate the concept of the CRPS. The forecast distribution F is penalized for the shaded areas to the left and right of the realized value y through \({\int _{-\infty }^{y}F(z)^{2}\,\text{ d }z}\) and \({\int _{y}^{\infty }(1-F(z))^{2}\,\text{ d }z}\), respectively. A low score suggests high sharpness of the distribution forecast around the realization.

For distributions with finite first moment, the CRPS is strictly proper. Thus, the true distribution receives the lowest expected CRPS and is preferred to any other probabilistic forecast. Emphasizing specific parts of the distribution through the choice of the quantile or threshold weight function is simple, since any non-negative functions \(\nu \) and u can be used, provided that the integrals in Eqs. (2) and (3) converge. Table 1 lists the weight functions proposed by Amisano and Giacomini (2007) that we use in our analysis.

Table 1 Possible weight functions for CRPS
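Table 1 itself is not reproduced here. For illustration only, the snippet below encodes the threshold weight functions commonly attributed to Amisano and Giacomini (2007) together with the quantile-weight analogues of Gneiting and Ranjan (2011); these are the standard choices in the literature and are assumed, not guaranteed, to match the entries of Table 1.

```python
import numpy as np
from scipy.stats import norm

# Threshold weights u(z) (Amisano and Giacomini 2007) and quantile weights
# nu(alpha) (Gneiting and Ranjan 2011), as usually stated in the literature.
threshold_weights = {
    "uniform":    lambda z: np.ones_like(z),
    "centre":     lambda z: norm.pdf(z),
    "tails":      lambda z: 1.0 - norm.pdf(z) / norm.pdf(0.0),
    "right tail": lambda z: norm.cdf(z),
    "left tail":  lambda z: 1.0 - norm.cdf(z),
}
quantile_weights = {
    "uniform":    lambda a: np.ones_like(a),
    "centre":     lambda a: a * (1.0 - a),
    "tails":      lambda a: (2.0 * a - 1.0) ** 2,
    "right tail": lambda a: a ** 2,
    "left tail":  lambda a: (1.0 - a) ** 2,
}

z = np.linspace(-3.0, 3.0, 7)
print(threshold_weights["tails"](z).round(2))
```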

2.2 Energy score

The energy score is a popular multivariate strictly proper score introduced by Gneiting and Raftery (2007) which generalizes the kernel representation of the CRPS in Eq. (4). Let \({\mathbf {y}}=(y_{1},\ldots ,y_{d})'\) be an observation of the random vector \({\mathbf {Y}}\) and let F be a forecast of the distribution of \({\mathbf {Y}}\) such that \({\mathbb {E}}_{F}(\Vert {\mathbf {Y}}\Vert ^{\beta })\) is finite. The energy score is then defined as

$$\begin{aligned} \text{ ES}_{\beta }(F,{\mathbf {y}})={\mathbb {E}}_{F}\left( \Vert {\mathbf {Y}}-{\mathbf {y}}\Vert ^{\beta }\right) -\frac{1}{2}{\mathbb {E}}_{F}\left( \Vert {\mathbf {Y}}-{\mathbf {Y}}'\Vert ^{\beta }\right) , \end{aligned}$$

where \({\mathbf {Y}}\) and \({\mathbf {Y}}'\) are independent random vectors with distribution F.

Székely (2003) shows that the energy score with \(\beta \in (0,2)\) is strictly proper, while Gneiting and Raftery (2007) provide an alternative general proof. In practice, \(\beta =1\) has been a common choice; in the univariate case with \(\beta =1\), the energy score reduces to the CRPS. Closed-form expressions of the energy score are generally unavailable, which means that computations are done through Monte Carlo methods.
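Since closed-form expressions are rarely available, the energy score is typically estimated from a sample drawn from the forecast distribution. The sketch below is one minimal way to do this for \(\beta =1\); the function name, the use of two disjoint halves of the sample to represent \({\mathbf {Y}}\) and \({\mathbf {Y}}'\), and the Gaussian example are illustrative choices, not the paper's implementation.

```python
import numpy as np

def energy_score(forecast_draws, y, beta=1.0):
    """Monte Carlo estimate of ES_beta(F, y) = E||Y - y||^beta - 0.5 * E||Y - Y'||^beta,
    where the rows of `forecast_draws` are independent draws from the forecast F."""
    x = np.asarray(forecast_draws, dtype=float)      # shape (n_draws, d)
    y = np.asarray(y, dtype=float)                   # shape (d,)
    term1 = np.mean(np.linalg.norm(x - y, axis=1) ** beta)
    half = x.shape[0] // 2
    x1, x2 = x[:half], x[half:2 * half]              # two disjoint halves stand in for Y, Y'
    term2 = np.mean(np.linalg.norm(x1 - x2, axis=1) ** beta)
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(8), np.eye(8), size=5000)
print(energy_score(draws, np.zeros(8)))
```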

Despite its popularity, this score has been criticized for being insensitive to misspecification of the dependency structure (Pinson and Girard 2012; Pinson and Tastu 2013) and for being unable to distinguish a good representation of the predictive distribution from a very sparse one (Scheuerer and Hamill 2015).

2.3 Variogram score

The variogram score proposed by Scheuerer and Hamill (2015) is based on the concept of variograms from geostatistics. Similar to diagnostic methods by Hamill (2001) and Feldmann et al. (2015), the score examines pairwise element differences of the variables \(y_{i}\) of \({\mathbf {y}}\). The variogram score of order p is defined as

$$\begin{aligned} \text{ VS}_{p}(F,{\mathbf {y}})=\sum _{i=1}^{d}\sum _{j=1}^{d}\left( |y_{i}-y_{j}|^{p}-{\mathbb {E}}_{F}\left( |Y_{i}-Y_{j}|^{p}\right) \right) ^{2}, \end{aligned}$$

where \(Y_{i}\) and \(Y_{j}\) are the i-th and j-th components of a random vector \({\mathbf {Y}}\) with distribution F. Intuitively, the score makes use of the variogram of order p,

$$\begin{aligned} \gamma _{p}(i,j)=\frac{1}{2}{\mathbb {E}}\left( |Y_{i}-Y_{j}|^{p}\right) , \end{aligned}$$

which quantifies the degree of spatial dependence of a stochastic process. The pairwise comparisons measure how closely the deviations observed between components match the corresponding expectations under the forecast distribution.

Scheuerer and Hamill (2015) show that the score is proper relative to the class of distributions for which the 2p-th moments of all elements are finite. The variogram score is not strictly proper because it depends only on the p-th absolute moment of the distribution of the element differences. Therefore, it cannot distinguish between distributions whose element differences share the same p-th absolute moments but deviate in other moments.

In practice, the choice of p depends on the forecasted distribution and should generally be large enough to consider all relevant moments of the pairwise deviations but not too large to overly emphasize outliers through the exponentiation. The values \(p=0.5,1,2\) are often suggested, and Fig. 2 shows the effect of p by illustrating the observed variogram \(|y_{i}-y_{j}|^{p}\) of different popular orders relative to changes in \(|y_{i}-y_{j}|\). It is clearly visible that the magnitude of the effect depends heavily on the value of \(|y_{i}-y_{j}|\) with the absolute slope varying between 0 and 3 in the depicted domain \((-1.5,1.5)\). The sensitivity of the observed variogram in a neighbourhood of zero deviation is strongest for \(p=0.5\) and very weak for \(p=2\). This order reverses for \(|y_{i}-y_{j}|>(1/4)^{2/3}\). We expect the parameter \(p=0.5\) to be more sensitive to similar \(y_{i}\) and \(y_{j}\) while \(p=2\) will respond more to cases where \(|y_{i}-y_{j}|\) is expected to be large. The choice \(p=1\) should be able to strike a balance between these two cases.
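For completeness, a minimal Monte Carlo sketch of the unweighted variogram score follows, evaluated for the three orders discussed above. The eight-dimensional Gaussian example and the function name are illustrative only.

```python
import numpy as np

def variogram_score(forecast_draws, y, p=0.5):
    """Monte Carlo estimate of the unweighted variogram score VS_p(F, y).
    `forecast_draws` has shape (n_draws, d); `y` has shape (d,)."""
    x = np.asarray(forecast_draws, dtype=float)
    y = np.asarray(y, dtype=float)
    # observed variogram |y_i - y_j|^p for all pairs (i, j)
    obs = np.abs(y[:, None] - y[None, :]) ** p
    # expected variogram E_F|Y_i - Y_j|^p, estimated from the forecast draws
    exp_var = np.mean(np.abs(x[:, :, None] - x[:, None, :]) ** p, axis=0)
    return np.sum((obs - exp_var) ** 2)

rng = np.random.default_rng(1)
draws = rng.multivariate_normal(np.zeros(8), np.eye(8), size=5000)
obs = rng.standard_normal(8)
for p in (0.5, 1.0, 2.0):
    print(p, variogram_score(draws, obs, p=p))
```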

Fig. 2

Variogram observation of various orders. The figure shows the effect of the variogram order depending on the observed absolute difference \(|y_{i}-y_{j}|\). Slight deviations in \(|y_{i}-y_{j}|\) affect the observed variogram \(|y_{i}-y_{j}|^{p}\) differently, depending on its order p

As with the energy score, the encapsulation of the information into a single score leads to a loss of information. However, empirical applications support the sensitivity of the score to flawed forecasts, especially regarding the dependency structure (Scheuerer and Hamill 2015). Since the score is based on pairwise deviations, any bias that is the same for all components of the forecast cancels out and is therefore undetectable. This further motivates the good practice of using multiple proper scores for the evaluation of multivariate distribution forecasts.

3 Models for predicting multivariate distributions

While most prior studies of this type have only considered static parametric distributions with fixed parameters as data generating process (DGP) and parameter variations for the misspecified models, we aim here to instead select from a broad range of realistic models for both the DGP and the misspecified models. There is a very long list of alternative static and dynamic models that could be selected for our study. We require a diverse selection but would like all of these to fit the historical datasets reasonably well, while differing in features such as whether the marginal distributions are static, whether the dependence structure is static, how much of the noise is filtered from the historical data,Footnote 3 and finally how much sampling error is generated by the length of the calibration window. In this section we summarize all the models to be used in our study, beginning with static models before moving to dynamic models.

3.1 Static models

Our simplest model assumes that the next period joint distribution of the variables can be well approximated by their joint historical distribution, i.e. the frequency distribution of contemporaneous observations on the random variables. The lack of complexity makes this approach particularly popular with practitioners. In a survey by Pérignon and Smith (2010) of risk models used by commercial banks, almost three-quarters used such historical simulation techniques. Moreover, Danielsson et al. (2016) show that it remains unclear whether dynamic time-series models can outperform this so-called ‘historical’ approach, at least for predicting quantiles. We therefore include models that use historical data to build an empirical distribution function (EDF) for each marginal and apply the Gaussian copula for the dependence structure. More sophisticated copulas could be selected but we only apply the Gaussian copula for consistency with the other models employed. We use EDF\(_n^C\) to denote the empirical marginal distribution with Gaussian copula, calibrated to a sample size n.
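A minimal sketch of the EDF\(_n^C\) construction is given below: ranks of the historical observations are mapped to normal scores, a Gaussian-copula correlation matrix is estimated from them, and simulations are mapped back through the empirical quantiles. The helper name, the pseudo-observation convention and the synthetic data are illustrative choices rather than the authors' code.

```python
import numpy as np
from scipy.stats import norm

def simulate_edf_gaussian_copula(returns, n_sims=5000, seed=0):
    """Sketch of an EDF-marginal / Gaussian-copula simulation.
    `returns` is a (n_obs, d) array of historical observations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(returns, dtype=float)
    n_obs, d = x.shape
    # 1. Normal scores of the ranks (pseudo-observations) for each margin
    ranks = x.argsort(axis=0).argsort(axis=0) + 1
    z = norm.ppf(ranks / (n_obs + 1.0))
    # 2. Gaussian copula correlation matrix estimated from the normal scores
    corr = np.corrcoef(z, rowvar=False)
    # 3. Simulate from the copula and map back through the empirical quantiles
    u = norm.cdf(rng.multivariate_normal(np.zeros(d), corr, size=n_sims))
    sims = np.column_stack([np.quantile(x[:, j], u[:, j]) for j in range(d)])
    return sims

# Synthetic data stand in for one of the eight-dimensional data sets
rng = np.random.default_rng(2)
hist = rng.standard_t(df=5, size=(250, 8))
print(simulate_edf_gaussian_copula(hist).shape)   # (5000, 8)
```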

The Factor Quantile (FQ) methodology developed by Alexander et al. (2021) uses quantile regression to capture the marginals. Each random variable in the system is regressed on a set of m common factors, which may be exogenously selected or determined endogenously. Here we use the latter approach, with latent factors \({\mathbf {x}}_m\) derived from a principal component analysis of all the variables in the system. Given a quantile partition \({\mathcal {Q}}\), common to each variable, we perform a set of quantile regressions for each variable, i.e. one for each quantile \(\tau \in {\mathcal {Q}}\). Then, conditional on a set of predetermined values (one for each of the latent factors) we compute the fitted quantiles of each variable, and apply a shape-preserving spline to the fitted quantiles over all \(\tau \in {\mathcal {Q}}\). This way we interpolate a conditional marginal distribution for each variable, all conditional on the same latent-factor values. Alexander et al. (2021) claim that such latent-factor FQ models are at least as successful for predicting distributions of financial asset returns as multivariate GARCH models. Furthermore, they have several advantages in that they are very quick to calibrate, more computationally robust than GARCH models, and they scale far more easily to high dimensions.

Here we confine our selection to two types of latent factor FQ models, each with a Gaussian copula capturing dependence, which are designated FQ-AL\(_{n}^{C}\) and FQ-AB\(_{n}^{C}\) where n is again the number of observations used to calibrate the model and C denotes the Gaussian copula. The difference between the AL and AB versions stems from the choice of factors \({\mathbf {x}}_m\): the AB version uses the first m components and the AL version uses the last m principal components. The components are ordered by size of eigenvalue, so that the variance explained by each component progressively decreases. The AL version simply separates relevant information (captured by the intercept) from the noise (captured by the last m components). In effect, it is a means of removing unwanted variation from the EDF. This may also be viewed as a latent-factor version of the Jensen (1968) ‘alpha’, hence the terminology ‘alpha-latent’ or AL. The AB version uses the first m eigenvectors as latent factors, but in this case each quantile forecast is associated with a large uncertainty. Therefore a variance reduction technique based on “bootstrap aggregation” or bagging is applied to reduce this uncertainty, via an algorithm proposed by Breiman (1996). We refer to Alexander et al. (2021) for further technical details.
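The following sketch outlines the latent-factor Factor Quantile idea under several simplifying assumptions: it uses the last m principal components (the AL flavour), conditions the quantile regressions on factor values of zero, omits the bagging step of the AB version, and interpolates the fitted quantiles with a shape-preserving PCHIP spline. All function names, the quantile partition and the synthetic data are our own illustrative choices and should not be read as the authors' implementation.

```python
import numpy as np
import statsmodels.api as sm
from scipy.interpolate import PchipInterpolator

def fq_marginal_quantiles(returns, m=2, taus=np.arange(0.05, 1.0, 0.05),
                          use_last_components=True):
    """Sketch of a latent-factor Factor Quantile calibration for one system:
    regress each variable on m principal components (last m for 'AL', first m
    for 'AB'), then interpolate the fitted quantiles with a monotone spline."""
    x = np.asarray(returns, dtype=float)
    n_obs, d = x.shape
    # Latent factors from PCA (components ordered by decreasing eigenvalue)
    _, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
    comps = vt[-m:] if use_last_components else vt[:m]
    factors = (x - x.mean(axis=0)) @ comps.T                 # (n_obs, m)
    exog = sm.add_constant(factors)
    # Condition on factor values of zero (i.e. keep only the intercept term)
    x_cond = sm.add_constant(np.zeros((1, m)), has_constant='add')
    marginals = []
    for j in range(d):
        fitted_q = []
        for tau in taus:
            fit = sm.QuantReg(x[:, j], exog).fit(q=tau)
            fitted_q.append(fit.predict(x_cond)[0])
        fitted_q = np.sort(fitted_q)          # enforce monotone quantiles
        # Shape-preserving spline: quantile level -> conditional quantile
        marginals.append(PchipInterpolator(taus, fitted_q))
    return marginals

rng = np.random.default_rng(3)
hist = rng.standard_t(df=6, size=(250, 8))
q_funcs = fq_marginal_quantiles(hist)
print(q_funcs[0](0.5))   # conditional median of the first variable
```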

3.2 Dynamic models

The most common class of dynamic model for an expected value is the family of autoregressive moving average (ARMA) time-series models—because they are flexible, have the temporal aggregation property, and allow both easy estimation and straightforward inference. However, for distribution forecasts we also require a parametric forecast of variance. To this end, we employ time-series models for conditional variance forecasts using the GARCH(p, q) model class, a generalization by Bollerslev (1986) of the autoregressive conditional heteroscedasticity (ARCH) model of Engle (1982). The GARCH(p, q) process assumes a time-varying conditional variance as follows:

$$\begin{aligned}&\sigma _{t}^{2} =\omega +\sum _{i=1}^{q}\alpha _{i}\varepsilon _{t-i}^{2}+\sum _{i=1}^{p}\beta _{i}\sigma _{t-i}^{2} \end{aligned}$$
(5)
$$\begin{aligned}&\varepsilon _{t}|{\mathcal {I}}_{t-1} \sim {\mathcal {N}}(0,\sigma _{t}^{2}), \end{aligned}$$
(6)

where \(\varepsilon _{t}\) is the market shock or innovation at time t and \({\mathcal {I}}_{t}\) is the information set containing all past returns up to t. The parameters \((\alpha _{i})_{i=1}^{q}\) and \((\beta _{i})_{i=1}^{p}\) measure the reaction of the conditional variance to market shocks and the persistence of conditional variance respectively.
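As a small illustration of Eqs. (5)-(6), the sketch below simulates a Gaussian GARCH(1,1) path; the parameter values are typical of daily financial returns but are illustrative rather than estimates from our data.

```python
import numpy as np

def simulate_garch_1_1(n, omega=1e-6, alpha=0.08, beta=0.90, seed=0):
    """Simulate a Gaussian GARCH(1,1) path following Eqs. (5)-(6)."""
    rng = np.random.default_rng(seed)
    eps = np.zeros(n)
    sigma2 = np.zeros(n)
    sigma2[0] = omega / (1.0 - alpha - beta)        # unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return eps, sigma2

eps, sigma2 = simulate_garch_1_1(2000)
print(eps.std(), np.sqrt(sigma2.mean()))
```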

Various extensions exist for both the conditional variance and conditional mean equations. We refer to Teräsvirta (2009) for a survey of popular GARCH specifications. Most notably, the exponential GARCH (E-GARCH) is an asymmetric extension by Nelson (1991) that specifies the conditional variance equation in terms of the log of the variance rather than the variance itself. Given the asymmetric response function

$$\begin{aligned} g(z_{t})=\theta z_{t}+\gamma \left( |z_{t}|-{\mathbb {E}}(|z_{t}|)\right) \end{aligned}$$
(7)

and \(z_{t}=\varepsilon _{t}/\sigma _{t}\), the conditional variance and mean equations are

$$\begin{aligned} \begin{aligned}&r_{t} =c+\sigma _{t}z_{t},\\&\log \left( \sigma _{t}^{2}\right) =\omega +\sum _{i=1}^{q}\alpha _{i}g(z_{t-i})+\sum _{i=1}^{p}\beta _{i}\log \left( \sigma _{t-i}^{2}\right) . \end{aligned} \end{aligned}$$
(8)

Positive and negative shocks can affect the variance differently, depending on the choices for \(\gamma \) and \(\theta \). This leverage effect is especially relevant in equity and commodity markets where the asymmetry is well documented. The function g accounts for only positive shocks for \(\theta =\gamma \), only negative shocks for \(\theta =-\gamma \) and can model a large variety of reactions in between.
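A corresponding sketch for Eqs. (7)-(8) simulates a Gaussian E-GARCH(1,1) path with an asymmetric response; for Gaussian \(z_{t}\), \({\mathbb {E}}(|z_{t}|)=\sqrt{2/\pi }\). Again, the parameter values are purely illustrative.

```python
import numpy as np

def simulate_egarch_1_1(n, c=0.0, omega=-0.1, alpha=0.1, beta=0.98,
                        theta=-0.05, gamma=0.1, seed=0):
    """Simulate a Gaussian E-GARCH(1,1) path following Eqs. (7)-(8)."""
    rng = np.random.default_rng(seed)
    e_abs_z = np.sqrt(2.0 / np.pi)                  # E|z| for Gaussian z
    z = rng.standard_normal(n)
    log_sigma2 = np.zeros(n)
    log_sigma2[0] = omega / (1.0 - beta)            # unconditional level
    r = np.zeros(n)
    for t in range(1, n):
        g = theta * z[t - 1] + gamma * (np.abs(z[t - 1]) - e_abs_z)   # Eq. (7)
        log_sigma2[t] = omega + alpha * g + beta * log_sigma2[t - 1]  # Eq. (8)
        r[t] = c + np.exp(0.5 * log_sigma2[t]) * z[t]
    return r, np.exp(log_sigma2)

r, sigma2 = simulate_egarch_1_1(2000)
print(r.std())
```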

These models have been particularly successful in modelling univariate time series of financial returns (Engle 2001) partly because they are designed to capture volatility clustering effects which are commonly observed in finance (Mandelbrot 1963). In the presence of such effects, volatility becomes time dependent and can exhibit prolonged periods of exceptionally high or low values.Footnote 4 Multivariate GARCH models are designed to capture time variation in conditional variances and conditional covariances. In multivariate analysis, clustering extends beyond volatilities to correlations and generalizations of univariate GARCH models also capture time varying conditional covariances and spillover of volatility between different assets. Bollerslev (1990) introduces the constant conditional correlation GARCH (CCC-GARCH) model where the conditional correlations are assumed to be time invariant. The covariance matrix is estimated as

$$\begin{aligned} \varvec{\varepsilon }_{t}&={\mathbf {V}}_{t}^{1/2}{\mathbf {z}}_{t},\\ {\mathbf {V}}_{t}&={\mathbf {D}}_{t}{\mathbf {C}}{\mathbf {D}}_{t},\\ {\mathbf {D}}_{t}&=\text{ diag }\left( {\mathbf {V}}_{t}\right) ^{1/2}, \end{aligned}$$

where \({\mathbf {V}}_{t}^{1/2}\) is any matrix satisfying \({\mathbf {V}}_{t}^{1/2}({\mathbf {V}}_{t}^{1/2})'={\mathbf {V}}_{t}\) (e.g. the Cholesky factor of \({\mathbf {V}}_{t}\)), \({\mathbf {z}}_{t}\) is the standardized residual vector with mean zero and identity covariance matrix, \({\mathbf {C}}\) is a constant correlation matrix and \({\mathbf {D}}_{t}\) is the diagonal matrix containing the time-varying individual volatilities. In practice, each volatility can be estimated by a univariate GARCH model while \({\mathbf {C}}\) can be specified as the sample correlation of the standardized residuals. The model is easy to estimate since the dependency and the volatilities are treated separately.

The assumption of constant correlation may seem too strong and is not fulfilled for many assets (Tsui and Yu 1999). Dynamic conditional correlation GARCH (DCC-GARCH) by Engle (2002) generalizes CCC-GARCH to account for time-varying correlations. The correlation is estimated directly from the residuals of the univariate models and adjusted depending on the co-movement of the returns. As such, the covariance matrix is given by

$$\begin{aligned} {\mathbf {V}}_{t}={\mathbf {D}}_{t}{\mathbf {C}}_{t}{\mathbf {D}}_{t},\qquad {\mathbf {D}}_{t}=\text{ diag }\left( {\mathbf {V}}_{t}\right) ^{1/2}, \end{aligned}$$

where the conditional correlation \({\mathbf {C}}_{t}\) is described by

$$\begin{aligned} \begin{aligned}{\mathbf {C}}_{t}&=\text{ diag }\left( {\mathbf {Q}}_{t}\right) ^{-1/2}\,{\mathbf {Q}}_{t}\,\text{ diag }\left( {\mathbf {Q}}_{t}\right) ^{-1/2},\\ {\mathbf {Q}}_{t}&=\left( 1-\sum _{m=1}^{M}\alpha _{m}-\sum _{n=1}^{N}\beta _{n}\right) \overline{{\mathbf {Q}}}+\sum _{m=1}^{M}\alpha _{m}\left( \varvec{\varepsilon }_{t-m}\varvec{\varepsilon }_{t-m}'\right) +\sum _{n=1}^{N}\beta _{n}{\mathbf {Q}}_{t-n}, \end{aligned} \end{aligned}$$
(9)

with

$$\begin{aligned} \overline{{\mathbf {Q}}}={\mathbb {E}}\left( \varvec{\varepsilon }_{t}\varvec{\varepsilon }_{t}'\right) . \end{aligned}$$

The transformation of \({\mathbf {Q}}_{t}\) to \({\mathbf {C}}_{t}\) guarantees a well-defined correlation matrix as long as \({\mathbf {Q}}_{t}\) is positive definite. Similar to CCC-GARCH, there are no restrictions on the choice of univariate GARCH models for the volatility. For both the CCC-GARCH and DCC-GARCH models, the distribution of \({\mathbf {z}}_{t}\) can be chosen as multivariate Gaussian, which leads to consistent parameter estimators even under distributional misspecification (Bauwens and Laurent 2005). It is also possible to consider non-normal distributions. For example, Bauwens and Laurent (2005) use multivariate skew distributions while Cajigas and Urga (2006) and Pelagatti (2004) apply the Laplace distribution and general elliptical distributions, respectively.
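The DCC(1,1) correlation recursion in Eq. (9) can be sketched as follows, taking standardized residuals from univariate GARCH models as given; the values of the parameters a and b and the synthetic residuals are illustrative.

```python
import numpy as np

def dcc_correlations(std_resid, a=0.03, b=0.95):
    """Sketch of the DCC(1,1) recursion in Eq. (9), applied to standardized
    residuals from univariate GARCH models. Returns the conditional
    correlation matrices C_t for every date."""
    z = np.asarray(std_resid, dtype=float)          # shape (T, d)
    T, d = z.shape
    q_bar = np.cov(z, rowvar=False)                 # unconditional target Q-bar
    q = q_bar.copy()
    corrs = np.empty((T, d, d))
    for t in range(T):
        if t > 0:
            q = (1 - a - b) * q_bar + a * np.outer(z[t - 1], z[t - 1]) + b * q
        scale = np.diag(1.0 / np.sqrt(np.diag(q)))
        corrs[t] = scale @ q @ scale                # C_t from Q_t
    return corrs

rng = np.random.default_rng(4)
z = rng.standard_normal((2000, 8))
C = dcc_correlations(z)
print(C[-1].round(2))
```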

We replace the univariate Gaussian GARCH(1,1) marginals in the CCC- and DCC-GARCH models with Student-t E-GARCH(1,1). This choice is motivated by Hansen and Lunde (2005) who provide an extensive comparison of 330 univariate GARCH specifications through the Hansen (2005) superior predictive ability data-snooping check, concluding that it is hard to beat an asymmetric GARCH(1,1) model with Student-t innovations.

This paper focuses on the performance of proper scoring rules, and not on the models themselves. However, the simulation study described in the next section is extremely time consuming—even with the powerful computers of today. For this reason, we need to limit the number of models selected. Although there are several alternative multivariate GARCH models, we limit our analysis to CCC-GARCH(1,1) and DCC-GARCH(1,1), and for the marginal distributions we use Student-t E-GARCH(1,1). For simplicity, we still use CCC-GARCH(1,1) and DCC-GARCH(1,1) to denote these asymmetric multivariate GARCH models with Student-t E-GARCH(1,1) marginals – see Table 4 below. Finally, we note that the choice of lags \(p=q=1\) is standard in the literature. Further lags could be considered, of course, but it is typically found that more lags lead to little, if any, improvement in forecasting accuracy, at least for the type of financial data that we consider here. There is a considerable amount of research supporting this conclusion, the most comprehensive of which is Hansen and Lunde (2005).

4 Simulation study

Our discussion in Sect. 2 introduced several univariate and multivariate proper scoring rules that are prevalent in the literature. While it is agreed that these offer a sound way to quantify the accuracy of probabilistic forecasts (Winkler 1996; Gneiting and Ranjan 2011), the question of which score to use remains largely open (Gneiting and Raftery 2007). Conventional wisdom dictates applying a scoring rule suited to the application at hand (Machete 2013), but this only provides a few requirements and does not restrict the selection sufficiently.Footnote 5

Our aim is to analyse the ability of different scoring rules to discriminate between true and misspecified distributions. Improving on previous studies, we employ a realistic simulation setting that approximates the conditions of practical applications in finance. This is reflected by our simulation design, which calibrates the models described in the previous section to daily USD-denominated exchange rates from 1999 to 2018; US interest rates from 1994 to 2018; and Bloomberg investable commodity indices from 1991 to 2018.

In the following: Sect. 4.1 describes the data we use to calibrate the models used in the simulation study. Sect. 4.2 motivates our choice of models. Sect. 4.3 describes the design of the simulation study. All results are discussed in Sect. 5.

4.1 Data description

We obtain three eight-dimensional time series of daily frequency: the first is on USD-denominated exchange rates, the second is on US interest rates and the third is on Bloomberg investable commodity indices. We obtain the daily exchange rates and commodity index values from Thomson Reuters Datastream and the interest rates data from the US Treasury website. All time series end on 30 June 2018 but the start date varies with data availability. The different start dates should have no major impact on our conclusions because we consider an extremely long history for both in-sample and out-of-sample data. Within each set we have selected variables to broadly represent the asset class. The exchange rates are those with the highest trading volume excluding the Chinese Renminbi, which was pegged to the USD until recently (Bank of International Settlements 2016). Our data starts in January 1999 with the introduction of the Euro as an accounting currency. The interest rates span the term structure of US Treasury bonds from 6 months to 20 years. Maturities of 1 month, 2 months, 3 months and 30 years are also available, but these series are missing data for extended periods of time and are therefore excluded. Our data starts in January 1994 after the 20-year maturity interest rate became available in October 1993. The commodity indices are chosen to reflect the most liquid commodities with the highest USD-weighted production value and are diversified to represent the energy, grains, industrial/precious metals, softs and livestock sectors (Bloomberg 2017). The Bloomberg commodity indices were launched in 1998 with a backward projection to January 1991. We include all available data in our study. We summarize the total sample period and the starting date of our out-of-sample evaluation in Table 2.

Table 2 Sample for each data set

Summary statistics of the data are listed in Table 3 for monthly returns of exchange rates and commodities, and for monthly changes in interest rates. The sample mean of all monthly returns and changes is near zero, which allows us to apply the principal component representation of Factor Quantile models without prior transformations. Furthermore, all assets are leptokurtic and require heavy-tailed distributions.

Table 3 Summary statistics of the monthly returns/changes

To examine the robustness of our analysis, we segment the data into three parts, covering 2006–2010, 2010–2014, and 2014–2018 with breakpoints at the end of June in each case. Because the exchange rate data starts much later than the other data sets, the period from June 2006 to February 2007 is still used for calibration. Therefore, the first sub-period begins in March 2007 in this case.

4.2 Model selection

Although there have been a few studies explicitly comparing scoring rules (as reviewed in Sect. 1), the models chosen and the corresponding discussions generally fail to put enough emphasis on the distinct features of financial data, such as time-varying volatility, covariance and leptokurtic conditional distributions. Pinson and Tastu (2013) restrict themselves to bivariate Gaussian distributions with different means, variances and correlations, and do not consider time series in their study. Similarly Scheuerer and Hamill (2015) only consider simple distributions such as Gaussian and Poisson distributions.Footnote 6 Ziel and Berk (2019) conduct a time series empirical study but they only consider AR models for the mean with no dynamic volatility structure. In addition, their data is essentially a univariate time series, where observations on multiple days are viewed jointly as multivariate distributions. This is inherently different from standard financial applications, where synchronised financial data are viewed as a multivariate system. Diks and Fang (2020) do consider financial applications, in which they use the univariate and multivariate GARCH models to capture the dynamic variance and covariance structure. However, as we explained in Sect. 1, their focus is not to compare scoring rules.

Most of the empirical literature on financial forecasting accuracy concerns univariate time series and, moreover, considers only point forecasts. Yet, even in this restricted context, it remains unclear whether a dynamic model structure is always preferable to a static one which assumes that the next period distribution of variables can be well approximated by the historical distribution. Furthermore, multivariate models can differ both according to the specification of the marginals and their dependence structure. As described in Sect. 3, a static marginal distribution is commonly represented by an empirical distribution function, while we also consider models in the Factor Quantile class. The dynamic models are instead from the GARCH class. Regarding dependence, all but one of our models assumes a static dependence structure (typically the Gaussian copula)—only the DCC-GARCH models extend this assumption to dynamic conditional correlation.

Our simulation study selects a DGP as the true distribution and assesses the ability of different proper scoring rules to detect model misspecification. To reduce dependence on our choice of DGP we employ a diverse selection of models for multivariate distributions that can be used as either the DGP or a misspecified model. Also, for our results to be relevant for practical applications we calibrate these models to a very comprehensive selection of real-world data. However, even though the models are calibrated to actual data, it is important to emphasize that we do not seek to compare the forecasting performance of different models. Instead, our focus is on a comparison of the ability of different proper scoring rules to detect model misspecification. Nonetheless, we do choose models which have been shown in Alexander et al. (2021) to have similar forecasting abilities in typical financial datasets, thereby posing a greater challenge to scoring rules in terms of identifying the DGP and discriminating between competing models. As compared with other studies, we employ a more realistic setting with models that are better representations of financial data, we calibrate these models to actual data and our results are much more comprehensive.

We employ sample sizes of \(n=250\) and \(n=2000\) for our results based on static models. However, the dynamic multivariate GARCH models cannot be calibrated using \(n=250\) because the optimization of their complex likelihood function fails to converge most of the time. Hence, their parameters are estimated using \(n=2000\) only. The FQ models are much easier to calibrate on small samples, so we present results for these models based on both \(n=250\) and \(n=2000\).Footnote 7 Because the EDF and FQ models are calibrated on two different sample sizes we end up with eight competing models in total: (i) two FQ-AL models, (ii) two FQ-AB models, (iii) two EDF models and (iv) two multivariate GARCH models. This way, each model has one associated model that is similar but differs either in the calibration length or the conditional covariance structure. Table 4 summarizes the models employed.

Table 4 Summary of models used in the simulation study

Our motivation for selecting the models described here is to use a small but representative selection from the static and dynamic modelling paradigms that are commonly used for multivariate financial systems. Expanding the model set beyond the eight models described in this section may hinder rather than help our aim. Our selection of models strikes a balance between the need to represent the salient qualities of real data adequately and a sufficiently simple structure that calibration difficulties will not interfere with the very large scale of our simulation study.

4.3 Simulation design

Our simulation study quantifies and compares the ability of the energy score and the variogram score with \(p=0.5,1,2\) to distinguish the correct DGP from misspecified models. These scoring rules were defined in Sect. 2 and the selected values of p, which have also been used by Scheuerer and Hamill (2015), are considered typical choices (Jordan et al. 2019).

The models described above are calibrated on the systems of daily, eight-dimensional USD-denominated exchange rates, interest rates and Bloomberg investable commodity indices that we discussed in Sect. 4.1. The parameters for each model are estimated on an initial sample size n, then the calibrated model is used to generate a one-day-ahead multivariate distribution forecast, then the sample is rolled forward keeping the sample size fixed, and then the calibration and forecasting is repeated until the data set is exhausted. Then we move to another data set and repeat, until we have a continuous set of daily, rolling distribution forecasts, for each model and for each of the three data sets.

Our aim is not to assess the forecasting performance of different models. It is to assess the ability of a scoring rule to distinguish the true DGP from misspecified models, when used for multivariate distribution forecasting. Our simulation setting controls the DGP so that we know the true distribution every time we evaluate the forecasts and the other seven models are misspecified. For the forecasts made at time t all the models are calibrated using the n most recent observations up to time t, as previously mentioned. In addition, we select one of these models to generate the observations for time \(t+1\) that are used to analyse the performance of scoring rules. Put another way, realisations are simulated from just one of these models—i.e. the one that is designated as the DGP. The reason why we have selected such diverse models is that we seek to reduce the dependence of our results on the choice of a specific DGP. Therefore, we rotate the choice of DGP to span all eight models.

The simulation algorithm is summarized as follows. First, fix a proper scoring rule s and a model \(m^*\) that is designated as the true DGP. This rule s will quantify each of the eight models’ distribution forecasts, for each data set, as follows:

  1. Stage 1

    Calibrate a model using data ending at time t and forecast a multivariate distribution for time \(t+1\), i.e. one-day-ahead. Repeat this eight times, once for each model in Table 4;

  2. Stage 2

    Draw 5000 observations from the distribution forecasted by model \(m^*\). These observations are realisations from the true model at time \(t+1\);

  3. Stage 3

    Given a distribution forecast for time \(t+1\) by model m, apply s to each of the 5000 realisations generated in Stage 2, to quantify the performance of model m. Repeating this for all models in Table 4 gives 5000 scores for each of the eight models at time t including the designated DGP model \(m^*\);

  4. Stage 4

    Roll t forward by one quarter of a year and repeat stages 1, 2 and 3 until the entire sample for that data set is exhausted.

The four stages outlined above are then repeated eight times, so that each of the models in Table 4 is selected as the designated DGP \(m^*\). Then we change the scoring rule s and repeat the four stages eight times, so that again each model can be the designated DGP \(m^*\). We do this four times, so that s spans all four scoring rules and every time we change the scoring rule we repeat stages 1–4 eight times, so that each of the eight models can be the designated DGP. Finally, we change the data set and repeat the entire process again.
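The nested structure of the experiment can be summarized by the following skeleton, in which the calibration, sampling and scoring steps of Stages 1-3 are reduced to trivial stubs; the model and scoring-rule labels mirror Table 4 and Sect. 2, but every function here is a hypothetical stand-in rather than our actual code.

```python
import numpy as np

rng = np.random.default_rng(5)
MODELS = ["FQ-AL_250", "FQ-AL_2000", "FQ-AB_250", "FQ-AB_2000",
          "EDF_250", "EDF_2000", "CCC-GARCH", "DCC-GARCH"]
SCORING_RULES = ["energy", "vs_0.5", "vs_1", "vs_2"]

def calibrate_and_forecast(model, data, t):   # Stage 1 stub
    return {"model": model, "t": t}

def sample_from(forecast, n=5000, d=8):       # Stage 2 stub
    return rng.standard_normal((n, d))

def score(rule, forecast, realisations):      # Stage 3 stub
    return rng.random(len(realisations))

def run_experiment(data, quarter_starts):
    results = {}
    for rule in SCORING_RULES:
        for dgp in MODELS:                    # designated DGP m*
            for t in quarter_starts:          # Stage 4: roll forward quarterly
                forecasts = {m: calibrate_and_forecast(m, data, t) for m in MODELS}
                realisations = sample_from(forecasts[dgp])            # Stage 2
                results[(rule, dgp, t)] = {m: score(rule, forecasts[m], realisations)
                                           for m in MODELS}           # Stage 3
    return results

res = run_experiment(data=None, quarter_starts=range(4))
print(len(res))   # number of (rule, DGP, date) combinations
```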

We evaluate the scoring rules at the first date of each quarter in our evaluation period. This yields new simulations (each with eight dimensions) on 50, 66, and 82 dates in USD-denominated exchange rates, US interest rates and Bloomberg investable commodity indices, respectively. Since we have eight possible DGPs, this leads to approximately 1600 applications of the simulation study above for each of the four multivariate scoring rules.

This setting gives us a very detailed view of the discrimination ability of each scoring rule over time (i.e. under different market conditions) and for various choices of the DGP. Note that our simulation design provides optimistic conditions for the scoring rules since the distribution of the DGP is known for certain, and moreover a very large number of realizations is sampled at each time t.Footnote 8 In practice, we only observe one realization and therefore must consider the scores over a long period—and the longer the sample used the less likely the DGP is to be stationary. To approximate a more realistic setting, in which the length of the out-of-sample period is restricted due to lack of data, we shall also compare the scoring rules on smaller sub-samples, with only 100 realisations instead of 5000. Recall that all models use historical information to forecast their joint distribution, but in our setting they are evaluated using samples from the chosen DGP, rather than realisations of the original time series. The models are only punished if their forecast deviates from that of the true DGP, and a good scoring rule should be able to distinguish misspecified models from the true distribution.

4.4 Discrimination metrics

We shall generalize the approach of Pinson and Tastu (2013) to compare the discrimination ability of several scoring rules and introduce a new metric, the error rate, as an additional measure of the sensitivity of scoring rules. Suppose there are M models in the study and that scores are based on N realisations at time t. The \(N \times 1\) score vector assigned by scoring rule s to model m at time t, when model \(m^{*}\) is the DGP, is defined as:

$$\begin{aligned} {\mathbf {S}}_{t}^{s}(m,m^{*})=\left( S_{1,t}^{s}(m,m^{*}),\ldots ,S_{N,t}^{s}(m,m^{*})\right) '. \end{aligned}$$
(10)

We employ several metrics aimed at summarizing the properties of (10). First, and following Scheuerer and Hamill (2015), the simplest metric we employ is the mean relative score:

$$\begin{aligned} \frac{1}{N}\sum _{i=1}^{N}\left( S_{it}^{s}\left( m,m^{*}\right) /S_{it}^{s}\left( m^{*},m^{*}\right) \right) . \end{aligned}$$
(11)

However, by construction, the mean relative score may yield erroneous rankings when N is small. For some realisations, the lowest score may be assigned to a model that is not the DGP. We study this probability in our simulation study by introducing a new ‘error rate’ metric and also by analysing the distribution of the differences between the scores assigned to each misspecified model and those assigned to the DGP:

$$\begin{aligned} {\mathbf {S}}_{t}^{s}\left( m,m^{*}\right) -{\mathbf {S}}_{t}^{s}\left( m^{*},m^{*}\right) . \end{aligned}$$
(12)

As an additional measure for the discrimination ability, we consider a simple heuristic that examines the relative distance of the scores between the models. This approach has been suggested by Pinson and Tastu (2013) who compare the sensitivity of the energy score to various misspecifications. Given

$$\begin{aligned} \overline{{\mathbf {S}}}_{t}^{s}\left( m,m^{*}\right) {:}{=}\frac{1}{N}\sum _{i=1}^{N}S_{it}^{s}\left( m,m^{*}\right) , \end{aligned}$$

they utilize a true Gaussian distribution and measure the sensitivity pairwise through

$$\begin{aligned} \frac{\overline{{\mathbf {S}}}_{t}^{s}\left( m,m^{*}\right) -\overline{{\mathbf {S}}}_{t}^{s}\left( m^{*},m^{*}\right) }{\overline{{\mathbf {S}}}_{t}^{s}\left( m^{*},m^{*}\right) } \end{aligned}$$

with misspecified models that deviate from the true distribution in only one aspect, e.g. the mean or the variance. For our purposes we need to adjust their measure to consider the discrimination ability of several scoring rules across multiple misspecified models. To this end, we propose a generalized discrimination heuristic that is defined as

$$\begin{aligned} d_{t}^{s}(m^{*})=\frac{1}{M}\sum _{m=1}^{M}\frac{\overline{{\mathbf {S}}}_{t}^{s}\left( m,m^{*}\right) }{\overline{{\mathbf {S}}}_{t}^{s}\left( m^{*},m^{*}\right) }. \end{aligned}$$
(13)

A large value for this generalized discrimination metric may indicate that the ranking of the scores is reliable and robust to differences in the simulation sample size N.
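A compact sketch of how the three metrics of this subsection can be computed from a matrix of scores is given below; the toy score matrix is synthetic and serves only to show the expected shapes of the outputs.

```python
import numpy as np

def discrimination_metrics(scores, dgp_index):
    """Sketch of the metrics in Sect. 4.4 for one scoring rule and one date t.
    `scores` has shape (M, N): the score vectors S_t^s(m, m*) for the M models,
    each based on N realisations; `dgp_index` identifies the DGP m*."""
    scores = np.asarray(scores, dtype=float)
    s_dgp = scores[dgp_index]                                 # S_t^s(m*, m*)
    mean_relative = np.mean(scores / s_dgp, axis=1)           # Eq. (11), one per model
    diffs = scores - s_dgp                                    # Eq. (12)
    mask = np.arange(scores.shape[0]) != dgp_index
    # error rate: share of realisations in which a misspecified model
    # receives a lower score than the DGP, one value per misspecified model
    error_rates = np.mean(diffs[mask] < 0, axis=1)
    mean_scores = scores.mean(axis=1)
    discrimination = np.mean(mean_scores / mean_scores[dgp_index])   # Eq. (13)
    return mean_relative, error_rates, discrimination

rng = np.random.default_rng(6)
toy_scores = (rng.gamma(shape=2.0, scale=1.0, size=(8, 5000))
              + 0.2 * np.arange(8)[:, None])   # model 0 plays the role of the DGP
print(discrimination_metrics(toy_scores, dgp_index=0))
```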

5 Simulation results

We analyse the energy score and three parametrisations of the variogram score with respect to their ability to identify the DGP under the simulation design described in the previous section. As discussed in Sect. 4.3, we use \(N=5000\) and \(M=8\). To examine the discrimination ability of each scoring rule at time t, we apply the entire sample of 5000 scores but also smaller sub-samples with \(N = 100\) scores that reflect more realistic conditions. These correspond to out-of-sample evaluations with 5000 and 100 observations respectively. Each score assesses the closeness of the distribution forecasts to the distribution of the DGP, rather than indicating real forecasting accuracy. Because we impose a distribution through the DGP, a model’s scores only measure how similar the model is to the DGP, according to that score.

We begin our results in Sect. 5.1 with the mean relative score (11), focusing on exchange rate returns and a DCC-GARCH as DGP. Then, in Sect. 5.2 we generalize these results to multiple DGPs and data sets by using the error rate to measure the percentage of cases in which a misspecified model receives a lower (i.e. better) score than the DGP. Further, we analyse the deviation between the scores of misspecified models and that of the DGP, derived from (12). Finally, in Sect. 5.3, we illustrate the distribution of these deviations and use our generalization (13) of the discrimination heuristic proposed by Pinson and Tastu (2013) that allows us to compare multiple scoring rules.

5.1 Mean relative score

We begin the analysis of the multivariate scoring rules with a comparison of their mean relative score (11) for each model. Figure 3 uses a DCC-GARCH as DGP for exchange rate returns for four selected models, so the DGP (\(m^*\)) in this case is DCC-GARCH. The mean relative score corresponds to the average of the ratio between the score of the misspecified model and that of the DGP. The shaded areas cover everything between the 0.25- and 0.75-quantiles for the sample mean based on a sample size of 100 instead of 5000. These confidence intervals are generated through a statistical bootstrap with 5000 repetitions. We limit the illustration to four models only for clarity, but the results are comparable when other misspecified models, DGPs or data sets are considered.

Fig. 3

Average scores relative to score of DGP (USD exchange rates). The figure illustrates the mean relative score in Eq. (11) based on 5000 scores. A value larger than 1 means that the scoring rule is on average able to distinguish between the misspecified model and DCC-GARCH to identify the true DGP. We generate a confidence interval covering the area between the 0.25- and 0.75-quantiles of the sample mean based on a sample size of 100 through bootstrap with 5000 repetitions

The results based on the sample mean of 5000 scores indicate that all four scoring rules manage to evaluate the models successfully. Due to propriety, they assign the lowest expectation to the DGP, which is why almost none of the mean relative scores fall below 1 in Fig. 3. Further, CCC-GARCH almost always obtains the lowest score among all misspecified models, as expected given its similarity to DCC-GARCH. The scores can distinguish distributions which differ only in their marginals and they are able to identify FQ-AL\(_{250}^{C}\) as one of the misspecified models with great confidence. In contrast, the difference between FQ-AB\(_{250}^{C}\) and EDF\(_{250}^{C}\) is less pronounced, which means that they produce predictions which deviate from the distribution forecast of DCC-GARCH by a similar amount.

Both the variogram score with \(p=0.5\) and \(p=1\) show clear and robust rankings between the misspecified models and distinguish them well from the DGP. The discrimination ability is weaker for the energy score. As pointed out by Pinson and Tastu (2013) and Scheuerer and Hamill (2015), the energy score changes only by a small amount between the DGP and other models. This is evident in Fig. 3 as well, where the average score of the worst model is only 25% larger than that of the DGP. In comparison, the variogram scores with \(p=0.5\) and \(p=1\) assign average scores over 200% and 100% larger than that of the DGP, respectively. Unlike the other scoring rules, the variogram score with \(p=2\) changes the rankings at several points in time and is also the only scoring rule which makes wrong inferences even with a large sample size of 5000 scores. For instance, FQ-AB\(_{250}^{C}\) is preferred over the DGP around the end of 2017. Hence, the energy score and the variogram scores with \(p=0.5\) and \(p=1\) may be preferable to the variogram score with \(p=2\).

However, there are vast differences in the discrimination abilities of the scoring rules, which can lead to wrong inferences with smaller sample sizes:

  1.

    Despite the overall success of the variogram score with \(p=0.5\) and \(p=1\), wrong inferences may occur with only 100 samples. The shaded areas of CCC-GARCH dip below 1 frequently, which means that a slightly misspecified model may be chosen over the DGP.

  2.

    The energy score suffers from misidentification risk to a much larger extent. Besides CCC-GARCH, FQ-AB\(_{250}^{C}\) and EDF\(_{250}^{C}\) are also at times assigned lower scores than the DGP in 2010, 2013, 2016 and 2017 with the smaller sample size. Overall though, the energy score still manages to produce a clear ranking that is mostly accurate.

  3.

    The variogram score with \(p=2\) largely fails to yield any meaningful results with the smaller sample size. The rankings can change considerably, and all models obtain a lower sample mean than the DGP at various times. Even FQ-AL\(_{250}^{C}\), which is regarded as the worst model by all other scoring rules, has lower scores than DCC-GARCH around 2016. Additionally, the variogram score with \(p=2\) may assign scores of very large magnitude that greatly affect the sample mean. This is visible in Fig. 3 in two aspects: (i) the scoring rule has wide confidence intervals, and (ii) the sample mean is at times higher than the sample 0.75-quantile. This is, for instance, the case around the end of 2013.

These initial findings suggest that variogram scores with \(p=0.5\) and \(p=1\) offer superior discrimination ability to the more popular energy score. The variogram score with \(p=2\) performs very poorly and may yield erroneous rankings of the forecasting models, even with a very large sample of scores.

5.2 Error rate comparison

Figure 4 shows the results from applying Eq. (12), with DCC-GARCH as the DGP, for all data sets and all competing models, using the scores for all t to generate the density. Each column of the figure illustrates the density of Eq. (12) for a specific misspecified model, under various scoring rules and data sets. The error rate in the upper right corner of each sub-figure shows the probability that Eq. (12) yields a negative value. For clarity, we do not use the same x-axis for all sub-figures but show all values between the 0.001- and 0.999-quantiles of each distribution. This means that the magnitude of the error is not visible in these figures; instead, we gain insight into the shape of the error density.
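
The following is a minimal sketch of the error-rate and density computation behind Fig. 4, under the assumption that Eq. (12) is the per-period difference between the score of a misspecified model and that of the DGP, so that negative values correspond to the shaded areas; the score series used in the example are hypothetical.

```python
# Sketch of the quantities shown in Fig. 4: score differences, their
# Gaussian-kernel density and the error rate (share of negative differences).
import numpy as np
from scipy.stats import gaussian_kde

def score_differences(scores_model, scores_dgp):
    # Assumed form of Eq. (12): s(misspecified) - s(DGP), negative = error.
    return np.asarray(scores_model) - np.asarray(scores_dgp)

def error_rate(diffs):
    """Share of periods in which the misspecified model beats the DGP."""
    return np.mean(diffs < 0)

def smoothed_density(diffs, grid_size=512):
    """Kernel density of the differences, trimmed to the 0.001/0.999-quantiles."""
    lo, hi = np.quantile(diffs, [0.001, 0.999])
    grid = np.linspace(lo, hi, grid_size)
    return grid, gaussian_kde(diffs)(grid)

# Example with synthetic score series.
rng = np.random.default_rng(2)
d = score_differences(rng.gamma(2.1, 1.0, 5000), rng.gamma(2.0, 1.0, 5000))
print(error_rate(d))              # roughly the probability of the shaded area
grid, dens = smoothed_density(d)
```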

Fig. 4 Density of differences between scores with DCC-GARCH as DGP. This figure displays the density of the difference between the scores of the DGP and the misspecified models, as described in Eq. (12). A Gaussian kernel is used to smooth the densities. The shaded areas correspond to negative values, where a lower score is assigned to the misspecified models. The probability of the shaded area is displayed in the upper right corner of each sub-figure. The dotted vertical line shows the expectation of the density. For clarity, we limit each sub-figure to values between the 0.001- and 0.999-quantiles

Overall, Fig. 4 shows that the probability of obtaining scores which are lower than that of the DGP is quite high and varies significantly across cases. Depending on the data set and scoring rule, the average error rate (for each row of plots) varies between 31% and 54%. The variogram score with \(p=2\) in particular often assigns lower scores to misspecified models. This happens in 50%, 54% and 53% of cases for exchange rate returns, interest rate changes and commodity index returns respectively, which is around 60% worse than the error rate of the variogram score with \(p=0.5\) (with corresponding error rates of 31%, 32% and 32%). The variogram score with \(p=0.5\) achieves the lowest error rate, followed by the variogram score with \(p=1\) and the energy score, which both achieve error rates in the 39–42% range.

It is important to note that the error rate is only a binary statistic which does not take into account the magnitude by which the scores of misspecified models fall below that of the DGP. By averaging over a sample of scores, the error rate decreases until it reaches zero, due to the propriety of the scoring rules. The number of samples needed before the sample mean favours the DGP depends on the shape of the distribution. If the tail of the shaded area is small in comparison to the tail of the non-shaded area, a small sample might be sufficient. However, many of the distributions in Fig. 4 are approximately symmetric, which means that large positive and negative values of Eq. (12) are roughly equally likely. As an additional indicator of the convergence speed, we illustrate the expectation of each distribution with a dotted line. These expectations are always non-negative due to the propriety of the scoring rules, but an expectation far to the right of the shaded area corresponds to faster convergence towards lower mean relative scores for the DGP. Again, the values are generally close to zero, which suggests slow convergence towards positive sample mean score differences.
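
The sketch below illustrates this point about averaging: comparing block means of the score differences rather than individual differences lowers the error rate, at a speed that depends on the shape of the difference distribution. The block sizes and the synthetic difference series are purely illustrative assumptions.

```python
# Sketch: how the error rate falls when score differences are averaged over
# blocks of n observations before comparison. By propriety the rate tends to
# zero as n grows; distributions with mean close to zero converge slowly.
import numpy as np

def block_error_rate(diffs, n):
    """Share of length-n block means of the differences that remain negative."""
    diffs = np.asarray(diffs)
    k = len(diffs) // n
    blocks = diffs[: k * n].reshape(k, n)
    return np.mean(blocks.mean(axis=1) < 0)

rng = np.random.default_rng(3)
diffs = rng.normal(loc=0.05, scale=1.0, size=5000)   # small positive mean
for n in (1, 10, 100, 500):
    print(n, block_error_rate(diffs, n))
```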

Fig. 5 Error rates of scoring rules. The error rates show how often a misspecified model is assigned a lower score than the DGP. Higher values are associated with inferior scoring rules and more frequent wrong inferences

Figure 5 compares the average error rate of the multivariate scoring rules over all DGPs for the evaluation period. Similar to Fig. 4, we examine the number of times the score of a misspecified model is lower than that of the DGP, but we now average the error rate across multiple choices of the DGP. The conclusions from Fig. 5 are similar to those from Fig. 4, although the magnitude of the error rates is lower than for the case with only DCC-GARCH as DGP. The variogram score with \(p=2\) has a significantly higher error rate, more than 47% above that of the variogram score with \(p=0.5\). This is also consistent with Fig. 3, where misspecified models were sometimes preferred over the DGP. Again, there is typically a clear ranking of the scoring rules that persists across all three data sets and the entire evaluation period. For the variogram scores, the error rate increases with the parameter p, and the error rate of the energy score typically falls between those of the variogram scores with \(p=1\) and \(p=2\), although interestingly it often outperforms the variogram score with \(p=1\) for the interest rate data.
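
As a rough illustration of the aggregation behind Fig. 5, the sketch below averages the error rate of a scoring rule over all (DGP, misspecified model) pairs; the nested dictionary layout and the synthetic score series are assumptions made purely for illustration.

```python
# Sketch: average error rate of one scoring rule across several DGPs.
# scores[dgp][model] is a hypothetical array of per-period scores obtained
# when data are simulated from `dgp`; the DGP's own scores sit under its name.
import numpy as np

def average_error_rate(scores):
    """Mean share of periods in which a misspecified model beats its DGP."""
    rates = []
    for dgp, by_model in scores.items():
        s_dgp = np.asarray(by_model[dgp])
        for model, s in by_model.items():
            if model != dgp:
                rates.append(np.mean(np.asarray(s) < s_dgp))
    return float(np.mean(rates))

# Tiny synthetic example with two DGPs and two competing models.
rng = np.random.default_rng(5)
scores = {
    dgp: {m: rng.gamma(2.0 + 0.1 * (m != dgp), 1.0, 1000)
          for m in ("DCC-GARCH", "CCC-GARCH")}
    for dgp in ("DCC-GARCH", "CCC-GARCH")
}
print(average_error_rate(scores))
```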

5.3 Generalized discrimination heuristic

Through the consideration of multiple models, we go beyond ceteris paribus sensitivities to obtain more general results. Our misspecified models combine several misspecifications at once and are therefore closer to the settings in which proper scoring rules are applied in practice. Our generalized discrimination heuristic, defined in Eq. (13), allows us to test discrimination ability against multiple competing models at once. Note that we do not subtract the scores of the DGP from those of the misspecified models in the numerator, but this does not affect the rankings of the scoring rules. Figure 6 depicts the discrimination heuristic for each scoring rule over time on a logarithmic scale. Unlike Fig. 3, results are illustrated for all DGPs (eight rows of plots) and all data sets (three columns of plots).
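
Since Eq. (13) is not restated in this section, the sketch below assumes the discrimination heuristic is, at each period, an average over the misspecified models of their score relative to that of the DGP (with no subtraction of the DGP score in the numerator, as noted above), smoothed with an 8-observation moving average as in Fig. 6; the score series are hypothetical.

```python
# Sketch of an assumed form of the discrimination heuristic and the smoothing
# used in Fig. 6. Values above 1 indicate that the misspecified models receive
# higher (worse) scores than the DGP on average.
import numpy as np

def discrimination_heuristic(model_scores, dgp_scores):
    """model_scores: (n_models, T) scores of misspecified models,
    dgp_scores: (T,) scores of the DGP. Returns a length-T series."""
    return np.mean(np.asarray(model_scores) / np.asarray(dgp_scores), axis=0)

def moving_average(x, window=8):
    """Simple moving average used only to improve readability of the series."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

# Example: three misspecified models over 500 evaluation periods.
rng = np.random.default_rng(4)
dgp = rng.gamma(2.0, 1.0, 500)
models = dgp * rng.uniform(1.0, 1.5, size=(3, 500))   # worse by construction
h = moving_average(discrimination_heuristic(models, dgp))
print(h.min(), h.max())
```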

Fig. 6 Discrimination heuristic of scoring rules. We display the discrimination heuristic of Eq. (13) for all three data sets and eight DGPs on a logarithmic scale. Scoring rules which separate the scores of misspecified models and the DGP by a larger relative distance are assigned higher values of the discrimination heuristic. We smooth the discrimination heuristic with a moving average of 8 observations to improve the interpretability of the figure, but the same patterns are present when no smoothing is applied

Figure 6 shows a clear distinction between scoring rules in most cases, but the preferences of the discrimination heuristic for the four rules can vary slightly depending on the data set, the time period and the DGP chosen. Overall, there are several key features which emerge:

1. The energy score is always the scoring rule with the lowest discrimination heuristic. This, again, is in accord with the prior simulation studies of Pinson and Tastu (2013) and Scheuerer and Hamill (2015). Across all data sets and DGPs, the energy score only receives an average discrimination heuristic of 1.23, compared to 2.79, 5.30 and 78.13 for the variogram scores with \(p=0.5\), \(p=1\) and \(p=2\) respectively.

2. In all cases, the variogram score with \(p=1\) is the scoring rule with the second highest discrimination heuristic.

3. In some settings the variogram score with \(p=2\) achieves extremely high values of the discrimination heuristic, but it is also the only scoring rule which receives values below 1. This occurs for commodity index returns with DCC-GARCH as DGP. For those t, the model ranking of the variogram score with \(p=2\) is erroneous and multiple misspecified models receive lower scores than the DGP.

4. The scoring rule with the highest discrimination heuristic varies depending on the choice of data and DGP, but exhibits a pattern. In most cases, the variogram score with \(p=0.5\) has the highest discrimination heuristic, but it is surpassed by the variogram score with \(p=2\) during some periods and always when FQ-AL models are used as DGP.

The high discrimination heuristic of the variogram score with \(p=2\) in some settings, despite its poor performance in Fig. 3, can be explained by Fig. 2 and our discussion of the effect of different choices of p for the variogram score. Generally, the variogram score with \(p=2\) outputs a large range of scores, some of which may be vastly larger in magnitude than others. These outliers shift the sample mean in Fig. 3 to a value larger than the sample 0.75-quantile and affect the discrimination heuristic to a similar extent. For instance, for exchange rate returns with DCC-GARCH as DGP, the largest summand of Eq. (13) takes a value around 4700. In comparison, the largest summands for the energy score and the variogram scores with \(p=0.5\) and \(p=1\) are 17, 76 and 141 respectively.

The quadratic nature of the variogram score with \(p=2\) further amplifies large distances between models. Therefore, the variogram score with \(p=2\) achieves a particularly high discrimination heuristic when the models are easily distinguishable. The cases where the variogram score with \(p=2\) has the highest discrimination heuristic mostly correspond to two scenarios:

1. Around the financial crisis in 2008, the differences between the distribution forecasts become easier to distinguish. This is because models with a calibration window of 2000 observations are much less affected by the abnormal values during the crisis than models with a calibration window of only 250 observations. Hence, the distribution forecasts may deviate more strongly between the competing models, and scoring rules may assign larger relative distances between the scores of misspecified models and those of the DGP.

2. Similarly, the use of FQ-AL as DGP also increases the relative distances between the scores of the models. The Factor Quantile model typically produces a much sharper forecast than the alternative models and is therefore more easily identifiable as the DGP.

In those two cases, all scoring rules identify the DGP among the misspecified models quite clearly, so the even larger relative distances produced by the variogram score with \(p=2\) arguably bring no additional benefit. At the same time, this scoring rule suffers from erroneous rankings, despite having a high discrimination heuristic in some settings. These issues clearly reveal that the discrimination heuristic should be considered one possible indicator of the goodness of scoring rules, but by itself it is inadequate to quantify their discrimination ability. A large heuristic does not imply more robust or less erroneous rankings. Therefore, a high discrimination heuristic between the models is not useful unless it is accompanied by a low error rate, as given by the percentage of times the score chooses a misspecified model over the true DGP.

6 Conclusions

Although all proper scoring rules enable honest assessment of forecasting performance, they inevitably have different geometries and finite sample properties. However, the comparison between proper scoring rules has not been adequately studied in the literature to date, especially in multivariate settings. The main theoretical contribution of this paper is to introduce new discrimination metrics for comparing proper scoring rules. The main practical contribution is to apply these metrics in an extensive investigation of the discrimination ability of the most common multivariate proper scoring rules, comparing the performance of four popular choices across a variety of time periods, financial data sets and competing stochastic models. And the main methodological contribution is the novel construction of our simulation study, which rotates the true model across a selection of eight realistic models with different static or dynamic joint distributions and different calibration windows. In this way, our study provides a much less restrictive and more comprehensive set of results than the existing literature in this area.

The simulation results do, of course, vary by scenario. However, this finding is consistent with the existing literature in which it is rarely seen that one metric for evaluating the accuracy of a forecast consistently outperforms another metric, on every single scenario—see, for example, Ziel and Berk (2019). Nevertheless, some clear overall properties and insights do emerge. In particular: the energy score generally has poorer discrimination ability than the other scores; the variogram score with \(p=2\) can appear to provide strong discrimination in certain cases but on the other hand is prone to high error rates as it frequently fails to identify the true model that generated the data; and the variogram score with \(p=0.5\) consistently maintains the lowest error rate, even though it is sometimes outperformed by the variogram score with \(p=1\) in terms of discrimination ability. We can explain these results by observing that small differences between models are magnified more when \(p=0.5\), but large differences are easier to detect when \(p=1\) or 2.

Some of our results are corroborated by previous research, although those studies are much more limited than ours. For instance, Pinson and Tastu (2013) find that the energy score lacks sensitivity to errors in correlation, but only in a bivariate Gaussian system. We find the same result, but now in a much more general setting: using three different eight-dimensional systems and simulating data from models that are considered best in class for each system. Additionally, our preference for the variogram score with \(p=0.5\) is supported by the study of Scheuerer and Hamill (2015), even though that paper only employs some very simple true models.

On the whole, the variogram score with \(p=0.5\) seems to perform best in most settings, but another important conclusion is that one should typically consider multiple performance metrics and be aware that different scoring rules tend to behave differently, depending on the model settings. Undoubtedly, such insights can be of great value when deciding which scoring rule to use in a wide variety of multivariate forecasting applications.

The design of our simulation experiment is, to the best of our knowledge, unique, as are the new metrics for discriminating between different performance metrics. Taken together, we believe these pave the way for a new strand of research into risk and uncertainty modelling. In this paper we have applied the simulation study to evaluate the discrimination ability of proper multivariate scoring rules when applied to static and dynamic forecasting models calibrated to systems of foreign exchange rates, interest rates and commodity futures. However, there is huge scope for additional research: using other performance metrics, not just proper scoring rules; using different static or dynamic models; and/or applying the simulation design and discrimination metrics to other multivariate systems in both risk management and insurance.