A multiple-baseline design (MBD) is one of the variants of single-subject experimental designs (SSEDs). SSED researchers observe and measure a participant or case repeatedly over time. Observations are obtained during at least one baseline phase (when no intervention is present) and at least one treatment phase (when an intervention is present). By comparing scores from both kinds of phases, SSED researchers can assess whether the outcome scores on the dependent variable changed, for instance, in level or in slope when the treatment was present (Onghena & Edgington, 2005).

In an MBD, an AB phase design (with one baseline phase, A, and one treatment phase, B) is implemented for each of several participants, behaviors, or settings (Barlow & Hersen, 1984; Ferron & Scott, 2005; Onghena, 2005; Onghena & Edgington, 2005). MBDs are popular among SSED researchers (Shadish & Sullivan, 2011) because the intervention is introduced sequentially over the participants (or settings and behaviors), which entails the advantage that researchers can more easily disentangle effects of the intervention from effects of external events, such as the illness of a teacher, an exciting class activity, the presence of a foreign observer, or a teacher intern (Baer, Wolf, & Risley, 1968; Barlow & Hersen, 1984; Kinugasa, Cerin, & Hooper, 2004; Koehler & Levin, 2000). The reason is that, if an external event occurs at certain points in time, the outcome scores of all participants in the study might be influenced simultaneously, whereas the treatment starts at a different point in time for each participant. Figure 1 gives a graphical presentation of the possible consequences of the occurrence of an external event in a multiple-baseline across-participants design. In Fig. 1a, the external event has a constant effect on the dependent variable over subsequent measurement occasions; for instance, the teacher is ill on subsequent days, or a foreign observer is present during some measurement occasions. Figure 1b illustrates an external event effect that gradually fades away; for instance, the influence of a teacher intern on the behavior of the students may diminish over time.

Fig. 1 Graphical display of a constant external event effect (a) and a gradually fading external event effect (b) affecting the scores on four subsequent moments (days 17, 19, 21, and 23) in a multiple-baseline design across 3 participants, with the treatment starting on day 6, day 16, and day 24, respectively

Van den Noortgate and Onghena (2003) proposed the use of multilevel models to synthesize data from multiple SSED studies, allowing investigation of the generalizability of the results and exploration of potential moderating effects. In previous research evaluating this multilevel meta-analysis of MBD data (Ferron, Bell, Hess, Rendina-Gobioff, & Hibbard, 2009; Ferron, Farmer, & Owens, 2010; Moeyaert, Ugille, Ferron, Beretvas, & Van den Noortgate, 2012a, 2012b; Owens & Ferron, 2012), the data were typically simulated with a treatment effect and random noise only. Potential confounding events that could have a simultaneous effect on all participants within a study were not taken into account. In this study, we evaluate the performance of the basic three-level model when there are effects of external events, as well as the performance of an extension of the model that tries to account for such event effects. In the following, we first present the basic model and a possible extension to account for external events. Next, we evaluate the performance of both models by means of a simulation study and an analysis of real data.

Three-level meta-analysis

A meta-analysis combines the results of several studies addressing the same research question (Cooper, 2010; Glass, 1976). Study results are typically first converted to a common standardized effect size before meta-analyzing them. The effect sizes may be reported in the primary studies or can be calculated afterward, using reported summary and/or test statistics.

One possible way to calculate effect sizes when SSEDs are used is to analyze the data using regression models and to use the regression coefficients as effect sizes. A regression model of interest here is the one proposed by Center, Skiba, and Casey (1985–1986):

$$ Y_i = \beta_0 + \beta_1 T_i + \beta_2 D_i + \beta_3 T'_i D_i + e_i. $$
(1)

The score on the dependent variable on measurement occasion i (\( Y_i \)) depends on a dummy-coded variable (\( D_i \)) indicating whether measurement occasion i belongs to the baseline phase (\( D_i = 0 \)) or the treatment phase (\( D_i = 1 \)); a time-related variable \( T_i \), which equals 1 on the first measurement occasion of the baseline phase; and an interaction term between the centered time indicator and the dummy variable, \( T'_i D_i \), where \( T'_i \) is centered such that it equals 0 on the first measurement occasion of the treatment phase. \( \beta_0 \) indicates the expected baseline level, \( \beta_1 \) is the linear trend during the baseline, \( \beta_2 \) refers to the immediate treatment effect, and \( \beta_3 \) refers to the effect of the treatment on the time trend.
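To make this concrete, the following minimal sketch fits Eq. 1 to one participant's series by ordinary least squares; the toy scores, the baseline length, and the helper name `fit_center_model` are ours for illustration, not taken from any of the cited studies. The squared standard error of \( b_2 \) computed here is the sampling variance used at the first level of the meta-analysis below (Eq. 2).

```python
import numpy as np

def fit_center_model(y, n_baseline):
    """Fit Eq. 1 to one participant's series; return coefficients and SEs."""
    n = len(y)
    T = np.arange(1, n + 1)             # time: 1 on the first baseline occasion
    D = (T > n_baseline).astype(float)  # 0 = baseline phase, 1 = treatment phase
    Tc = T - (n_baseline + 1)           # T': 0 on the first treatment occasion
    X = np.column_stack([np.ones(n), T, D, Tc * D])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - X.shape[1])               # residual variance
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))  # standard errors
    return b, se

y = np.array([1, 2, 1, 2, 2, 5, 6, 6, 7, 7], dtype=float)  # toy scores
b, se = fit_center_model(y, n_baseline=5)
print(b[2], b[3], se[2] ** 2)  # b2, b3, and the sampling variance of b2
```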

Van den Noortgate and Onghena (2003) proposed using the ordinary least squares estimates of \( \beta_2 \) and \( \beta_3 \) from Eq. 1 as effect sizes in the three-level meta-analysis. At the first level, the estimated effect sizes of the immediate treatment effect, \( b_{2jk} \), and of the treatment effect on the time trend, \( b_{3jk} \), for participant j from study k are equal to the unknown population effect sizes, \( \beta_{2jk} \) and \( \beta_{3jk} \), respectively, plus random deviations, \( r_{2jk} \) and \( r_{3jk} \), that are assumed to be normally distributed with a mean of zero:

$$ \begin{array}{*{20}c} {{b_{2jk }}={\beta_{2jk }}+{r_{2jk }}} \hfill & {\mathrm{with}} \hfill & {{r_{2jk }}\sim N\left( {0,\sigma_{{{r_{2jk}}}}^2} \right)} \hfill \\ {{b_{3jk }}={\beta_{3jk }}+{r_{3jk }}} \hfill & {\mathrm{with}} \hfill & {{r_{3jk }}\sim N\left( {0,\sigma_{{{r_{3jk}}}}^2} \right)} \hfill \\ \end{array} $$
(2)

The sampling variances of the observed effects, \( \sigma_{{{r_{2jk}}}}^2 \) and \( \sigma_{{{r_{3jk}}}}^2, \) are the squared standard errors that are typically reported by default when a regression analysis is performed. These variances depend to a large extent on the number of observations and the variance of these observations and, therefore, can be participant and study specific. At the second level, the population effect sizes \( {\beta_{2jk }} \) and \( {\beta_{3jk }} \) from Eq. 2 can be modeled as varying over participants around the study-specific mean effect, \( {\theta_{20k }} \) and \( {\theta_{30k }} \) (Eq. 3):

$$ \begin{array}{*{20}c} {{\beta_{2jk }}={\theta_{20k }}+{u_{2jk }}} \hfill & {\mathrm{with}} \hfill & {{u_{2jk }}\sim N\left( {0,\sigma_{{{u_{2jk}}}}^2} \right)} \hfill \\ {{\beta_{3jk }}={\theta_{30k }}+{u_{3jk }}} \hfill & {\mathrm{with}} \hfill & {{u_{3jk }}\sim N\left( {0,\sigma_{{{u_{3jk}}}}^2} \right)} \hfill \\ \end{array} $$
(3)

The study-specific mean effects can, in turn, vary between studies (third level, Eq. 4):

$$ \begin{array}{*{20}c} {{\theta_{20k }}={\gamma_{200 }}+{v_{20k }}} \hfill & {\mathrm{with}} \hfill & {{v_{20k }}\sim N\left( {0,\sigma_{{{v_{20k}}}}^2} \right)} \hfill \\ {{\theta_{30k }}={\gamma_{300 }}+{v_{30k }}} \hfill & {\mathrm{with}} \hfill & {{v_{30k }}\sim N\left( {0,\sigma_{{{v_{30k}}}}^2} \right)} \hfill \\ \end{array} $$
(4)

The model parameters that we are typically interested in when using a multilevel model are the fixed effects regression coefficients (i.e., \( {\gamma_{200 }} \), referring to the average immediate treatment effect over participants and studies, and \( {\gamma_{300 }} \), referring to the average treatment effect on the linear trend over participants and studies in Eq. 4) and the variances (i.e., \( \sigma_{{{v_{20k}}}}^2 \), referring to the between-study variance for the estimated immediate treatment effect; \( \sigma_{{{v_{30k}}}}^2 \), indicating the between-study variance for the estimated treatment effect on the time trend; \( \sigma_{{{u_{2jk}}}}^2 \), the between-case variance for the estimated immediate treatment effect; and \( \sigma_{{{u_{3jk}}}}^2 \), referring to the between-case variance of the estimated treatment effect on the time trend).

Correcting effect sizes for external events

External events in a multiple-baseline across-participants design can have an effect on the outcome score(s) of all participants within a study. Such external event effects are common in SSEDs, because practitioners often implement these designs in their everyday setting (for example, in the home or school), where they cannot control for factors outside the experiment (Christ, 2007; Kratochwill et al., 2010; Shadish, Cook, & Campbell, 2002). If we do not model these external events, the results might be biased. For instance, suppose that a researcher is interested in a change in challenging behavior and staggers the beginning of the treatment across 3 participants. The 3 participants receive the treatment at day 6, day 16, and day 24, respectively (see Fig. 1), and are observed every 2 days. On days 17, 19, 21, and 23, the teacher is ill; as a consequence, a substitute teacher takes his or her place, and the participants exhibit more challenging behavior. In this situation, the estimated treatment effect will be too small for participants 1 and 2 and too large for participant 3, and the differences between participants in the treatment effects are therefore also likely to be overestimated, unless we correct the effect sizes for the external event.

A possible way to calculate effect sizes corrected for an external event in an SSED is to estimate the effect sizes per participant within each study, performing a regression analysis with a model that includes possible event effects and assuming that external events affect all participants in a study simultaneously. Thereafter, the corrected effect sizes can be combined over studies in the three-level meta-analysis.

For the first step, we propose to use an extension of the Center et al. (1985–1986) model, including dummy variables for measurement occasions:

$$ {Y_{ij }}={\beta_{0j }}+{\beta_{1j }}{T_{ij }}+{\beta_{2j }}{D_{ij }}+{\beta_{3j }}{D_{ij }}T_{ij}^{\prime }+\sum\nolimits_{m=2}^{I-1 } {{\beta_{{\left( {m+2} \right)}}}{M_{mi }}+{e_{ij }}.} $$
(5)

The score on the dependent variable Y on measurement occasion i (= 1, 2, . . . , I) for participant j (= 1, 2, . . . , J) is modeled as a linear function of the dummy-coded variable (\( D_{ij} \)) indicating whether measurement occasion i for participant j belongs to the baseline phase (\( D_{ij} = 0 \)) or the treatment phase (\( D_{ij} = 1 \)); a time-related variable \( T_{ij} \), which equals 1 at the start of the baseline phase; an interaction term between the phase dummy and the time indicator centered around its value at the start of the treatment phase, \( D_{ij} T'_{ij} \); and, finally, dummy-coded variables indicating the moment (\( M_{mi} = 1 \) if m = i, zero otherwise). By including the effects of the individual moments, the coefficients \( \beta_{2j} \) and \( \beta_{3j} \) can be interpreted as treatment effects corrected for possible external events.

We do not include a dummy variable for one measurement moment in the baseline phase and one measurement moment in the treatment phase. This ensures that the model is identified: if we included these parameters as well, an increase in the effects for each moment in the baseline phase could be compensated for by a decrease of the intercept, so that there would be an infinite number of equivalent solutions. For our study, we set the moment effects of the first and the last measurement occasions to zero, but different moments can be chosen if we suspect a moment effect at one of these times.
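As a quick check of this identification argument, the following numpy sketch builds the design matrix of the extended model for a hypothetical two-participant study (I = 10, treatment starting at occasions 4 and 7; participant dummies as in Eq. 6 below). With a dummy for every moment, the moment dummies reproduce the participant intercepts and the time trend, so the matrix is rank deficient; dropping the first and the last moment dummies restores full column rank.

```python
import numpy as np

I, starts = 10, (4, 7)   # hypothetical series length and treatment starts
rows = []
for j, s in enumerate(starts):
    for i in range(1, I + 1):
        P = [float(j == k) for k in range(2)]        # participant dummies
        D, Tc = float(i >= s), i - s                 # phase dummy, centered time
        rows.append([P[0], P[1], i * P[0], i * P[1], D * P[0], D * P[1],
                     Tc * D * P[0], Tc * D * P[1]]
                    + [float(i == m) for m in range(1, I + 1)])  # all I moment dummies
X_all = np.asarray(rows)
X_identified = np.delete(X_all, [8, 8 + I - 1], axis=1)  # drop moments 1 and I
print(np.linalg.matrix_rank(X_all), X_all.shape[1])                # rank 16 < 18 columns
print(np.linalg.matrix_rank(X_identified), X_identified.shape[1])  # full rank: 16 = 16
```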

While the baseline level and slope (\( \beta_{0j} \) and \( \beta_{1j} \)) and both treatment effects (\( \beta_{2j} \) and \( \beta_{3j} \)) are participant specific, the moment effects are assumed to be the same for all participants from the same study and, therefore, have to be estimated per study, using all data from that study. To this end, we propose to extend Eq. 5 by including a set of participant dummy indicators. For 2 participants, with dummy participant indicators \( P_1 \) and \( P_2 \), this results in Eq. 6:

$$ {Y_{ij }}={\beta_{01 }}{P_{1j }}+{\beta_{02 }}{P_{2j }}+{\beta_{11 }}{T_{i1 }}{P_{1j }}+{\beta_{12 }}{T_{i2 }}{P_{2j }}+{\beta_{21 }}{D_{i1 }}{P_{1j }}+{\beta_{22 }}{D_{i2 }}{P_{2j }}+{\beta_{31 }}{D_{i1 }}T_{i1}^{\prime }{P_{1j }}+{\beta_{32 }}{D_{i2 }}T_{i2}^{\prime }{P_{2j }}+\sum\nolimits_{m=2}^{I-1 } {{\beta_{{\left( {m+2} \right)}}}{M_{mi }}+{e_{ij }}} $$
(6)

After using Eq. 6 for each study to estimate the corrected effect sizes (\( \beta_{2j} \) and \( \beta_{3j} \)) for each participant, we can use the three-level meta-analysis (see Eqs. 2–4) to combine the corrected effect size estimates from multiple participants and studies. In principle, we could also use a two-level model per study to estimate the participant-specific effects, but given the typically very small number of participants per study, using a multilevel model per study might not be recommended.
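A minimal end-to-end sketch of this first step, under the same invented setup as above: simulate one two-participant study with a common event at two occasions, fit Eq. 6 by OLS, and read off the corrected immediate treatment effects (the coefficients of \( D_{i1}P_{1j} \) and \( D_{i2}P_{2j} \)). All numbers are illustrative, not from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)
I, starts = 12, (5, 8)            # hypothetical series length and start occasions
event = {5: 2.0, 6: 2.0}          # hypothetical external event at occasions 5-6

def design_row(j, i):
    P = [float(j == k) for k in range(2)]
    D, Tc = float(i >= starts[j]), i - starts[j]
    return ([P[0], P[1], i * P[0], i * P[1], D * P[0], D * P[1],
             Tc * D * P[0], Tc * D * P[1]]
            + [float(i == m) for m in range(2, I)])   # moment dummies 2 .. I-1

X, y = [], []
for j in range(2):
    for i in range(1, I + 1):
        X.append(design_row(j, i))
        # true immediate effect of 2, no trends, common event effect, plus noise
        y.append(2.0 * (i >= starts[j]) + event.get(i, 0.0) + rng.normal(0, 0.5))
X, y = np.asarray(X), np.asarray(y)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("corrected immediate effects:", b[4], b[5])  # both should be close to 2
```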

A simulation study

Simulating three-level data

To evaluate the performance of the basic model and its extension, we performed a simulation study. We simulated raw data using a three-level model. At level 1, we used the following model:

$$ \begin{array}{*{20}c} {{Y_{ijk }}={\beta_{0jk }}+{\beta_{1jk }}{T_{ijk }}+{\beta_{2jk }}{D_{ijk }}+{\beta_{3jk }}{T_{ijk }}{D_{ijk }}+{e_{ijk }}} & {\mathrm{with}} & {{e_{ijk }}\sim N\left( {0,\sigma_e^2} \right)} \\ \end{array}, $$
(7)

with measurement occasions nested within participants, which form the units at level two:

$$ \begin{array}{*{20}c} {\left\{ {\begin{array}{*{20}c} {{\beta_{0jk }}={\theta_{00k }}+{u_{0jk }}} \\ {{\beta_{1jk }}={\theta_{10k }}+{u_{1jk }}} \\ {{\beta_{2jk }}={\theta_{20k }}+{u_{2jk }}} \\ {{\beta_{3jk }}={\theta_{30k }}+{u_{3jk }}} \\ \end{array}} \right.} & {\mathrm{with}} & {\left[ {\begin{array}{*{20}c} {{u_{0jk }}} \hfill \\ {{u_{1jk }}} \hfill \\ {{u_{2jk }}} \hfill \\ {{u_{3jk }}} \hfill \\ \end{array}} \right]} \\ \end{array}\sim N\left( {0,{\varSigma_u}} \right). $$
(8)

The participants are, in turn, clustered within studies at the third level:

$$ \begin{array}{*{20}c} {\left\{ {\begin{array}{*{20}c} {{\theta_{00k }}={\gamma_{000 }}+{v_{00k }}} \\ {{\theta_{10k }}={\gamma_{100 }}+{v_{10k }}} \\ {{\theta_{20k }}={\gamma_{200 }}+{v_{20k }}} \\ {{\theta_{30k }}={\gamma_{300 }}+{v_{30k }}} \\ \end{array}} \right.} & {\mathrm{with}} & {\left[ {\begin{array}{*{20}c} {{v_{00k }}} \hfill \\ {{v_{10k }}} \hfill \\ {{v_{20k }}} \hfill \\ {{v_{30k }}} \hfill \\ \end{array}} \right]} \\ \end{array}\sim N\left( {0,{\varSigma_v}} \right). $$
(9)
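For readers who want to replicate this setup, the sketch below is a compact Python stand-in for the data generation in Eqs. 7–9 (the simulation itself was run in SAS, as noted below); the start occasions and parameter values shown are placeholders for the design choices listed in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_study(I, J, starts, gamma=(0.0, 0.0, 2.0, 0.2),
                   sigma2_v=(2, 0.2, 2, 0.2), sigma2_u=(2, 0.2, 2, 0.2),
                   sigma2_e=1.0):
    """Return an I x J array of scores for one simulated study (Eqs. 7-9)."""
    theta = np.asarray(gamma) + rng.normal(0, np.sqrt(sigma2_v))   # Eq. 9
    Y = np.empty((I, J))
    T = np.arange(1, I + 1)
    for j in range(J):
        beta = theta + rng.normal(0, np.sqrt(sigma2_u))            # Eq. 8
        D = (T >= starts[j]).astype(float)
        Y[:, j] = (beta[0] + beta[1] * T + beta[2] * D
                   + beta[3] * T * D
                   + rng.normal(0, np.sqrt(sigma2_e), I))          # Eq. 7
    return Y

Y = simulate_study(I=15, J=4, starts=(5, 7, 9, 11))  # placeholder start times
```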

Varying parameters

On the basis of a thorough overview of 809 SSED studies, Shadish and Sullivan (2011) enumerated some parameters that characterize SSEDs. On the basis of their results and our reanalyses of meta-analyses of SSEDs (Alen, Grietens, & Van den Noortgate, 2009; Denis, Van den Noortgate, & Maes, 2011; Ferron et al., 2010; Kokina & Kern, 2010; Shadish & Sullivan, 2011; Shogren, Faggella-Luby, Bae, & Wehmeyer, 2004; Wang, Cui, & Parrila, 2011), we decided to vary the following parameters that can have a significant influence on the quality of model estimation:

  • \( \gamma_{200} \) represents the immediate treatment effect on the outcome and had values 0 (no effect) or 2.

  • The treatment effect on the time trend, defined by \( {\gamma_{300 }} \), was varied to have values 0 (no effect) or 0.2.

  • The regression coefficients of the baseline, \( \gamma_{000} \) and \( \gamma_{100} \), did not vary and were set to 0, because the interest is in the overall treatment effects (i.e., the immediate treatment effect and the treatment effect on the time trend).

  • The number of simulated participants, J, equaled 4 or 7.

  • The number of measurements within a participant, I, was 15 or 30. We chose to keep I constant for all participants within the same study.

  • The number of studies, K, was 10 or 30.

  • The between-case covariance matrix: Covariances between pairs of regression coefficients were set to zero. Therefore, \( {\varSigma_u} \) is a diagonal matrix: \( \sum\nolimits_u {=\mathrm{diag}\;\left( {\sigma_{u_0}^2,\sigma_{u_1}^2,\sigma_{u_2}^2,\sigma_{u_3}^2} \right)} =\mathrm{diag}\;\left( {2,0.2,2,0.2} \right) \) or \( \sum\nolimits_u {=\mathrm{diag}\;\left( {\sigma_{u_0}^2,\sigma_{u_1}^2,\sigma_{u_2}^2,\sigma_{u_3}^2} \right)} =\mathrm{diag}\;\left( {0.5,0.05,0.5,0.05} \right) \).

  • The between-study covariance matrix: Covariances between pairs of regression coefficients were set to zero. Therefore, \( {\varSigma_v} \) is a diagonal matrix: \( \sum\nolimits_v {=\mathrm{diag}\;\left( {\sigma_{v_0}^2,\sigma_{v_1}^2,\sigma_{v_2}^2,\sigma_{v_3}^2} \right)} =\mathrm{diag}\;\left( {2,0.2,2,0.2} \right) \) or \( \sum\nolimits_v {=\mathrm{diag}\;\left( {\sigma_{v_0}^2,\sigma_{v_1}^2,\sigma_{v_2}^2,\sigma_{v_3}^2} \right)} =\mathrm{diag}\;\left( {0.5,0.05,0.5,0.05} \right) \).

  • The moment of introducing a treatment effect was staggered across participants within a study (see Table 1), depending on the number of measurements.

    Table 1 Time of introducing the treatment

In a first scenario, a constant external event effect was added that influenced four subsequent scores of all the participants within a study (as in Fig. 1a). The onset moment was randomly generated from a uniform distribution for each study separately. Because we did not include moment effects for the first and the last measurement occasions in order to keep the model identified, the external event effect did not occur on these occasions. The external event effect was 0 or 2, representing a null and a large external event effect, respectively.

In a second scenario, an external event effect that gradually fades away (see Fig. 1b) was added for all the participants within a study. The effect on the four event occasions was 3.5, 2.5, 1.5, and 0.5, respectively, and 0 thereafter, so that, on average, the overall effect was the same as in the first scenario. The onset of the event was again generated at random from a uniform distribution for each study separately, such that the external event effect did not occur on the first or the last measurement occasion. Data were generated using SAS.
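A small sketch of how the two event scenarios could be imposed on simulated scores (the helper name and the onset rule are ours; the effect sizes follow the text above). All participants in a study share the same onset, drawn so that the event never covers the first or the last measurement occasion.

```python
import numpy as np

rng = np.random.default_rng(7)

def add_event(Y, shape="constant"):
    """Add a common external event effect to an I x J array of scores."""
    I = Y.shape[0]
    effects = {"constant": [2.0, 2.0, 2.0, 2.0],
               "fading": [3.5, 2.5, 1.5, 0.5]}[shape]
    onset = int(rng.integers(1, I - len(effects)))  # avoids first/last occasion
    Y = Y.copy()
    for k, e in enumerate(effects):
        Y[onset + k, :] += e          # same effect for every participant
    return Y
```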

Analysis

We had a total of \( 2^9 \) (= 512) experimental conditions. We simulated 400 replications of each condition, resulting in 204,800 data sets to analyze. We analyzed each data set twice and compared the results. First, we combined the uncorrected effect sizes in the three-level meta-analysis. Next, we analyzed the three-level data by estimating the corrected effect sizes, \( \beta_{2j} \) and \( \beta_{3j} \), using the regression analysis per study (see Eq. 5), before combining them in the three-level meta-analysis (see Eqs. 2–4).

In both approaches, we used SAS proc MIXED (Littell, Milliken, Stroup, Wolfinger, & Schabenberger, 2006) to estimate the participant-specific effect sizes, \( \beta_{2jk} \) and \( \beta_{3jk} \). In the first approach, the effect sizes were not corrected for the external event effect, whereas in the second approach they were.

SAS proc MIXED was also used for the three-level meta-analysis. The Satterthwaite method for estimating the degrees of freedom was applied, because this method provides more accurate confidence intervals for estimates of the average treatment effect in two-level analyses of multiple-baseline data (Ferron et al., 2009).

In order to evaluate the appropriateness of both models, uncorrected and corrected for external events, we calculated the deviations of the estimated immediate treatment effect, \( {{\widehat{\gamma}}_{200 }} \), from its population value, \( {\gamma_2}_{00 } \), and the deviations of the estimated treatment effect on the time trend, \( {{\widehat{\gamma}}_{300 }} \), from its population value, \( {\gamma_3}_{00 } \). The mean deviation gives us an idea of the bias. Next, we calculated the mean squared deviation (the mean squared error [MSE]), which gives information about the variance of both estimated treatment effects (\( {{\widehat{\gamma}}_{200 }} \) and \( {{\widehat{\gamma}}_{300 }} \)) around the corresponding population effect (\( {\gamma_2}_{00 } \) and \( {\gamma_3}_{00 } \)). Furthermore, we discuss the standard error and the 95 % confidence interval coverage proportion (CP) of the estimated immediate treatment effect and the treatment effect on the time trend. We also evaluate the bias of the point estimates of the between-study and between-case variances.
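The following sketch restates these criteria in code (our helper names; the Satterthwaite-based critical values are assumed to be available from the model fits), computing the bias, the MSE, and the CP over the replications of one condition:

```python
import numpy as np

def evaluate_condition(estimates, ses, t_crit, true_value):
    """estimates, ses, t_crit: arrays over the replications of one condition."""
    dev = estimates - true_value
    bias = dev.mean()                      # mean deviation
    mse = (dev ** 2).mean()                # mean squared deviation (MSE)
    lower = estimates - t_crit * ses       # 95 % confidence limits
    upper = estimates + t_crit * ses
    coverage = np.mean((lower <= true_value) & (true_value <= upper))
    return bias, mse, coverage
```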

We used ANOVAs to evaluate whether there were significant effects (α = .01) of the model type (i.e., the model using effect sizes corrected vs. uncorrected for external event effects) and of the simulation design parameters (\( \gamma_{200} \), \( \gamma_{300} \), K, I, J, \( \sigma_{u_2}^2 \), \( \sigma_{v_2}^2 \)) on the bias, the MSE, the standard error, and the CP.

Results of the simulation study

We present the results in two sections. In the first section, we discuss the constant external effect over four subsequent measurement occasions. The second section considers the case where the external effect gradually fades away over four subsequent measurements. Each section presents the results of the three-level analysis of uncorrected and corrected effect sizes.

When there is no external event effect, the results of the three-level meta-analysis (i.e., bias in the fixed effects, MSE of the fixed effects, estimated standard errors of the fixed effects, CP for the fixed effects, and bias in the variance components) were found to be independent of the model type (corrected or not corrected for external events).

We found no significant bias for \( \widehat{\gamma}_{200} \) and \( \widehat{\gamma}_{300} \) under either model when there was no external event effect. Therefore, in the following, we discuss only the results for the conditions that did include an external event effect.

Constant external event over four subsequent measurement occasions

Overall treatment effect

Bias

When we estimate \( {\gamma_2}_{00 } \) and the effect sizes are uncorrected, the estimated treatment effect is, on average, significantly larger than the population value (\( {\gamma_2}_{00 } \) = 0 or 2). Over all conditions, the bias equals 0.032, t(51199) = 17.32, p < .0001, whereas there is no significant bias for the corrected effect sizes (−0.0015), t(51199) = −0.96, p = .34. Table 2 presents the bias estimates for \( {{\widehat{\gamma}}_{200 }} \), when \( {\gamma_2}_{00 } \) = 2 and \( {\gamma_3}_{00 } \) = 0.2.

Table 2 The bias of \( {{\widehat{\gamma}}_{200 }} \) in the \( {\gamma_2}_{00 } \) = 2, and \( {\gamma_3}_{00 } \) = 0.2 conditions for the constant external event effect over four subsequent measurement occasions

Similar results are obtained for \( \widehat{\gamma}_{300} \). The bias is significantly negative for the uncorrected effect sizes and equals −0.20, t(51199) = −255.27, p < .0001, whereas the bias for the corrected effect sizes (−0.00020) is not significant (p = .79). Moreover, an analysis of variance on the deviations reveals a significant difference between the two models, for both \( \widehat{\gamma}_{200} \), F(1, 102398) = 192.06, p < .0001, and \( \widehat{\gamma}_{300} \), F(1, 102398) = 33,695.1, p < .0001. The differences are largest when the number of measurement occasions (I = 15) and the number of studies (K = 10) are small. The largest difference (0.23) was found in the condition with \( \gamma_{200} \) = 2, \( \gamma_{300} \) = 0, K = 10, I = 15, J = 4, \( \sigma_{u_2}^2 \) = 0.5, and \( \sigma_{v_2}^2 \) = 2.

MSE

Similar to the bias, the MSE of the estimated treatment effect depends significantly on the model type; using an analysis of variance on the squared deviations, F(1, 102398) = 882.77, p < .0001 for \( {{\widehat{\gamma}}_{200 }} \) and F(1, 102398) = 7,076.91, p < .0001 for \( {{\widehat{\gamma}}_{300 }} \). When using the corrected model, the MSE for \( {{\widehat{\gamma}}_{200 }} \) and \( {{\widehat{\gamma}}_{300 }} \) equals 0.12 and 0.028, respectively, whereas it is 0.18 and 0.070, respectively, for the uncorrected effect sizes. Differences between both models are larger if the number of observations and the number of studies are small (see Table 3 for \( {{\widehat{\gamma}}_{200 }} \); similar results are obtained for \( {{\widehat{\gamma}}_{300 }} \)). So especially in these conditions, the modified model is recommended.

Table 3 The MSE of \( {{\widehat{\gamma}}_{200 }} \) in the \( {\gamma_2}_{00 } \) = 2, and \( {\gamma_3}_{00 } \) = 0.2 conditions for the constant external event effect over four subsequent measurement occasions

Estimates of the standard errors

In order to evaluate inferences regarding the treatment effects, we constructed confidence intervals around the estimated treatment effects, \( \widehat{\gamma}_{200} \) and \( \widehat{\gamma}_{300} \). For this, we needed to estimate the standard errors of the estimated treatment effects. Because we obtained 400 estimates of the effects in each condition, the standard deviation of the effect estimates can be regarded as a relatively good estimate of the standard deviation of the sampling distribution and can, therefore, be used as a criterion against which to evaluate the standard errors. We looked at the relative standard error bias: the difference between the median standard error estimate and the standard deviation of the effect estimates, divided by that standard deviation. The relative differences are negative for \( \widehat{\gamma}_{200} \), which means that the median standard error estimates are smaller than expected. For \( \widehat{\gamma}_{300} \), these differences are positive, indicating median standard error estimates larger than expected. The relative standard error biases for both \( \widehat{\gamma}_{200} \) and \( \widehat{\gamma}_{300} \) are, on average across the conditions, larger for the uncorrected effect sizes than for the corrected effect sizes. For \( \widehat{\gamma}_{200} \), the average relative standard error biases equal −1.8 % and −2.0 % for the corrected and uncorrected models, respectively. For \( \widehat{\gamma}_{300} \), the average relative standard error bias is 2 % for the corrected model, whereas it is substantial (more than 10 %; Hoogland & Boomsma, 1998) for the uncorrected model (25.7 %). So the difference between the model types becomes more apparent when \( \gamma_{300} \) is estimated, F(1, 254) = 38.9, p < .0001. The conditions with the largest relative standard error bias for \( \widehat{\gamma}_{300} \) under the uncorrected model tended to coincide with the conditions involving 30 studies, an immediate treatment effect of 2, and a treatment effect on the time trend of 0.2, with the bias mounting to 107 % in the condition where \( \gamma_{200} \) = 2, \( \gamma_{300} \) = 0.2, K = 30, J = 7, I = 30, \( \sigma_{v_2}^2 \) = 0.5, and \( \sigma_{u_2}^2 \) = 0.5.
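In code, this criterion amounts to the following (a restatement with our names, where `ses` and `estimates` collect the standard error estimates and effect estimates over the 400 replications of one condition):

```python
import numpy as np

def relative_se_bias(ses, estimates):
    """(median estimated SE - empirical SD of the estimates) / empirical SD."""
    sd = np.std(estimates, ddof=1)
    return (np.median(ses) - sd) / sd
```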

Coverage proportion

We estimated the CP of the 95 % confidence intervals, which allowed us to evaluate the interval estimates of \( \widehat{\gamma}_{200} \) and \( \widehat{\gamma}_{300} \). The confidence intervals were constructed using the standard errors and the Satterthwaite estimated degrees of freedom, and the CP of these intervals was estimated for each combination of conditions. A significant positive difference between the corrected and the uncorrected model in the CP is found for \( \widehat{\gamma}_{200} \), F(1, 254) = 27.56, p < .0001 (see Table 4). Also, for \( \widehat{\gamma}_{300} \), the mean CP depends significantly on the model type, F(1, 254) = 20.96, p < .0001 (see Table 4). The conditions with a CP less than .93 all have 15 measurements in common and occur when the effect sizes are uncorrected, for both \( \widehat{\gamma}_{200} \) and \( \widehat{\gamma}_{300} \). Moreover, for \( \widehat{\gamma}_{300} \), the CP is not only too small when I = 15 and K = 30, but also too large when I = 30 (values for the CP range from .99 to 1.00). When the effect sizes are uncorrected, the CP is well estimated when I = 30 for \( \widehat{\gamma}_{200} \), and when I = 15 and K = 10 for \( \widehat{\gamma}_{300} \). The difference in CP for \( \widehat{\gamma}_{200} \) is largest when there is only a small number of measurements (I = 15) and a large number of studies (K = 30).

Table 4 The coverage proportion of \( {{\widehat{\gamma}}_{200 }} \) and \( {{\widehat{\gamma}}_{300 }} \) in the \( {\gamma_2}_{00 } \) = 2, \( {\gamma_3}_{00 } \) = 0.2, and \( \sigma_{u_2}^2 \) = 2 conditions for the constant external event effect over four subsequent measurement occasions

Variance components

In the three-level analyses, the between-study and between-case variances were estimated for both the immediate treatment effect and the treatment effect on the trend. Because variance estimates are expected to be positively skewed, due to the truncation of negative estimates to zero, we calculated the median (relative) deviation of the estimates from the population value, rather than the mean (relative) deviation, to evaluate the (relative) bias in the estimates. We discuss only the between-case and between-study variances of the immediate treatment effect (\( \sigma_{u_2}^2 \) and \( \sigma_{v_2}^2 \)), because similar results are obtained for the treatment effect on the time trend (\( \sigma_{u_3}^2 \) and \( \sigma_{v_3}^2 \)). The bias of the estimated between-study and between-case variances of the immediate effect is larger when only 10 studies and 15 measurement occasions are involved. The conditions with the largest relative bias all had 15 measurements, 4 participants, and a small between-study variance (\( \sigma_{v_2}^2 \) = 0.5) in common. If the effect sizes are corrected and we estimate the between-study variance of the immediate treatment effect, we find relative parameter bias values across conditions ranging from 17 % to 55 %, whereas the relative bias goes up to 313 % when the effect sizes are uncorrected. Similar results are found for \( \widehat{\sigma}_{u_2}^2 \), where the relative bias in a condition is at most 119 % for the corrected effect sizes and 326 % for the uncorrected effect sizes (see Table 5). Overall, the adjusted model results in less biased variance estimates.
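The criterion just described can be restated in code as well (again a sketch with our names):

```python
import numpy as np

def median_relative_bias(variance_estimates, true_variance):
    """Median relative deviation of the variance estimates; the median is used
    because truncating negative estimates to zero skews the distribution."""
    dev = (np.asarray(variance_estimates) - true_variance) / true_variance
    return float(np.median(dev))
```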

Table 5 Median relative deviation of the variance estimates of the immediate treatment effect in the \( \gamma_{200} \) = 2 and \( \gamma_{300} \) = 0.2 conditions for the constant external event effect over four subsequent measurement occasions

External event fades away gradually over four subsequent measurement occasions

Overall treatment effect

Bias

The bias of \( \widehat{\gamma}_{200} \) for the uncorrected and the corrected effect sizes is −0.0073 (p < .001) and 0.00057 (p = .74), respectively. This means that there is a small but significant negative bias for the uncorrected effect sizes, whereas this is not the case for the corrected effect sizes; the difference between the two models, however, is not significant, F(1, 102398) = 0.009, p = .77. The bias for \( \widehat{\gamma}_{300} \) depends largely on the model type, F(1, 102398) = 30,476.1, p < .0001. The bias for the uncorrected effect sizes is significant (−0.19), t(51199) = −246.23, p < .0001, whereas this is not the case for the corrected effect sizes (0.000179, p = .81). For both \( \widehat{\gamma}_{200} \) and \( \widehat{\gamma}_{300} \), the difference is largest when a small number of measurements (I = 15) is involved.

MSE

For both estimated treatment effects, the MSEs are larger for the uncorrected than for the corrected effect sizes (see Table 6). For both \( \widehat{\gamma}_{200} \) and \( \widehat{\gamma}_{300} \), the model type has a significant influence on the MSE: F(1, 102398) = 724.69, p < .0001 for \( \widehat{\gamma}_{200} \), and F(1, 102398) = 5,431.15, p < .0001 for \( \widehat{\gamma}_{300} \). For both estimated treatment effects, the MSE is large when the studies are heterogeneous (\( \sigma_{v_2}^2 \) = 2) and when a small number of measurement occasions (I = 15) and studies (K = 10) are used. The difference between the models is largest when a small number of measurements is used.

Table 6 MSE of \( {{\widehat{\gamma}}_{200 }} \) and \( {{\widehat{\gamma}}_{300 }} \) in the \( {\gamma_2}_{00 } \) = 2, \( {\gamma_3}_{00 } \) = 0.2, and \( \sigma_{u_2}^2 \) = 0.5 conditions for the external event effect fading away gradually over four subsequent measurement occasions

Estimates of the standard errors

For \( \widehat{\gamma}_{200} \), the average relative bias in the standard errors equals 0.02 for both the uncorrected and the corrected effect sizes.

Similar to the results for the constant external event effect, the difference between the average relative bias in the standard errors for the uncorrected effect sizes (M = 39.3) and the corrected effect sizes (0.06) for \( \widehat{\gamma}_{300} \) is large and statistically significant, F(1, 254) = 129.66, p = .0001 (see Table 7). The difference in results due to the model type is more pronounced when a small number of studies (K = 10) is involved.

Table 7 Difference between the median of the standard error estimates and the standard deviation of \( {{\widehat{\gamma}}_{300 }} \) in the \( {\gamma_2}_{00 } \) = 2, \( {\gamma_3}_{00 } \) = 0.2, and \( \sigma_{u_3}^2 \) = 0.05 conditions for the external event effect fading away gradually over four subsequent measurement occasions

Coverage proportion

Similar to the CP for the constant external event effect, the mean CPs for the uncorrected and the corrected effect sizes differ significantly at the 5 % significance level, for both \( \widehat{\gamma}_{200} \), F(1, 254) = 3.92, p = .05, and \( \widehat{\gamma}_{300} \), F(1, 254) = 3.25, p = .007. The conditions with CP values smaller than .93 all have 15 measurement occasions and a large between-study variance (\( \sigma_{v_3}^2 \) = 2.0) in common, and they occur when the effect sizes are uncorrected (for both \( \widehat{\gamma}_{200} \) and \( \widehat{\gamma}_{300} \)). Similar to the constant external event effect, the CP for \( \widehat{\gamma}_{300} \) is overestimated when the effect sizes are uncorrected and 30 measurement occasions are included. The difference between the corrected and uncorrected effect sizes is largest in the condition where I = 15 and \( \sigma_{v_3}^2 \) = 2.0.

Variance components

The results are similar to those for the constant external event effect, and the estimates are less biased when the adjusted model is used. We discuss only the estimated variances of the immediate treatment effect, because the results are similar for the estimated treatment effect on the trend. When we estimate the between-study variance and the effect sizes are uncorrected, the bias ranges from −0.002 to 3.41, whereas it ranges from 0.002 to 0.73 for the corrected effect sizes. So the estimated variances depend on the model type, F(1, 102398) = 1,631, p < .0001. Similar results are obtained for the estimate of the between-case variance. The maximum bias for the corrected effect sizes is 1.60, whereas it is 3.21 for the uncorrected effect sizes, and these estimates also depend on the model type, F(1, 102398) = 5,628.62, p < .0001.

Empirical illustration

In this section, we give empirical illustrations comparing the modified three-level model, in which external events are taken into account, with the uncorrected model. For this purpose, we used part of the meta-analytic data set of Heyvaert, Saenen, Maes, and Onghena (2012), in which restraint interventions for challenging behavior among persons with intellectual disabilities were investigated. We give two empirical illustrations of the consequences of ignoring an external event effect in a multiple-baseline across-participants design: first, the consequences of ignoring external events in a single study, and next, the consequences of ignoring external events in a three-level meta-analysis.

Ignoring external events in a single study

To illustrate the regression analysis of a multiple-baseline across-3-participants design, we use the study of Thompson, Iwata, Conners, and Roscoe (1999), which was included in the meta-analysis of Heyvaert et al. (2012). In that study, the effects of benign punishment on the self-injurious behavior of individuals who had been diagnosed with mental retardation were investigated. The 3 participants were measured repeatedly over 22 measurement occasions, and the intervention started in sessions 11, 13, and 20, respectively (see Fig. 2). From this figure, we might expect that there is an immediate reduction in challenging behavior when the treatment is introduced and that the effect of the treatment on the challenging behavior decreases over time (i.e., there is a positive effect on the time trend during the treatment). We also see that the 3 participants' scores on measurement occasions 4 and 10 are possibly influenced by an external event.

Fig. 2 Graphical display of a multiple-baseline across-3-participants design, using data from the study of Thompson, Iwata, Conners, and Roscoe (1999)

Results

If we ignore possible external events in the regression analysis before combining the effect sizes in the two-level meta-analysis, the average immediate treatment effect over the cases of this study equals −25.58, and the average treatment effect on the time trend equals −2.58. If we take the external event into account by correcting the effect sizes before combining them, the immediate treatment effect equals −23.23, and the treatment effect on the time trend equals 1.24. The estimated immediate treatment effect is thus 9.19 % smaller when the effect sizes are corrected. Moreover, the treatment effect on the time trend is positive for the corrected effect sizes, whereas it is negative for the uncorrected effect sizes, which means that the effect of the treatment decreases over time according to the corrected effect sizes, whereas it increases according to the uncorrected ones.

Ignoring external events in a three-level meta-analysis

The three-level analysis of SSED data includes summarizing the immediate treatment effect and the treatment effect on the time trend over participants and over studies.

We estimate the immediate treatment effect and the treatment effect on the time trend across seven studies. Again, we use the meta-analysis of Heyvaert et al. (2012), from which we randomly selected multiple-baseline across-participants studies. We combined the multiple-baseline across-participants studies of Lindberg, Iwata, and Kahng (1999), Chung and Cannella-Malone (2010), Zhou, Goff, and Iwata (2000), Thompson et al. (1999), Hanley, Iwata, Thompson, and Lindberg (2000), Rolider, Williams, Cummings, and Van Houten (1991), and Roscoe, Iwata, and Goh (1998). In all of these studies, the same dependent variable was measured, namely, the reduction in self-injurious behavior. Again, we compare the three-level meta-analyses of the uncorrected and the corrected effect sizes.

Results

With the uncorrected effect sizes in the three-level meta-analysis, the overall immediate treatment effect equals −33.14, t(6.39) = −3.44, p = .012, and the overall treatment effect on the time trend equals −4.42, t(3.95) = −1.52, p = .19. When the effect sizes are corrected before the effects are estimated over participants and studies, the immediate treatment effect equals −21.07, t(6.88) = −1.13, p = .30, and the treatment effect on the time trend equals −0.43, t(1) = −0.28, p = .83. This means that the immediate treatment effect is 36.42 % smaller for the corrected than for the uncorrected effect sizes, and the treatment effect on the time trend is 90.27 % smaller.

This is consistent with the results of the simulation study, where we found that the estimated treatment effects are biased when the effect sizes are uncorrected before combining them in the three-level meta-analysis.

Discussion

External event effects are common in SSEDs because single-case researchers often implement these kinds of designs in everyday scenarios where they cannot control for outside factors (Christ, 2007; Kratochwill et al., 2010; Shadish et al., 2002). External events are not always anticipated by researchers, and thus, they may not be measured during the conduct of the study. Furthermore, the size of an event effect may be small, and researchers may be unaware of it even after the study has been completed. Whether researchers recognize an external event or not, the failure to account for the event in a meta-analysis can bias the estimate of the treatment effect. Thus, we searched for a method with which to model external events that could be applied even when the events had not been previously identified. Because we used a multiple-baseline across-participants design, there was a need to take into account the interdependence of the participants. Therefore, an external event that influenced the scores of 1 participant was assumed to influence the scores of the other participants in the same study.

We discussed two possible scenarios. In the first scenario, the external event effect remains constant and influences the scores of all participants within a study on four subsequent moments. This occurs, for example, when a teacher is ill and a substitute teacher takes over the classroom or when a foreign observer is present on subsequent measurement occasions. In the second scenario, the external event effect gradually fades away over four subsequent moments; for instance, the influence of a teacher intern on the behavior of students is reduced over time. Moreover, the model adjusted for external event effects takes into account that measurement occasions closer together in time are more related than measurement occasions further apart in time.

We evaluated this approach using a large simulation study and gave some empirical examples. If there is an external event effect of zero, both models (the one that corrects for moment effects and the one that does not) are appropriate. If the external event influences subsequent scores for all the participants within a study, the three-level approach for uncorrected effect sizes is not recommended, because the estimates of both treatment effects (i.e., immediate effect on level and effect on time trend) are substantially biased. The MSE, standard error, and CP are better estimated when the modified model, which includes moment effects, is used. The difference between the corrected and uncorrected effect sizes is largest when there are a small number of studies and measurement occasions, so in this context, we advise using the adjusted model. Moreover, the adjusted model results in less biased variance estimates.

But, of course, we should be aware of some limitations. We assumed that all the participants within a study are influenced the same way by the external event effect. It is possible that different participants from the same study are at separate locations and, therefore, are not all influenced by the external event. Modeling event effects that are not common to all participants in a study is an important avenue for future research. We chose to keep the number of measurements within a study constant for all participants within the same study. Of course, it is possible that different participants of the same study have different series lengths. Furthermore, we cannot generalize these results to other conditions not involved in this simulation study, but we partially addressed this by simulating a large number of conditions and choosing realistic values for the parameters.

Another limitation is that we assumed linear trajectories in the treatment phase, which might not be true in some real situations. To simplify the simulation model, we further did not account for a possible dependence between regression coefficients, which can be accounted for in a multilevel analysis by estimating the covariance at the various levels.

In addition, subjects in MBDs are measured repeatedly, and succeeding measurements may be more related to each other than measurements further apart in time. We did not account for this possible autocorrelation and suggest this as a useful extension of the present study.

Kazdin (2010) argued that there needs to be a minimum of three measurement occasions between the participants in an MBD in order to show an experimental effect. We did not take this into account in the condition with 15 measurement occasions, because it was not possible to do so and still provide each of the 7 participants with a unique baseline length. We could alter the intervention schedule to introduce the treatment for some participants (e.g., randomly selected pairs) at the same moment. Examining this strategy specifically, and alternative intervention schedules more generally, would allow further research to extend the results to a wider range of multiple-baseline applications.

It can be difficult to attribute simultaneous unusual outcome scores for all participants within a study to an external event effect. If there is no external event effect, we can still use the corrected model, because both the corrected and the uncorrected effect sizes will then be unbiased; thus, there is no need to determine before the analysis whether an external event effect occurred. We advise single-case researchers to use both models in a sensitivity analysis and then decide which model to report. If researchers are interested in the occurrence of external event effects, we recommend that they keep a log in order to identify potential outside factors that may influence the scores at certain measurement occasions and include dummy indicator variables at least for these moments.

The extension of the three-level model for multiple-baseline across-participants designs to include modeling of potential external effects makes it even more appropriate and useful for the analysis of realistic SSED data sets. This study has indicated that the three-level model corrected for external event effects provides better results than does the uncorrected model for combining results from multiple-baseline across-participants data, especially if there is only a small number of observations (I = 15) and a small number of studies (K = 10) in the synthesis. As was found here, even when an external event effect is small, a failure to correct for it can lead to biased effect sizes. Thus, applied SSED researchers are encouraged to consider use of the three-level model that corrects for external event effects when synthesizing results of MBD data.