Author manuscript; available in PMC: 2020 Oct 2.
Published in final edited form as: Aphasiology. 2018 Mar 21;33(1):1–30. doi: 10.1080/02687038.2018.1454884

Statistical analysis in Small-N Designs: using linear mixed-effects modeling for evaluating intervention effectiveness

Robert W Wiley 1, Brenda Rapp 1
PMCID: PMC7531584  NIHMSID: NIHMS1506946  PMID: 33012945

Abstract

Background:

Advances in statistical methods and computing power have led to a renewed interest in addressing the statistical analysis challenges posed by Small-N Designs (SND). Linear mixed-effects modeling (LMEM) is a multiple regression technique that is flexible and suitable for SND and can provide standardized effect sizes and measures of statistical significance.

Aims:

Our primary goals are to: 1) explain LMEM at the conceptual level, situating it in the context of treatment studies, and 2) provide practical guidance for implementing LMEM in repeated measures SND.

Methods & procedures:

We illustrate an LMEM analysis, presenting data from a longitudinal training study of five individuals with acquired dysgraphia, analyzing both binomial (accuracy) and continuous (reaction time) repeated measurements.

Outcomes & results:

The LMEM analysis reveals that both spelling accuracy and reaction time improved and, for accuracy, improved significantly more quickly under a training schedule with distributed, compared to clustered, practice. We present guidance on obtaining and interpreting various effect sizes and measures of statistical significance from LMEM, and include a simulation study comparing two p-value methods for generalized LMEM.

Conclusion:

We provide a strong case for the application of LMEM to the analysis of training studies as a preferable alternative to visual analysis or other statistical techniques. When applied to a treatment dataset, the approach held up under the extreme conditions of small numbers of individuals, with repeated measures training data for both continuous (reaction time) and binomially distributed (accuracy) dependent measures. The approach provides standardized measures of effect sizes that are obtained through readily available and well-supported statistical packages, and provides statistically rigorous estimates of the expected average effect size of training effects, taking into account variability across both items and individuals.

Keywords: Small-N designs, mixed-effects, treatment study, dysgraphia, tutorial

Introduction

The analysis of single-case or Small-N Designs (what we will refer to as SND) has long been a controversial topic. Debate over what constitutes an appropriate statistical approach has gone as far as questioning whether statistics should be used at all (Evans et al., 2014), or only as support for visual analysis (Campbell, Herzinger, & Gast, 2010; Lane & Gast, 2014; but also see Davis et al., 2013). Renewed interest in statistical approaches has been driven in part by advances in statistical computing power enabling the development of new techniques for analyzing SND. Over the last several years, many articles and special issues have been dedicated to this topic in journals in neuropsychology and education, including Neuropsychological Rehabilitation (Evans et al., 2014), Journal of Behavioral Education (Burns, 2012), Journal of Counseling and Development (Lenz, 2015), and Journal of School Psychology (Shadish, 2014).

Various statistical approaches have been proposed for evaluating a single case or a small number of cases, both for comparing cases to controls and for within-case studies. These approaches can be categorized into regression and non-regression approaches. The latter category includes but is not limited to: obtaining standardized measures of effect size (e.g., Campbell, 2004; Hedges, Pustejovsky, & Shadish, 2012; Parker, Vannest, & Davis, 2011), simulation modeling analysis (Borckardt & Nash, 2014), modified t-tests (Corballis, 2009; Crawford & Howell, 1998), and ANOVAs (e.g., Crawford, Garthwaite, & Howell, 2009; Mycroft, Mitchell, & Kay, 2002). In this paper, we present an application of linear mixed-effects modeling (LMEM), a type of multiple regression which is well-documented in the statistical literature and is increasingly investigated for its applicability to the designs common in neuropsychological training studies (e.g., Baek et al., 2014; Huber, Klein, Moeller, & Willmes, 2015; Moeyaert, Ugille, Ferron, Beretvas, & Den Noortgate, 2014; Owens & Ferron, 2012; Swaminathan, Rogers, Horner, Sugai, & Smolkowski, 2014; Shadish, Kyse, & Rindskopf, 2013; Shadish, Zuur, & Sullivan, 2014; Van Den Noortgate & Onghena, 2003). We argue that LMEM is well suited to the analysis of repeated measures data where single individuals serve as their own controls. LMEM can be applied to training data to determine whether training has had the desired effect and provides standardized measures of effect size. To provide a concrete demonstration of the validity and usefulness of LMEM to SND, we apply it to a dataset from a training study of five individuals with acquired dysgraphia. We illustrate a “trial-based” approach, wherein the individual measurements taken throughout the training study are modeled as longitudinal repeated measures, as opposed to the more common approach of aggregating measurements taken within a time point and evaluating the data as a time series.1

The primary goal of this paper is to introduce clinicians and aphasia researchers to LMEM and demonstrate its potential for analyzing the results of treatment studies that involve repeated measures across time. As will be demonstrated, LMEM has a number of properties (such as handling missing data and incorporating multiple, simultaneous estimates of random-effects) that offer advantages over other approaches like ANOVA or ordinary least squares (OLS) regression. The first section provides a conceptual overview of LMEM and an explanation of terminology, within the context of treatment studies. We then present a brief review of the existing literature on the use of LMEM specifically in SND. This review highlights the considerable work that has already been done showing how LMEM applies to SND. What is novel about this review is that, in addition to being written for an audience that does not specialize in statistics, we focus on repeated measures analysis and on accuracy data, which are more common in treatment studies but less researched in the statistical literature.

The second section provides a practical tutorial on conducting a repeated measures LMEM analysis of an SND, using the free statistical software R (R Core Team, 2016). We walk through the analysis and interpretation of results of a new treatment study of five individuals with dysgraphia. We also report (Appendix 2) the results of a simulation study investigating Type I error inflation in LMEM used to analyze accuracy (binomial) data, and present different options for obtaining p-values and measures of effect size for treatment effects.

Mixed-effects models: terminology and concepts

Fixed, mixed, and random-effects

“Mixed-effects”, “multilevel”, and “hierarchical” all refer to regression models in which the probabilities of a model’s parameters (e.g., the regression coefficients) are modeled at multiple levels, allowing for testing hypotheses not otherwise possible using OLS regression (see Dedrick et al., 2009, for a review). The term “mixed-effects” arises from the fact that these models include both fixed and random-effects. The determination of whether an effect should be modeled as fixed or random is somewhat contentious (see Gelman & Hill, 2006). For the purposes of understanding LMEM as it applies to analyzing SND, we adopt the pragmatic approach of considering the impact that including an effect as either fixed or random has on the interpretation of the results.

As an example, consider a study in which a clinician working with individuals with acquired dysgraphia launches a training program. Individuals are recruited from several clinics and are given a spelling app to practice spelling on a tablet device at home. Each individual practices 20 different words, and after every hour of practice the app administers a spelling test on those words. The app provides repeated measures data to the clinician: each individual’s performance on the spelling test at multiple time points. One question the clinician has is whether the individuals show significant improvement across the first 10 administrations of the spelling test (corresponding to the first 10 hours of practice). Thus, for each individual there are 200 observations (20 words × 10 time points).

The data could be modeled using OLS regression, where the dependent variable is accuracy on the spelling test items and the independent variable is the time point (1–10). Formula 1 (Table 1) represents a fixed-effects-only model. Here, y is the score on the spelling test (the dependent variable), which the model attempts to predict via the variables on the other side of the equation. Specifically, a is some constant (the intercept), b is the coefficient of interest (the slope), x is the time point (the independent variable), and ϵ is error (the difference between the predicted and the observed values of the dependent variable). Both a and b are fixed-effects—the model solution will find the one value for a and the one value for b that together minimize the error in predicting y, given different values of x. For example, the researchers might find that the best-fitting b = 2, interpreted as an increase of two points in accuracy from one administration of the spelling test to the next. These coefficients are fixed-effects because they are “fixed” to be the same for all “units”, meaning the effect of the dysgraphia treatment is predicted to be the same for all individuals, across all items on the spelling test, and across the different clinics from which the individuals were recruited.

Table 1.

Formulas reflecting: (1) a fixed-effects-only regression model, (2) the first level of a mixed-effects regression model, and (3a, 3b) individual coefficients for a random intercept and a random slope modeled at the second level of a mixed-effects regression model.

Formula 1: y = a + bx + ϵ. Fixed-effects-only: estimates a single value for each coefficient (a and b)
Formula 2: yj = αj + βjx + ϵ. Mixed-effects (1st level): estimates a value for each indexed coefficient (αj and βj) specific to each unit j (e.g., participant, item, etc.)
Formula 3a: αj = a + θj. Mixed-effects (2nd level): unit-specific coefficient (random intercept αj) includes an estimated fixed-effect (intercept a) plus a unit-specific random-effect (θj)
Formula 3b: βj = b + ηj. Mixed-effects (2nd level): unit-specific coefficient (random slope βj) includes an estimated fixed-effect (slope b) plus a unit-specific random-effect (ηj)

However, the researchers may have reasons to believe that the outcome of the intervention may vary across units, such as individuals or clinics. Formula 2 (Table 1) corresponds to a mixed-effects model; the coefficients αj and βj are modeled at two levels, the first of which incorporates variability across individuals. In this mixed-effects regression, at the first level, the subscript j indexes the different individuals included in the study. Within this formula, we count two random-effects “by-individuals”: an intercept αj and a coefficient βj. Each of these will be calculated for each individual, 1 through j. Like the model in Formula 1, this model hypothesizes that the spelling test scores can be predicted by a line with an intercept and a slope. However, Formula 2 additionally hypothesizes that this line may be different for each individual, j. This is accomplished because the parameters αj and βj are determined by further equations, namely the second level of the mixed-effects regression (Formulas 3a and 3b, Table 1).

The equations in Formula 3a and 3b reveal why these models are called mixed: they contain both fixed-effects and random-effects. For example, in Formula 3b, the coefficient b is the fixed-effect of time point (the slope for time point, as in the fixed-effects-only regression, Formula 1), and will be the same value for all individuals. The coefficient ηj is the random slope of time point and is specific to each individual (in mixed-effects terminology, the “random slope of time point, by-individuals”). The effect of time point is calculated for each individual as the sum of two values: the fixed-effect b plus the random-effect, ηj, which results in a different value for each individual. The same logic applies to the equivalent formula for the parameter αj reflecting the fixed and random-effects related to the intercept term.
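To make this concrete, these formulas map directly onto model syntax in the lme4 package for R (used later in this paper). The sketch below is purely illustrative: it assumes a hypothetical data frame app_data with columns score, timepoint, and individual, rather than the models fitted later in the paper.

library(lme4)
# Formula 1 (fixed-effects only): one intercept and one slope shared by everyone
m_fixed <- lm(score ~ timepoint, data = app_data)
# Formulas 2-3 (mixed-effects): a fixed intercept and slope plus by-individual
# random deviations (a random intercept and a random slope of timepoint)
m_mixed <- lmer(score ~ timepoint + (1 + timepoint | individual), data = app_data)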

Fixed-effect coefficients (e.g., a, b) indicate the effect size for the average unit, which in this example corresponds to how much accuracy improves per hour of practice. The difference between the estimate of b in Formula 1 vs. Formula 3 is that the latter takes into account variability attributable to individual differences. The purpose of including random-effects is, in this sense, similar to including additional “control” predictors that researchers believe could affect outcome. However, a key strength of random-effects is that they explain variance that can be attributed to dimensions of the data that are otherwise unexplained, without positing any specific hypotheses about how the units vary. For example, including sex as a control variable entails a specific hypothesis: that males will have lower values than females on the dependent variable, or vice versa. Including individuals as a control variable would similarly entail a specific hypothesis test—however, researchers seldom have hypotheses about how specific individuals will vary from one another. If they do, this means they have knowledge of some dimension(s) on which these individuals differ, and they would likely instead include those underlying dimensions as control variables.

Explanatory power of mixed-effects models

How can adding random-effects to a model change the inferences drawn from an analysis? Consider a hypothetical scenario in which the individuals recruited from one clinic showed faster improvement. To illustrate this, we created an artificial dataset in which individuals were recruited from five clinics. This data set was designed such that, collapsing across all five clinics, the correlation between time point and score on the spelling test is 0.20. However, the correlation is 0.91 averaging across individuals only from Clinic A, but in the other four clinics only 0.04. If the researchers ignored the testing site variability and modeled this data using OLS regression as in Formula 1, they would find that improvement on spelling accuracy across the first 10 administrations of the spelling test was significant, at p = 0.02. If instead the researchers adopted a mixed-effects approach and used Formula 2 with random-effects by-clinics, they would find that the relationship between time point and test score was not significant, at p = 0.35. This occurs because the correlation of 0.20 found when looking across all individuals can primarily be explained as an effect of Clinic A—this is reflected by the amount of variance explained in the mixed-effects approach, where only 1.4% is explained by the fixed-effect of time point, but an additional 67.5% is explained by the random slope for time point by-clinics. That is, the average individual’s score is poorly explained by the number of hours they used the app, but is quite well explained by the clinic. Although this is a hypothetical scenario, it is not far-fetched. Such a result could occur if there was a variable associated with the clinics that was not included in the regression model, but which was correlated with the outcomes. For example, perhaps Clinic A is associated with higher education levels or milder deficits relative to the other clinics. Ideally the researchers in this scenario would include variables like these as additional (fixed-effect) predictors, but it is not always possible to identify and quantify all such variables. This highlights an advantage of mixed-effects models: if there are unknown or otherwise unquantifiable sources of variance that are related to a unit (like clinic), then this variance can be taken into account by the random-effects structure.
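In R, the two analyses contrasted in this scenario could be specified as follows; this is a sketch only, assuming a hypothetical data frame clinic_data with columns score, timepoint, and clinic (lme4 loaded as above).

# OLS regression ignoring clinic, versus a mixed-effects model with a random
# intercept and a random slope of timepoint by-clinics
m_ols   <- lm(score ~ timepoint, data = clinic_data)
m_mixed <- lmer(score ~ timepoint + (1 + timepoint | clinic), data = clinic_data)
summary(m_ols)    # effect of timepoint estimated while ignoring clinic
summary(m_mixed)  # effect of timepoint estimated after accounting for by-clinic variability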

The approach that we adopt in this paper is specifically appropriate to training studies in the neuropsychological context. We examine the small group (N = 5) situation, with random-effects by-items and by-individuals. Including these random-effects increases our confidence that treatment effects are not due to any subset of the items trained or individuals enrolled in the study.

Nested versus crossed random-effects

In the actual analysis we present in this paper, we use the approach of crossed random-effects by-items and by-individuals, as opposed to nested. To illustrate how nested and crossed random-effects differ, we consider a further scenario in our hypothetical study. The researchers include random-effects both by-individuals and by-items, such that instead of estimating a coefficient for each individual, βj (Formula 2, Table 1), a coefficient is estimated for each individual-item combination, βjk (where k indexes items). Items in this example correspond to the words included on the spelling test. If items are nested within individuals, then each βjk reflects how each item differs from the others within each individual. If items are crossed with individuals, then each βjk reflects how each item differs from the others across all individuals, in addition to how each individual differs from the others. The decision to use nested or crossed random-effects should be theoretically motivated, but essentially hinges on assumptions about what the units do or do not share. Nesting is appropriate when the lower-level units are unique to each higher-level unit, so that a unit observed with one individual has nothing in common with the units observed with another. This nested relationship is not appropriate for all units, such as linguistic items like words, where in fact we expect individuals to share common knowledge of words. This motivates a statistical model in which there are crossed random-effects by-items and by-individuals.2
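In lme4 syntax, the two structures are specified differently. The sketch below uses placeholder names (accuracy, timepoint, individual, item) and assumes the hypothetical app_data frame also contains a binary accuracy column, so a logistic (binomial) model is fitted with glmer.

# crossed random-effects: item effects are estimated across all individuals
m_crossed <- glmer(accuracy ~ timepoint + (1 | individual) + (1 | item),
                   data = app_data, family = binomial)
# nested random-effects: a separate set of item effects within each individual
m_nested <- glmer(accuracy ~ timepoint + (1 | individual/item),
                  data = app_data, family = binomial)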

Small-N designs and LMEM

There are many potential applications of LMEM to treatment studies and, accordingly, researchers have investigated different aspects of the approach that are relevant to SND: necessary sample sizes (Raudenbush & Bryk, 2002; Hox, 1998; Ferron et al., 2009; Maas & Hox, 2005), statistical power and accuracy of estimates (Ferron, Farmer, & Owens, 2010; Heyvaert et al., 2017; Baayen, Davidson, & Bates, 2008; Barr, Levy, Scheepers, & Tily, 2013; Huber et al., 2015; Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017), meta-analysis of treatment studies (Owens & Ferron, 2012; Ugille et al., 2012; Van den Noortgate & Onghena, 2003, 2008; Moeyaert, Ferron, et al., 2014; Moeyaert, Ugille, Ferron, et al., 2014), and how to customize the models to different types of experimental designs (Baek et al., 2014; Moeyaert, Ugille, Ferron, et al., 2014; Moeyaert et al., 2015; Shadish et al., 2013, 2014). The majority of this research has focused on the analysis of continuous outcome measures (e.g., reaction times), as opposed to count or binomial data (e.g., accuracy) where logistic regression is needed. This research has also typically assumed that the data will be modeled as a time series, where a single measure is obtained for each time point. Importantly, the LMEM approach that we present does not treat the data as a time series, but rather as longitudinal repeated measures, just as in the example of the spelling app. The difference lies in the fact that the repeated measures design provides multiple observations at each time point. This distinction cannot be over-emphasized: we are advocating that for SND in which individuals are tested on multiple items within a training session, performance on the items should not be averaged together. This allows for an items-based, trial-by-trial analysis in which the regression model attempts to predict the responses on each trial of each item. Not averaging items together increases the number of observations, maintains information about variance across items, and allows the researcher to control for variables related to the items. For example, in the spelling app study, a fixed-effect of word length (or its interaction with time point) could be included to test whether spelling accuracy differed across words, or improved more quickly for some words than others.

Repeated measures LMEM guidelines

Summarizing the literature on LMEM in SND, we identify guidelines for using the approach to quantify treatment study results. Rather than an exhaustive review, we focus specifically on using LMEM in longitudinal repeated measures designs, with the goal of obtaining an effect size for the effect of treatment.

Sample size

A distinction is made between sample sizes at the different levels of an LMEM. In the spelling app example, the first-level sample size refers to the number of observations from each individual; the second-level sample size is the number of individuals, if there is a random-effect for individuals. Based on the literature (e.g., Raudenbush & Bryk, 2002; Hox, 1998; Ferron et al., 2009; Maas & Hox, 2005), an adequate sample size for the second level is as few as five units, if the question of interest relates to the fixed-effects. This means that data from five individuals are sufficient to estimate the treatment effects (beta coefficients of the fixed-effects). However, the estimated variance across individuals (the random-effects estimates) will be susceptible to inflated Type I error.3 Those interested in estimating how much treatment effects vary across individuals, as opposed to the effect size for the average individual, are thus encouraged to collect more data (Ferron, Farmer, & Owens, 2010). Regarding the sample size at the first level, there are no constraints particular to the LMEM approach—if there are enough observations for an OLS regression then there are enough for LMEM. In general, the more observations the better, and, importantly, there is evidence that a larger sample size at the lowest level can somewhat compensate for smaller sample sizes at the higher levels (Maas & Hox, 2005).

Autocorrelation

Serial dependency, or autocorrelation, is a problem in regression analysis when the residual errors (the ϵ in Formulas 1 and 2, the difference between the predicted and the observed values) violate the assumption of independence.4 Autocorrelation means the residual errors for observations at time t are correlated with the residual errors at some other time, such as t-1. This can lead to inflated Type I error. Autocorrelation is not only a problem for LMEM or regression but it has even been found to influence visual analysis of time series (Matyas & Greenwood, 1996). Multiple methods can address autocorrelation; arguably the best approach is to identify the underlying cause(s) and add predictors to remove it (Hanke & Wichern, 2005). Although autocorrelation can arise whenever observations are ordered, it is prominently a concern for time series analyses. We refer to other researchers for methods for identifying and addressing autocorrelation in LMEM (see e.g., Baek & Ferron, 2013; Davis et al., 2013; Howard, Best, & Nickels, 2015; Swaminathan et al., 2014). However, it should be noted that the primary questions of interest in treatment studies (e.g., whether or not treatment was effective for the group) are likely to be addressed by the fixed-effects, and importantly, estimates of the fixed-effects will be unbiased even without explicitly incorporating autocorrelation estimates into the LMEM (Shadish et al., 2013; Singer & Willett, 2003).
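A simple, informal diagnostic is to inspect the autocorrelation of a fitted model's residuals; the sketch below assumes an already-fitted lmer or glmer model (here called m_mixed) and is a rough check rather than a formal test.

# rough check for serial dependency: large spikes at non-zero lags in the
# residual autocorrelation function suggest autocorrelation worth addressing
acf(resid(m_mixed), main = "Residual autocorrelation")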

Determining random-effects structure

One of the challenges with LMEM is determining the optimal random-effects structure. Researchers familiar with multiple regression will recognize the general challenge of determining which variables to include in any model; the inclusion of random-effects adds to this challenge. First, one must decide which units should be included as random-effects—while items and individuals are the most common, one is not limited to those (sites, clinicians, etc.). Second, one must consider which random intercepts and slopes to include for those units. Much attention has been paid to this difficulty in the psycholinguistics literature on mixed-effects (see Baayen et al., 2008; Barr et al., 2013; Matuschek et al., 2017). Barr et al. (2013) argued for the “maximal structure”: random intercepts by-items and by-individuals as well as random slopes by-items and by-individuals for each of the fixed-effects.5 However, the maximal structure is not always possible for practical reasons, chief among them a limited number of observations. The minimum number of observations required is the sample size at the higher level multiplied by the number of random-effects. For example, including one random intercept and one random slope by-items, in a study with 40 items, requires at least 80 observations.

Using model simulations, Matuschek et al. (2017) reported that the minimal Type I error rate achieved by the maximal structure comes at the cost of statistical power, and is likely to be overly conservative. Therefore, it is recommended to use a reduced random-effects structure, which can be selected through model comparison, such as by the Akaike Information Criterion (AIC; Sakamoto, Ishiguro, & Kitagawa, 1986) or likelihood ratio tests, but the starting point should be theoretical considerations. Just as the researcher must decide what variables to include as fixed-effects based on their research question and experimental design, the decision of which random-effects to include should be based on which are thought to vary across units. In the case of repeated measures SND, the most critical random-effects are going to be: (1) those related directly to the experimental conditions (e.g., treatment conditions), and (2) those related to time (e.g., session number)—we suggest that random slopes for treatment conditions and time variables be included both by-individuals and by-items. In the spelling app example, this would mean including a random slope of time point both by-individuals and by-items.
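Such comparisons are straightforward in R. The sketch below assumes two hypothetical models of the spelling app data that differ only in their random-effects structure (a maximal versus a reduced structure; placeholder names as before).

# compare a maximal and a reduced random-effects structure
m_maximal <- lmer(score ~ timepoint + (1 + timepoint | individual) + (1 + timepoint | item),
                  data = app_data)
m_reduced <- lmer(score ~ timepoint + (1 + timepoint | individual) + (1 | item),
                  data = app_data)
AIC(m_maximal, m_reduced)    # lower AIC indicates the better trade-off of fit and complexity
anova(m_maximal, m_reduced)  # likelihood ratio test comparing the two structures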

Effect sizes and p-values/confidence intervals

One advantage of LMEM is the ease of obtaining standardized effect sizes—the beta coefficients from the fitted models correspond to effect sizes. In logistic regression, used to analyze binomial data, the beta coefficients are reported in log-odds, which can readily be converted into changes in the probability of a correct response. For this reason, LMEM is useful for meta-analysis or otherwise combining results across multiple studies (see Owens & Ferron, 2012; Ugille et al., 2012; Van den Noortgate & Onghena, 2003, 2008; Moeyaert, Ugille, Ferron, et al., 2014; Moeyaert et al., 2015).

Obtaining p-values is less straightforward and somewhat controversial (Kuffner & Walker, 2017). Specifically for LMEM, the challenge lies in determining the degrees of freedom, which is complicated by the random-effects. There are different formulas for this calculation, including the Kenward-Roger and Satterthwaite approximations, or the Wald-Z test for logistic regression. The first has been shown to provide accurate results for SND (Ferron et al., 2009; Heyvaert et al., 2017), and is available in most statistical software. Another option is the likelihood ratio test (LRT), which provides a Chi-square measure of goodness of fit when comparing two models. While it has been suggested that the LRT may be preferable in SND (see Moeyaert et al., 2014), we conducted a simulation study to compare the performance of the LRT and Wald-Z tests in logistic LMEM, and report the novel finding that the LRT better controls Type I error with a sample size of 5, as in the study we report (Appendix 2).
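In practice, an LRT p-value for a single fixed-effect can be obtained by comparing the full model against a model with that predictor removed. The sketch below assumes a hypothetical logistic LMEM of the spelling app data (placeholder names as before).

# LRT p-value for the fixed-effect of timepoint in a logistic LMEM
m_full <- glmer(accuracy ~ timepoint + (1 + timepoint | individual) + (1 | item),
                data = app_data, family = binomial)
m_no_timepoint <- update(m_full, . ~ . - timepoint)  # same model without the fixed-effect
anova(m_full, m_no_timepoint)  # the Chi-square p-value here is the LRT p-value for timepoint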

Dysgraphia treatment: analysis and results

In the introduction, we presented a hypothetical dysgraphia study to situate the LMEM approach in the context of small-N training studies. Next, we present new results of an actual dysgraphia treatment study. In the hypothetical study, the question was whether practice with a spelling app improved accuracy. Time point (administrations 1–10 of the test) was the independent variable, and was modeled as a fixed-effect, with its effect size (beta coefficient) and its p-value the parameters of interest. We discussed including multiple random-effects to account for variability in outcomes due to factors related to differences across items and individuals.

In the actual dysgraphia study, participants (N = 5) underwent spelling treatment for a number of sessions, under two different training schedules (manipulated within-participants). The goal, as in the hypothetical study, was to assess the amount of improvement in spelling ability per training session. However, the actual study additionally asks if this improvement differed for the two training schedules.

To demonstrate the use of LMEM, we present the results of two analyses and describe how to interpret the LMEM output and effect sizes. Analysis 1 examines the improvement in spelling accuracy attributed to the treatment, and also the difference between two treatment conditions. Analysis 2 examines the improvement in reaction time. A tutorial with full explanation for conducting these analyses using the software R (R Core Team, 2016) appears in Appendix 1.

We note that the approach demonstrated here is readily applied to the analysis of single case studies by simply removing the random-effects by-individuals but keeping the random-effects by-items. As long as the individual has been administered at least five items, a mixed-effects analysis is feasible. This highlights another benefit of analyzing the data as repeated measures (i.e., trial-by-trial) as opposed to as a time series: if the study includes fewer than five individuals and only one observation per time point, a mixed-effects analysis is not feasible.
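For a single case, the model sketched earlier would simply drop the by-individuals terms; the code below is again illustrative only, assuming a data frame holding the trials from one individual.

# single-case version: keep only the by-items random-effects
m_single <- glmer(accuracy ~ timepoint + (1 + timepoint | item),
                  data = one_individual_data, family = binomial)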

Study rationale

The goal was to determine if there was a significant effect of training schedule (the spacing of learning trials) on the relearning of word spellings. There is an extensive literature with neurotypical adults examining the effects of the spacing of learning trials. Results very consistently show that distributing training trials across a session (or distributing training sessions across a training period) produces superior learning relative to massing training trials within a session (or massing training sessions across a training period) (see Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006 for a meta-analysis). Somewhat surprisingly, this issue has received scant attention in research on treatment of language disorders, though it is sometimes related to the more researched question of training intensity (e.g., Sage, Snell, & Lambon Ralph, 2011). Recently, Middleton, Schwartz, Rawson, Traut, and Verkuilen (2016) examined the effects of massed vs. distributed scheduling of training trials within sessions for individuals receiving treatment for naming deficits. Their findings, consistent with those reported for neurotypical individuals, showed clear superiority for distributed over massed schedules. Here, we evaluate two schedules that varied the spacing of the training trials: distributed and clustered schedules, with the latter a type of massed spacing. This manipulation allowed us to compare “bursts” of intense training across the training period with more regular, but less intense, repetition. More details are provided below.

Participants

Five college-educated individuals (2 females) ages 45–80 with acquired dysgraphia resulting from a single left hemisphere stroke participated in the study. They were, on average, 4 years post-stroke at enrollment and had no pre-morbid history of spelling or reading difficulties. See Table 2 for demographic information and language and cognitive testing results. It can be seen that there was considerable variability in performance levels for most language measures: word and pseudoword spelling, oral reading, and spoken picture naming. However, word comprehension was excellent for all but KST and, in terms of working memory, all had normal visual-spatial spans. Regarding their specific dysgraphia deficits, 2 had only orthographic working memory deficits, 1 had only an orthographic long-term memory deficit, and 2 had mixed deficits (see Buchwald & Rapp, 2009, for information on these dysgraphia deficit types). The same treatment approach was used with all participants, given previous research showing its effectiveness across these deficit types (Rapp, 2005; Rapp & Kane, 2002).

Table 2.

Demographic information and results of language and cognitive testing for the five individuals (initials: DTE, KST, MSO, PQS, RHN). O-WM = orthographic working memory; O-LTM = orthographic long-term memory. JHU Word Length List and JHU Pseudoword List (Johns Hopkins University Dysgraphia Battery; Goodman & Caramazza, 1985); PALPA 35 (Psycholinguistic Assessments of Language Processing in Aphasia Task 35; Kay, Lesser, & Coltheart, 1992); NNB (Northwestern Naming Battery; Thompson et al., 2012); PPVT (Peabody Picture Vocabulary Test; Dunn & Dunn, 1981); Corsi spatial span (Mueller & Piper, 2014).

Individual DTE KST MSO PQS RHN
Age 80 61 45 54 75
Sex F M M M F
Handedness R R R R L
Education (years) 18 14 18 18 19
Lesion Location L frontal/parietal L frontal/parietal L frontal/temporal L frontal/parietal L posterior frontal
Dysgraphia deficit type/s O-WM Mixed Mixed O-WM O-LTM
Number of training sessions 40 29 29 17 16
Number of training trials per item 18 16 20 13 14
Spelling: JHU Word Length List (letter accuracy /336) 59% 82% 33% 81% 89%
Spelling: JHU Pseudoword List (letter accuracy /99) 66% 64% 19% 83% 92%
Auditory word comprehension: PPVT 91st percentile 10th percentile 50th percentile 99th percentile 73rd percentile
Oral reading: PALPA 35 (/60) 90% 55% 85% 68% 100%
Spoken picture naming: NNB (/66) 94% 56% 83% 97% 98%
Forward digit span 5 3 3 4 6
Corsi spatial span 5 5 5 5 5

Methods

There were 5 study phases: pre-training evaluation, spelling training, post-training evaluation, 3 month waiting period, and follow-up evaluation. Here, we analyze data collected during the spelling training phase. Forty training words were selected for each individual such that baseline letter accuracy on each word was between 25% and 85%. A spell-study-spell technique (Beeson, 1999; Rapp & Kane, 2002) was administered for approximately 90 min, typically 2×/week for an average of 13 weeks per individual. Each training trial was as follows: (1) the individual heard a target word, repeated it, and attempted to write the spelling. The accuracy and RT data for this response served as the data analyzed in this paper. (2) Regardless of accuracy, the individual was shown the correct spelling while the experimenter said aloud the word’s letters. The individual copied the word once. (3) If the word was spelled correctly at Step 1, then Step 3 was omitted and the experimenter continued to the next item; otherwise, the word was removed from view and the individual was asked to spell it. Steps 2 and 3 were repeated until the word was spelled correctly or for a maximum of 3 times before moving to the next item.

Each individual had one word set for each training schedule (Figure 1). In the distributed training schedule, words were trained once per session every 2–3 sessions, while in the clustered schedule, words were trained 3–4 times within a session every 6–8 sessions. The total number of trials per word was the same regardless of schedule, ranging from 13–21 depending on the individual (Table 2). Using the example in Figure 1, “pen” would provide 12 observations of RT, and 36 (3 letters × 12 trials) observations of accuracy.

Figure 1.

Example of the training schedules, either clustered (“cat”) or distributed (“pen”).

Table 3 reports summary results for each individual and the group average, showing general improvement across the training study and retention at follow-up.

Table 3.

Summary of training study data for each individual and group averages. Accuracy and RT are computed per letter. Observations for Accuracy Analysis = number of letters attempted over the course of the study. Observations for RT Analysis = number of words spelled correctly over the course of the study. s = seconds. First or last 25% indicates the mean during the initial or final quarter of the training program, and follow-up accuracy refers to accuracy approximately 3 months after the end of treatment.

Individual Observations for Accuracy Analysis Observations for RT Analysis Mean Accuracy, first 25% Mean Accuracy, last 25% Mean Accuracy, follow-up Mean RT (s), first 25% Mean RT (s), last 25%
DTE 2484 394 84% 95% 87% 10.4 7.3
KST 2460 299 83% 88% 87% 4.7 3.6
MSO 2844 364 60% 80% 78% 2.7 2.9
PQS 2180 269 80% 97% 94% 5.6 2.9
RHN 2372 276 87% 99% 98% 2.8 0.8
Average 2468 320 78.8% 91.8% 88.6% 5.24 3.50

Analysis 1: effect of treatment on spelling accuracy

This analysis addresses two questions: (1) is there a significant effect of training, on average, across individuals? and (2) is there a significant difference in training effectiveness between the two training schedules (distributed versus clustered)? With the clustered schedule, accuracy improved across the within-session trials (from the first to the fourth trial) but fell sharply from the fourth trial to the first trial of the subsequent session. This non-linear “sawtooth” pattern would need to be accounted for in the regression by including Within-Session Trial Number as an additional predictor. However, it would then become impossible to include items receiving clustered and distributed training in the same model, because Within-Session Trial Number is not a relevant variable for the distributed schedule. Therefore, only the first within-session trials were included for the clustered condition (e.g., in Figure 1: the first trial with CAT on sessions 1, 9, and 17). This effectively reduced the amount of available data for this particular analysis by ~35%.

Model description

The model included six fixed-effects: the effects of Schedule (distributed versus clustered), Session (the session on which an item was trained), the interaction of Schedule × Session, DaysSince (the number of days since last training on that word), Word Length, and Word Frequency. Crossed random-effects were included both by-items (with a random intercept and slopes for Session and DaysSince) and by-individuals (with a random intercept and slopes for Schedule, Session, and DaysSince).6
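To make the structure concrete, a model of this form could be specified in lme4 roughly as sketched below. This is an illustration of the structure just described, not the exact call; the formula actually used, along with the variable coding, is given in Appendix 1 (“Analysis 1: Model Formula & Description”). Accuracy here stands in for the letter-level accuracy outcome.

# sketch of the Analysis 1 structure (see Appendix 1 for the exact model)
m_analysis1 <- glmer(
  Accuracy ~ Schedule * Session + DaysSince + Length + Frequency +
    (1 + Session + DaysSince | Target) +
    (1 + Schedule + Session + DaysSince | Individual),
  data = group_data, family = binomial)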

To illustrate the impact of including random-effects, Figure 2 depicts how different model structures would estimate the effect of Session depending on which random-effects were included. Without any random-effects (Figure 2(a)) the estimated effect of Session would be identical for all individuals. Including only a random intercept by-individuals (Figure 2(b)) would take into account that different individuals began training (Session = 0) with different performance levels, but would estimate the slope of Session to be the same for all individuals. Including only a random slope for the effect of Session by-individuals (Figure 2(c)) would account for different rates of improvement across individuals, but ignore the variability in initial spelling accuracy (the intercept). Finally, including both a random intercept and a random slope for Session by-individuals (Figure 2(d)) takes into account differences across individuals both in terms of initial accuracy and improvement rates. Allowing for variability in both intercepts and slopes can improve the model’s ability to explain the data, resulting in more accurate estimates of the average effects.

Figure 2.

Depiction of the results of different model structures: (a) fixed-effects-only, (b) random intercept by-individuals only, (c) random slope for effect of Session by-individuals only, and (d) both random intercept and slope for the effect of Session by-individuals. The x-axes depict the training session and the y-axes the percent letter accuracy. In the random-effects models (b-d), separate regression lines for the effect of Session are estimated for each individual (DTE, KST, MSO, PQS, and RHN).

Lastly, because the by-items and by-individuals random-effects were crossed, item-specific effects (variability attributed to the different training words) were based on responses from all individuals who were trained on that item. For example, three of the participants were trained on “scout”, and so variability from that word is estimated based on these three individuals’ responses to it (if the random-effects were nested, the effect of “scout” would be calculated separately for each individual).

Model results and interpretation

The results of Analysis 1 are reported in Table 4; details of the LMEM structure are available in Appendix 1 (“Analysis 1: Model Formula & Description”).

Table 4.

Results of LMEM Analysis 1 (letter accuracy). *** p < 0.001, ** p < 0.01, * p < 0.05.

Estimate Std. Error z value LRT p
(Intercept) 3.531 0.586 6.030 0.000***
Schedule 0.035 0.173 0.202 0.627
Session 1.316 0.207 6.345 0.001**
Frequency 0.199 0.098 2.028 0.048*
Length −0.195 0.095 −2.046 0.042*
DaysSince −0.217 0.091 −2.385 0.059
Schedule*Session 0.127 0.057 2.225 0.040*

Table 4 reports the following: The first column, Estimate, refers to the estimated value of the beta coefficients. The second column, Std. Error, is the standard error of the estimated betas. The last two columns present z-values of the estimated beta coefficients and the likelihood ratio test (LRT) p-value. The selection of the LRT p-value was made on the basis of the literature review as well as the results of an original Type I error simulation (see Appendix 2). The visualization in Figure 3 depicts the slopes of the fixed-effects (i.e., the expected change in spelling accuracy (y-axis) per change in the independent variables (x-axis)).

Figure 3.

Visualization of the LMEM for Analysis 1 effects (using effects package in R; see Appendix 1). Each panel depicts the predicted percent letter accuracy across sessions for: (a) Word Frequency, (b) Word Length, (c) DaysSince, and (d) the Session × Schedule interaction. The gray bands reflect 95% confidence intervals for the associated effects.

Table 4 shows that all predictors except Schedule and DaysSince are significant at p < 0.05. Critically, there is a significant effect of Session, indicating increased accuracy over time, as well as a significant interaction of Schedule × Session. Because Schedule was coded as +1 for distributed and −1 for clustered, the positive slope of the interaction term reveals that spelling accuracy improved more rapidly for the distributed compared to the clustered schedule. The generalized R2 measure of variance explained (see Appendix 1 “Effect size measures”) equaled 50.3% in total, with 22.0% explained by all of the fixed-effects and 10.8% by the effect of Session (i.e., 49% of the variance explained by fixed-effects was due to the effect of Session).

Summary of analysis 1

The results of this analysis confirm that the spell-study-spell training procedure was on average effective for improving the spelling accuracy of the five individuals. The beta for Session was 1.316 (p = 0.001). Because this is a logistic regression, exponentiating the beta values provides a standardized effect size, namely the odds ratio: e^β. Thus, the effect size of Session is e^1.316 = 3.73. This can be interpreted as ≈34% improvement per training session in the odds of a correct trial. An individual who began treatment with a 50% chance of correct responses would improve to 93% after 10 sessions (for details on interpreting the odds ratio, see Appendix 1 “Effect size measures”).
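The arithmetic behind these conversions can be carried out directly in R; the snippet below is purely illustrative and simply reproduces the calculations for the Session estimate reported in Table 4.

# converting a log-odds coefficient to an odds ratio and a predicted probability
beta_session <- 1.316
exp(beta_session)         # odds ratio for a one-unit increase in Session as entered in the model (~3.73)
plogis(0 + beta_session)  # predicted probability after that increase, starting from a 50% chance (log-odds of 0): ~0.79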

The results also revealed a significant interaction of Schedule × Session (beta = 0.127, p = 0.04), such that the effect of training for the clustered schedule was (1.316–0.127) = 1.189 and for the distributed schedule was (1.316 + 0.127) = 1.443. This means, e.g., an individual beginning with 50% probability of producing a correct letter improved to ≈90% in the clustered schedule versus ≈96% in the distributed after Session 10.

Two control variables were also significant, indicating that performance was worse on words that were longer (effect of Length = −0.195), and better on higher frequency words (effect of Frequency = 0.199). Altogether, the results of the LMEM reveal that training was effective in improving spelling accuracy, and more so for the distributed schedule, even after controlling for the significant effects of word length and frequency.

Overall, the results of Analysis 1 reveal that the spell-study-spell training procedure was successful in improving the spelling accuracy of these five individuals with acquired dysgraphia. The LMEM provides standardized measures of the effect sizes (in the form of beta coefficients and amount of variance explained) as well as p-values for statistical significance.

Analysis 2: effect of treatment on reaction time

The second analysis reports (Table 6 and Figure 4) the improvement in reaction time (RT, measured as seconds per letter) only for trials with correct responses.

Table 6.

Results of LMEM Analysis 2 (RT per letter for words that were correctly spelled). * p < 0.05.

Estimate Std. Error t value Satterthwaite p
(Intercept) 1.062 0.342 3.109 0.036*
Schedule −0.029 0.033 −0.875 0.416
Session −0.247 0.063 −3.957 0.016*
Frequency −0.018 0.028 −0.617 0.538
Length −0.021 0.028 −0.755 0.451
DaysSince −0.004 0.024 −0.146 0.889
Schedule*Session 0.015 0.017 0.910 0.363

Figure 4.

Visualization of the LMEM for Analysis 2 effects (using effects package in R; see Appendix 1). Each panel depicts the predicted RT per letter across sessions for: (a) DaysSince, (b) Word Frequency, (c) Length, and (d) the Session × Schedule interaction. The gray bands reflect 95% confidence intervals for the associated effects.

Summary of analysis 2

It should be noted that the beta coefficients reported in Table 6 are not log-odds (and so cannot be converted into odds ratios as in Analysis 1); instead they indicate the expected change in log RT per unit of each predictor (e.g., Session).

The results reveal a significant effect of training (Session = −0.25), which equates to a 3% decrease in RT per training session. No other predictors were significant in Analysis 2. Unlike the findings regarding accuracy, for RT there was no interaction of Schedule × Session, indicating no significant advantage for one treatment schedule over the other for RT on correct trials.
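For reference, a continuous-outcome model of this kind can be fitted with lmer, with Satterthwaite p-values supplied by the lmerTest package. The sketch below mirrors the structure of Analysis 1 and is illustrative only; the exact call appears in Appendix 1, and the actual analysis included only correct trials.

# sketch of an Analysis 2-style model on log RT
library(lmerTest)   # loading lmerTest makes summary() report Satterthwaite p-values
m_analysis2 <- lmer(
  log(timeperletter) ~ Schedule * Session + DaysSince + Length + Frequency +
    (1 + Session + DaysSince | Target) +
    (1 + Schedule + Session + DaysSince | Individual),
  data = group_data)
summary(m_analysis2)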

General discussion

A statistical approach to analyzing SND is desirable in order to obtain standardized effect sizes and objective measures of significant outcomes, both of which are difficult to achieve with visual analysis. Researchers are faced with an array of choices in statistical techniques, the applicability of which depends on the experimental design. In this paper, we argued for using LMEM to analyze treatment effects in repeated measures studies for two primary reasons: 1) By modeling all individual observations instead of averaging scores within-time points (i.e., trial-by-trial repeated measures analysis, not time series), more information and more data points are retained, and, furthermore, predictors related to the specific items (e.g., word frequency) are included. 2) The addition of random-effects improves upon OLS regression, testing the hypothesis that differences across units (e.g., items, individuals) may have effects on the outcome measure.

The LMEM analysis demonstrated here uses a “trial-based” approach, distinguishing itself from the common approach of averaging together multiple observations collected within a session. For example, a trial-based approach provides 100 data points if ten measures are collected on each of 10 training sessions, compared to only 10 data points under the common approach. A trial-based approach is particularly beneficial for SND where the paucity of data is a concern. We focused on logistic regression, which can be used to analyze data types that are more common in treatment studies (i.e., accuracy or other binomially distributed data). We also conducted an original simulation study that confirms the Wald-Z test may lead to inflated Type I error in logistic LMEM in the analysis of SND, and recommended the use of the LRT for obtaining p-values.

There remain outstanding questions about using multilevel modeling with SND, as outlined by Shadish et al. (2013). Briefly, the two major issues are statistical power and autocorrelation. Regarding the former, while we have demonstrated that LMEM is viable with as few as five participants, a relatively large number of observations within these participants is necessary for sufficient statistical power. It is impossible to state an exact number of required observations, because power depends on a number of factors, perhaps most crucially the size of the treatment effect, which is often not known in advance (see also Ferron, Farmer, & Owens, 2010; Heyvaert et al., 2017, Baayen et al., 2008; Barr et al., 2013; Huber et al., 2015; Matuschek et al., 2017). For increasing the number of observations, we have highlighted the advantage of modeling individual trials, rather than averaging within-session observations into a single measure which, of course, is only relevant where there are multiple observations or items tested within sessions.

In addition to limitations of statistical power, any regression-based approach in which time is a variable faces the issue of autocorrelation. We have underscored, however, that estimation of the fixed-effects is unbiased in the presence of autocorrelation and, moreover, there is reason to believe that a complex random-effects structure (e.g., with random intercepts and slopes for effects of treatment and variables related to time) sufficiently addresses the issue of autocorrelation (see Shadish et al., 2013). In summary, the primary limitations of LMEM in the context of SND are low statistical power (a concern for all SND analyses) and potentially poor estimation of the random-effects parameters. However, the goal of this paper was not to solve these limitations per se, but rather to explain LMEM to aphasia researchers, and provide practical guidance to inform decisions about experimental design and analysis.

LMEM compared to other approaches

We have advocated for a LMEM approach to longitudinal repeated measures data—but what about alternative statistical approaches? As referenced in the introduction, LMEM is not the only approach available for the statistical analysis of SND. However, LMEM has a number of advantages. It has long been recognized that LMEM improves upon ANOVA (e.g., Gueorguieva & Krystal, 2004; Baayen et al., 2008) by addressing common pitfalls, including unequal spacing of observations and missing data, and in the worst case LMEM will return identical results as a comparable ANOVA. Regarding other advantages, in the treatment study we reported on, time (Sessions and DaysSince) was treated as a continuous variable, which is also not possible in the ANOVA framework. Another challenge for ANOVA is addressing the variability across both individuals and items—the traditional approach of collapsing across each separately to compute an F1 (by-individuals) and F2 (by-items) is inherently problematic (e.g., a significant result is obtained in the F1 but not the F2 ANOVA). Note also that a by-individuals ANOVA is equivalent to a LMEM with a random intercept by-individuals (and no other random-effects), and similarly a by-items ANOVA is equivalent to an LMEM with a random intercept by-items only. Thus, mixed-effects models avoid this F1–F2 ANOVA problem, by allowing simultaneous modeling of random-effects for both individuals and items, as well as the possibility of including both random intercepts and slopes. Furthermore, in LMEM missing data points result in losing only those observations, as opposed to the common practice of listwise deletion in ANOVA that results in losing an entire individual’s data.

In short, LMEM provides all of the benefits of statistical analysis of other approaches like ANOVA and OLS regression, and at the very least will produce the same results and lead to the same interpretation. However, in most cases, LMEM is likely to provide better results: it handles missing data and unevenly spaced observations, allows more flexibility in modeling both dependent variables (e.g., continuous, binomial, count data) and independent variables (e.g., time can be treated as categorical or continuous), and simultaneously models both across-individual and across-item variability. LMEM is also well-documented and well-supported in multiple statistical software platforms.

Recovery of function in acquired dysgraphia

Although the focus of this paper has been the statistical evaluation of training data from SND, we will briefly summarize and discuss four aspects of the findings concerning the dysgraphia treatment itself. The first two points concern the general success of the treatment protocol, while the latter two address current issues in learning and memory in the context of treatment of cognitive deficits.

First, the study provides additional evidence of positive outcomes in remediation of acquired dysgraphia (e.g., Beeson, 1999; Rapp & Kane, 2002). In addition to the group analyses, we also estimated treatment effects for each individual, by computing LMEMs identical to those reported here (Analysis 1 and Analysis 2; see Appendix 1), dropping the random-effects by-individuals (as each model was for one individual) but keeping the by-items effects. The results revealed that each of the five individuals had statistically significant gains in spelling accuracy for the trained words and 4 of 5 improved their writing times.7 Furthermore, although not the focus of the analyses that were presented, all five maintained their gains at the 3-month follow-up (see Table 3). Second, as also found in Rapp and Kane (2002) and Rapp (2005), the same treatment approach was effective for individuals with underlying spelling deficits affecting orthographic working memory, orthographic long-term memory, or both. Why the same treatment should be successful with different deficits is not entirely clear, and future research directed at understanding this would be useful for developing remediation approaches tailored to the underlying deficits.

With regard to mechanisms of learning and re-learning, the success of the spell-study-spell treatment approach is relevant to the debate regarding errorless/errorful learning, since it falls within the class of errorful learning methods. On each trial, individuals are first asked to spell the word and are allowed to produce errors. Then, during the subsequent copy portion of the trial, they are given feedback and practice and, finally, they are asked to try again to retrieve and produce the spelling. While this study did not compare the benefits of errorful vs. errorless learning, it does provide evidence that errorful learning can be successful (see Middleton & Schwartz, 2012 for a review and similar conclusions). Interestingly, research on learning and memory suggests that the success of this type of approach likely lies in the opportunities for repeated retrieval attempts in the “spell” components of each trial. Specifically, research with neurotypical individuals finds that learning improves more when the learner attempts to recall information (whether successfully or not) rather than simply restudying it; this benefit of retrieval over study is referred to as the “retrieval practice” effect (for a review, see Roediger & Karpicke, 2006).

Finally, and perhaps most significantly, we find a clear benefit for a distributed compared to a clustered training schedule. Recall that for each individual, all items were trained for the same number of trials, but half the items were trained using a clustered schedule (3–4 trials per session) and half with a distributed schedule (1 trial per session). Results revealed that, for the group, the distributed compared to the clustered schedule produced significantly greater accuracy. This finding is consistent with the many studies in the neurotypical learning and memory literature that support distributed over massed practice (Cepeda et al., 2006) as well as more recent work with individuals with aphasia (Middleton et al., 2016; Sage et al., 2011). These findings should provide some guidance to clinicians in planning treatment schedules.

Conclusions

This paper provides a strong case for the application of LMEM to the analysis of training studies as a preferable alternative to visual analysis or other statistical techniques. We have presented evidence that the approach is robust under the extreme conditions of small numbers of individuals, in the context of repeated measures training data for both continuous (reaction time) and binomially distributed (accuracy) dependent measures. The approach provides standardized measures of effect sizes that are obtained through readily available and well-supported statistical programs (i.e., R, R Core Team, 2016), and provides statistically rigorous estimates of the expected average effect size of training effects, taking into account variability across both items and individuals.

Acknowledgments

We are very grateful to Jennifer Shea for her many valuable contributions to this project, from individual testing through data analysis. We also thank Colin Wilson for guidance on statistical theory. This work was supported by the multi-site NIH-supported grant DC006740 examining the neurobiology of language recovery in aphasia.

Funding

This work was supported by the National Institutes of Health [DC006740];

Appendix 1. Tutorial

R packages and set-up

First, the following R packages must be installed: lme4 (for fitting the models; Bates, Maechler, Bolker, & Walker, 2015); lmerTest (for obtaining Satterthwaite p-values; Kuznetsova, Brockhoff, & Christensen, 2017); effects (for visualizing effects; Fox, 2003); and MuMIn (for calculating variance explained; Bartoń, 2009):

install.packages("lme4")
install.packages("lmerTest")
install.packages("effects")
install.packages("MuMIn")

Then the packages must be loaded:

library(lme4)
library(lmerTest)
library(effects)
library(MuMIn)

The data should be loaded, which can be done from a number of formats. For example, a comma separated values file (“.csv”) from Excel, with the data for every trial and every individual, can be loaded:

group_data <- read.csv("group_data.csv", header = TRUE, sep = ",")

The data frame group_data now exists, which in this case contains the following columns:

  1. timeperletter: RT on each trial

  2. letterpercent: proportion of letters spelled correctly

  3. DaysSince: number of days since last presentation of the target word

  4. Frequency: (log) word frequency (using SUBTLEX-US, Brysbaert & New, 2009)

  5. Length: number of letters in the word

  6. Schedule: either Clustered or Distributed

  7. Session: training session number

  8. Individual: The individual’s initials

  9. Target: The word spelled.
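Before modeling, the loaded data frame can be inspected with standard base R functions to confirm that the columns were read in as expected, e.g.:

str(group_data)
head(group_data)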

Consideration should be given to whether continuous variables should be mean-centered or standardized (mean-centered and divided by the standard deviation); this is recommended in particular when there are interaction terms in the model, as it will reduce multicollinearity. In the analyses presented here, all continuous variables were standardized. Any categorical variables that will be entered as predictor variables should be indicated as such to R, e.g., the variable Schedule, which can be referred to within the group_data frame by use of the $ symbol:

group_data$Schedule <- as.factor(group_data$Schedule)
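Continuous variables can be standardized in the same data frame; as a minimal sketch using base R's scale() function (which variables to standardize, e.g., Frequency or Session, is the analyst's choice):

# Standardize a continuous predictor: subtract its mean and divide by its standard deviation.
# scale() returns a one-column matrix, so as.numeric() keeps the result a plain vector.
group_data$Frequency <- as.numeric(scale(group_data$Frequency))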

The appropriate contrasts for the levels of the categorical variable must also be set. By default, dummy coding is used, e.g., a two-level categorical variable is coded as 0 and 1 (with the first level, alphabetically, coded as 0). To set the contrasts to sum coding, e.g., the Schedule variable coded as −1 and 1:

contrasts(group_data$Schedule) <- "contr.sum"
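The coding now in effect can be verified by printing the contrast matrix for the factor:

contrasts(group_data$Schedule)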

With dummy coding, the intercept reflects the mean for the level of the variable that is coded as 0 (the “reference level”), and the betas associated with that variable reflect the difference between the level(s) coded as 1 and the reference level. In sum coding, the intercept instead reflects the grand mean (the mean of the means for each level of the variable), and the betas associated with that variable reflect the difference between the level(s) coded as 1 and the grand mean. Sum coding may be preferable to dummy coding in the presence of higher-order interactions (e.g., the interaction between two categorical variables) because it is more interpretable: the betas always reflect the difference between some level of the categorical variable and the overall (grand mean) effect. See the R documentation, command help("contrasts"), for more information and alternative coding schemes. The data are now ready for analysis.

Analysis 1: model formula and description

The dependent variable of accuracy is measured as the proportion of letters correct in each word, and is modeled as a binomial distribution: e.g., the word “cat” is modeled as three trials, one for each letter. In R, the left-hand side of the equation represents the dependent variable, and the right hand side the formula for the fixed-effects, followed by the random-effects. For this analysis, the formula specified was:

model1 <- glmer(letterpercent ~ Schedule*Session + DaysSince + Frequency + Length
+ (1 + Schedule + Session + DaysSince | Individual)
+ (1 + Session + DaysSince | Target),
data = group_data, family = "binomial", weights = group_data$Length)

The first line of the formula can be interpreted as follows: we make a call to the function glmer (generalized linear mixed-effects, used here for logistic regression) and specify that the dependent variable (letterpercent) is predicted by a linear combination of the independent variables Schedule, Session, DaysSince, Frequency, and Length. The * between Schedule and Session includes both of their main effects and their interaction, i.e., it tests the hypothesis that the effect of training (Session) may differ across schedules (clustered vs. distributed). The second and third lines specify the random-effects structure; in this analysis, we have included random intercepts and slopes for both the effect of treatment and all time-related variables, by-individuals and by-items.8 The last line specifies that the dependent variable should be modeled as binomially distributed, and that the weights for each trial are equal to the length of the word.

The function coef() can be wrapped around the function summary() to return the results of the model for the fixed-effects, as follows (see Table 4):

coef(summary(model1))
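The estimated random-effects variances and correlations can likewise be inspected from the fitted model, e.g.:

VarCorr(model1)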

Effect size measures

As has already been stated, the beta coefficients themselves provide an effect size measure. For logistic regression, these betas are log-odds ratios, and so are standardized measures of effect size. In this model, model1, they represent the increase in the odds of correctly spelling a word per training session. As presented in Table 4, the beta for Session = 1.316 can be exponentiated to give the odds ratio: e^1.316 = 3.73. This odds ratio of 3.73 indicates a 273% increase in the odds of a correct trial, per unit increase of the predictor. In this model, one unit of the predictor Session corresponded to approximately 8 actual training sessions (because the measure was standardized), and so equates to 273/8 ≈ 34% improvement in the odds of a correct trial per single training session. Odds can be transformed into more familiar probabilities via the formula P = Odds/(Odds + 1). In this model, e.g., an individual who had 50/50 odds of producing a correct letter during a trial in Session 1 would improve to odds of (50 × 1.34)/50 = 67/50, or 1.34, in Session 2; those odds correspond to a probability of 1.34/(1 + 1.34) = 57%.
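If desired, odds ratios for all of the fixed-effects can be computed directly from the fitted model; a minimal sketch using the lme4 accessor fixef():

# Exponentiate the fixed-effect estimates (log-odds ratios) to obtain odds ratios.
exp(fixef(model1))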

An alternative effect size measure can be obtained in the form of a generalized R2 measure, that is, the amount of variance explained. The function r.squaredGLMM() from the MuMIn package can be used for this purpose. It provides two measures of variance explained: marginal (fixed-effects only) and conditional (fixed plus random-effects). One can compare the amount of variance explained by the full fitted model to that explained by a reduced model to determine how much variance a single predictor explains. For example, we can determine how much variance is explained overall by the effect of Session; first, how much variance is explained in model1?

r.squaredGLMM(model1)

We can then fit a reduced model that removes the fixed-effect of Session, and attribute the difference in the amount of explained variance to the effect of Session:

model1_reduced <- glmer(letterpercent ~ Schedule + DaysSince + Frequency + Length
+ (1 + Schedule + Session + DaysSince | Individual)
+ (1 + Session + DaysSince | Target),
data = group_data, family = "binomial", weights = group_data$Length)
r.squaredGLMM(model1_reduced)
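The difference between the two sets of R2 values can then be attributed to the fixed-effect of Session. A minimal sketch (note that, depending on the version of MuMIn, r.squaredGLMM() may return more than one row of estimates, in which case the corresponding rows should be compared):

r2_full <- r.squaredGLMM(model1)
r2_reduced <- r.squaredGLMM(model1_reduced)
r2_full - r2_reduced  # change in marginal (and conditional) variance explained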

Calculating LRT p-value

The default p-value returned from the summary of the glmer() function is based on the Wald-Z statistic. An alternative is to conduct a likelihood ratio test (LRT) using the built-in anova() function. As with the approach for isolating the amount of variance explained by a particular (set of) predictors, the LRT relies on model comparison. For example, to calculate a p-value for the interaction of Session and Schedule, another reduced model can be fit that is identical to model1 except that it omits the interaction term; this model and the original are then passed together to the anova() function:

model1_no_interaction <- glmer(letterpercent ~ Schedule + Session + DaysSince + Frequency + Length
+ (1 + Schedule + Session + DaysSince | Individual)
+ (1 + Session + DaysSince | Target),
data = group_data, family = "binomial", weights = group_data$Length)
anova(model1, model1_no_interaction)

This call returns the output shown in Table 5.

Table 5.

LRT for interaction term in model1. AIC = Akaike Information Criterion, BIC = Bayesian Information Criterion, logLik = log likelihood, Chi Df = chi-squared degrees of freedom, Pr(>Chisq) = p-value.

                       Df   AIC     BIC     logLik   deviance   Chisq   Chi Df   Pr(>Chisq)
model1_no_interaction  22   4421.7  4547.4  −2189    4378
model1                 23   4420    4551    −2187    4374       4.207   1        0.040

This output gives, for each model, the degrees of freedom (Df), the Akaike and Bayesian Information Criterion measures (AIC and BIC), the log likelihood (logLik), the deviance, and the Chi-squared statistic with its degrees of freedom (here 1, because the interaction term adds one parameter to the model). Most importantly, the last column gives the p-value of the Chi-squared test, here testing the interaction between Schedule and Session and calculated to be 0.04. This compares to the Wald-Z p-value of 0.026 returned by the summary() function. See Appendix 2 for the results of a Type I error simulation study that confirms the greater Type I error inflation for the Wald-Z p-values relative to the LRT.

Visualizing results

The effects calculated in the fitted model can be visualized by using the “effects” package. Figure 2 depicts all of the effects included in model1 and can be accessed using the following commands:

model1_effects <- allEffects(model1)
plot(model1_effects)
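A single term can also be plotted on its own with the effect() function from the same package; the term label passed to effect() must match the name used in the model formula (the label below is an assumption and should be checked against the model output):

plot(effect("Schedule:Session", model1))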

A note on pre/post or follow-up designs

Pre-test/post-test designs or follow-up designs can also be analyzed with LMEM. For example, in the dysgraphia treatment study reported here and analyzed as model1, individuals returned for follow-up tests where they spelled each of their training words. To determine whether their improved spelling performance was retained, a separate LMEM could be computed, using the scores from the final training session as post-training scores and comparing them to the follow-up scores. The primary difference between such a model and the ones presented here is that with only two time points, the time point variable (which previously was “Session”) becomes a categorical variable9 encoding whether an observation comes from the post-training or the follow-up time point. Aside from that, the same method can be used, still without averaging over items, so that factors associated with them (like word frequency) can be accounted for. In such a model, the main effect of the time point would indicate whether there was a significant change in performance after the end of training; the interaction Time × Schedule would indicate whether this change differed significantly between treatment schedules.
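As a hypothetical sketch only (the data frame followup_data and the two-level factor TimePoint are not part of the dataset described above, and the random-effects structure would need to be adapted to the available data), such a model might look like:

# Hypothetical post-training vs. follow-up comparison; TimePoint is sum-coded (see Note 9).
followup_data$TimePoint <- as.factor(followup_data$TimePoint)
contrasts(followup_data$TimePoint) <- "contr.sum"
model_followup <- glmer(letterpercent ~ Schedule*TimePoint + Frequency + Length
+ (1 + Schedule + TimePoint | Individual)
+ (1 + TimePoint | Target),
data = followup_data, family = "binomial", weights = followup_data$Length)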

Analysis 2: effect of treatment on reaction time

The second analysis demonstrates how to apply the same approach used in Analysis 1 to a continuous outcome variable, in this case reaction time (RT). The main difference relevant to the tutorial is that instead of logistic regression and the function glmer(), continuous variables are modeled with regular (i.e., non-logistic) regression and the function lmer().

Model Formula and Description

The dependent variable is RT, analyzed only on trials where the individual spelled the word correctly. This results in a reduced number of data points, and is the typical practice for RT analyses. It is worth noting that, because of the reduced number of data points for the RT analysis, some of the individuals have fewer data points than are needed to include random-effects for Schedule, Session, and DaysSince; this is one advantage of analyzing the individuals as a group, because the total number of observations combined across all five individuals is sufficient to include all of these random-effects. The model formula is identical to that used for model1 in Analysis 1, except that the dependent variable is (log) RT and we call the function lmer() instead of glmer(). For this linear regression, we also do not specify the family as binomial, nor do we specify weights for the trials.

model2<-lmer(log(timeperletter) ~ Schedule*Session + DaysSince + Frequency + Length
+ (1 + Schedule + Session + DaysSince | Individual)
+ (1 + Session + DaysSince | Target),
data = group_data)

The results of this model are presented in Table 6. With the lmerTest package loaded, the summary() function reports the p-values based on the Satterthwaite approximation for degrees of freedom.
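For example, the fixed-effects table (including the Satterthwaite-based p-values) can be printed with the same call used for model1:

coef(summary(model2))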

Appendix 2. Type I error simulation

As indicated in the Introduction, there is discussion in the literature on the best method for estimating p-values in LMEM. The Wald-Z and LRT represent commonly used approaches that make different statistical assumptions and thus can yield different p-value estimates. In order to provide further understanding of this issue, we conducted an original simulation study to compare the Type I error rates associated with the Wald-Z and the LRT in the LMEM for Analysis 1. To do so, we generated artificial datasets by sampling from a distribution designed such that the average beta values for the variables of Session and the Schedule × Session interaction were zero, with standard deviations equal to those observed for the real dataset. We then subjected each of those datasets to LMEM analysis, evaluating the results using the Wald-Z and LRT approaches. If the Type I error of the statistical test equals 0.05, then in 1000 simulations we should obtain approximately 50 p-values below 0.05. Table 7 reports the Type I error rates based on this simulation study: the probability of falsely finding a significant effect of Session or the interaction Schedule × Session when the true effect is (by design) zero. These results reveal that for the main effect of Session, Type I error is near the desired criterion of 0.05, although it is somewhat higher for the Wald-Z compared to the LRT (0.064 versus 0.060). In the case of the interaction Schedule × Session, Type I error is highly inflated under the Wald-Z (0.116) but remains near 0.05 under the LRT (0.059). These results support that the Wald-Z method leads to inflated Type I error relative to the LRT, as has been suggested elsewhere (e.g., Moeyaert et al., 2014), and for this reason we recommend the use of the LRT, particularly in the case of SND and generalized LMEM (e.g., analysis of binomial data). See Appendix 1 (“Calculating LRT p-value”) for details on how to compute the LRT p-value.

Table 7.

Results of Type I error simulation for main effect (Session) and interaction (Schedule*Session) comparing the Wald-Z p-value and LRT p-value.

         Session   Schedule*Session
Wald-Z   0.064     0.116
LRT      0.060     0.059

Type I error simulation procedure

A simulation-based procedure was used to evaluate the Type I error rate resulting from two methods for obtaining p-values from the logistic LMEM: The Wald-Z test, which is the default method in the lme4 package in R, and the LRT as outlined in Analysis 1. Briefly, we simulated 1,000 datasets based on the fitted model1 presented in Analysis 1, setting the beta coefficients for Session and the Schedule*Session interaction = 0. We then refit the model to each of these simulated datasets and obtained the p-values for both Session and Schedule*Session using each of the two p-value methods. The results of these simulations are presented in Table 7.

The results of the simulation indicate that for the fixed-effect of Session, the Wald-Z test performed near the desired Type I error level of 0.05, specifically 0.064. The LRT p-value for the fixed-effect of Session was also near the nominal Type I error level, and slightly better than the Wald-Z test, specifically 0.060.

For the interaction effect of Session*Schedule, a noticeable difference was found between the two methods. The Wald-Z test resulted in decidedly inflated Type I error, with a rate of 0.116. In this case, the LRT clearly outperformed the Wald-Z, as it maintained performance near the nominal level of 0.05, specifically 0.059. This result confirms what has been suggested elsewhere (Moeyaert et al., 2014), and provides evidence that the LRT is preferable to the Wald-Z test for logistic LMEM, at least in the case of SND.

Models fit by the lme4 package can be used to simulate hypothetical datasets by first extracting the relevant parameters. For example, the parameters can be extracted from the logistic LMEM analyzing accuracy (model1):

model1.newparams <- list(
  beta = fixef(model1),
  theta = getME(model1, "theta"))

These parameters can be selectively edited to assess Type I error (or statistical power) for various effect sizes by editing the values of the parameters of interest. For example, to assess Type I error for the effect of Session and the interaction Session*Schedule in this model:

model1.newparams$beta["Session"] <- 0
model1.newparams$beta["Session:Schedule"] <- 0

This sets the mean of the sampling distribution for these betas to zero. To assess the Type I error rate, hypothetical datasets are then generated from these new parameters, and models are refit to the simulated data:

nsim <- 1000
dv.sim <- simulate(model1, nsim = nsim, newparams = model1.newparams)

The number of simulations (“nsim”) determines the number of new sets of simulated dependent variables, each of which is stored in a column of “dv.sim”. Models are then fit to each of these simulated sets of observations with the same model formula used for the original model1. The desired output(s) of each model should be stored; this can be done, e.g., using a for loop:

model1.sim <- list()
for (i in 1:nsim) {
  model1.sim[[i]] <- refit(model1, dv.sim[[i]])
}

Each refit model is saved in the list model1.sim and can then be accessed to evaluate the parameter(s) of interest.
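For example, the empirical Type I error rate for the interaction can be estimated by collecting the Wald-Z p-values from each refit model; a minimal sketch (the coefficient name "Session:Schedule" is an assumption and should be checked against rownames(coef(summary(model1)))):

# Proportion of simulations in which the interaction is (falsely) significant at 0.05.
pvals <- sapply(model1.sim, function(m) coef(summary(m))["Session:Schedule", "Pr(>|z|)"])
mean(pvals < 0.05)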

Footnotes

Notes

1.

The distinction between a repeated-measures design and a time series is that while the former provides multiple observations of the same variable at the same time point, the latter provides only a single observation per time point.

2.

Nesting the effect of items within individuals would create separate effects for each word in each individual, e.g., allowing “house” to show a steeper training effect overall in one individual but a shallower effect in another individual. In the analysis we present, by crossing items with individuals, we assume instead that any differences in the effect of training the word “house” not accounted for by other variables, should be common across all individuals included in the study.

3.

Type I error is the rejection of the null hypothesis when it is actually true, i.e., incorrectly finding an effect as significantly different from zero at some α threshold (typically 0.05) when it is not. In a treatment study this would correspond to a false positive—reporting an effect as significant when it is not.

4.

To be clear, it is statistical independence of the errors that is a basic assumption of linear regression, not of the dependent variable itself. It is natural for data points observed from the same individual across time to show a relationship, due to any number of underlying factors. The problem of autocorrelation arises if this relationship is not accounted for by the independent variables in the model and/or a statistical method of control, and thus remains in the residual errors.

5.

Barr et al. (2013) clarify that “if a factor [i.e., a fixed-effect] is a between-unit factor, then a random intercept is usually sufficient” (p. 275). The point here is that if any given unit (e.g., any specific item) only takes on one value for some predictor, then a random slope for that predictor should not be included (by definition, a slope cannot be estimated based on only one value). Thus, e.g., a random slope for word frequency does not make sense by-items, as word frequency only varies between words, not within.

6.

Random slopes were included for all variables that related to time (i.e., Session and DaysSince) both by-items and by-individuals, and for Schedule by-individuals, on the basis that they are the critical variables of interest. A random slope for Schedule by-items was not included because no item appeared in both schedules (i.e., each word was either on a Clustered or a Distributed schedule but never both).

7.

The choice of modeling a small number of individuals as a group or as separate individuals depends on both theoretical and practical questions. Theoretically, if the interest is in estimating an effect for an average individual, then the participants should be modeled as a group. If instead the interest is in measuring the effects specific to each individual, then each can be modeled separately. There may, however, be insufficient data for individual LMEM models (e.g., a single observation per individual per time point), but sufficient data when combining the individuals into a group.

8.

The random-effects structure in model1 is of crossed random-effects for participants and items. The alternative, nested random-effects, for participants and items would be modeled as (1 + Schedule + Session + DaysSince | Target:Individual).

9.

We recommend in this situation using sum-coding (i.e., coding “post” as −1 and “follow-up” as +1) as opposed to treatment or dummy-coding (coding as 0 and 1).

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  1. Baayen RH, Davidson DJ, & Bates DM (2008). Mixed-effects modeling with crossed random-effects for subjects and items. Journal of Memory and Language, 59(4), 390–412. doi: 10.1016/j.jml.2007.12.005 [DOI] [Google Scholar]
  2. Baek EK, & Ferron JM (2013). Multilevel models for multiple-baseline data: Modeling across-participant variation in autocorrelation and residual variance. Behavior Research Methods, 45(1), 65–74. doi: 10.3758/s13428-012-0231-z [DOI] [PubMed] [Google Scholar]
  3. Baek EK, Moeyaert M, Petit-Bois M, Beretvas SN, Van Den Noortgate W, & Ferron JM (2014). The use of multilevel analysis for integrating single-case experimental design results within a study and across studies. Neuropsychological Rehabilitation, 24(3–4), 590–606. doi: 10.1080/09602011.2013.835740 [DOI] [PubMed] [Google Scholar]
  4. Barr DJ, Levy R, Scheepers C, & Tily HJ (2013). Random-effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. doi: 10.1016/j.jml.2012.11.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bartoń K (2009). MuMIn: Multi-model inference. R package version 1.40.4. http://r-forge.r-project.org/projects/mumin/.
  6. Bates D, Maechler M, Bolker B, & Walker S (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. doi: 10.18637/jss.v067.i01 [DOI] [Google Scholar]
  7. Beeson PM (1999). Treating acquired writing impairment: Strengthening graphemic representations. Aphasiology, 13(9–11), 767–785. doi: 10.1080/026870399401867 [DOI] [Google Scholar]
  8. Borckardt JJ, & Nash MR (2014). Simulation modelling analysis for small sets of single-subject data collected over time. Neuropsychological Rehabilitation, 24(3–4), 492–506. doi: 10.1080/09602011.2014.895390 [DOI] [PubMed] [Google Scholar]
  9. Brysbaert M, & New B (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. doi: 10.3758/BRM.41.4.977 [DOI] [PubMed] [Google Scholar]
  10. Buchwald A, & Rapp B (2009). Distinctions between orthographic long-term memory and working memory. Cognitive Neuropsychology, 26(8), 724–751. doi: 10.1080/02643291003707332 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Burns MK (2012). Meta-analysis of single-case design research [special issue]. Journal of Behavioral Education, 21(3). doi: 10.1007/s10864-012-9158-9 [DOI] [Google Scholar]
  12. Campbell JM (2004). Statistical comparison of four effect sizes for single-subject designs. Behavior Modification, 28(2), 234–246. doi: 10.1177/0145445503259264 [DOI] [PubMed] [Google Scholar]
  13. Campbell JM, & Herzinger CV (2010). Statistics and single subject research methodology. In Gast DL (Ed.), Single Subject Research Methodology in Behavioral Sciences (pp. 91–109). New York, NY: Routledge. [Google Scholar]
  14. Cepeda NJ, Pashler H, Vul E, Wixted JT, & Rohrer D (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380. doi: 10.1037/0033-2909.132.3.354 [DOI] [PubMed] [Google Scholar]
  15. Corballis MC (2009). Comparing a single case with a control sample: Refinements and extensions. Neuropsychologia, 47(13), 2687–2689. doi: 10.1016/j.neuropsychologia.2009.04.007 [DOI] [PubMed] [Google Scholar]
  16. Crawford JR, Garthwaite PH, & Howell DC (2009). On comparing a single case with a control sample: An alternative perspective. Neuropsychologia, 47(13), 2690–2695. doi: 10.1016/j.neuropsychologia.2009.04.011 [DOI] [PubMed] [Google Scholar]
  17. Crawford JR, & Howell DC (1998). Comparing an individual’s test score against norms derived from small samples. The Clinical Neuropsychologist, 12(4), 482–486. doi: 10.1076/clin.12.4.482.7241 [DOI] [Google Scholar]
  18. Davis DH, Gagné P, Fredrick LD, Alberto PA, Waugh RE, & Haardörfer R (2013). Augmenting visual analysis in single-case research with hierarchical linear modeling. Behavior Modification, 37(1), 62–89. doi: 10.1177/0145445512453734 [DOI] [PubMed] [Google Scholar]
  19. Dedrick RF, Ferron JM, Hess MR, Hogarty KY, Kromrey JD, Lang TR, & Lee RS (2009). Multilevel modeling: a review of methodological issues and applications. Review Of Educational Research, 79(1), 69–102. doi: 10.3102/0034654308325581 [DOI] [Google Scholar]
  20. Dunn LM, Dunn LM, Bulheller S, & Häcker H (1981). Peabody picture vocabulary test-revised. Circle Pines, MN: American Guidance Service. [Google Scholar]
  21. Evans JJ, Gast DL, Perdices M, & Manolov R (2014). Single case experimental designs: Introduction to a special issue of neuropsychological rehabilitation. Neuropsychological Rehabilitation, 24(3–4), 305–314. doi: 10.1080/09602011.2014.903198 [DOI] [PubMed] [Google Scholar]
  22. Ferron JM, Bell BA, Hess MR, Rendina-Gobioff G, & Hibbard ST (2009). Making treatment effect inferences from multiple-baseline data: The utility of multilevel modeling approaches. Behavior Research Methods, 41(2), 372–384. doi: 10.3758/BRM.41.2.372 [DOI] [PubMed] [Google Scholar]
  23. Ferron JM, Farmer JL, & Owens CM (2010). Estimating individual treatment effects from multiple-baseline data: A Monte Carlo study of multilevel-modeling approaches. Behavior Research Methods, 42(4), 930–943. doi: 10.3758/BRM.42.4.930 [DOI] [PubMed] [Google Scholar]
  24. Gelman A, & Hill J (2006). Data analysis using regression and multilevel/hierarchical models (Vol. 1). New York, NY: Cambridge University Press. [Google Scholar]
  25. Goodman RA, & Caramazza A (1985). The Johns Hopkins university dysgraphia battery. Baltimore, MD: Johns Hopkins University. [Google Scholar]
  26. Gueorguieva R, & Krystal JH (2004). Move over ANOVA: Progress in analyzing repeated-measures data and its reflection in papers published in the Archives of General Psychiatry. Archives of General Psychiatry, 61(3), 310. doi: 10.1001/archpsyc.61.3.310 [DOI] [PubMed] [Google Scholar]
  27. Hanke JE, & Wichern DW (2005). Business forecasting (8th ed.). Upper Saddle River, N.J.: Pearson Prentice Hall. [Google Scholar]
  28. Hedges LV, Pustejovsky JE, & Shadish WR (2012). A standardized mean difference effect size for single case designs. Research Synthesis Methods, 3(3), 224–239. doi: 10.1002/jrsm.1052 [DOI] [PubMed] [Google Scholar]
  29. Heyvaert M, Moeyaert M, Verkempynck P, Van den Noortgate W, Vervloet M, Ugille M, & Onghena P (2017). Testing the intervention effect in single-case experiments: A Monte Carlo simulation study. The Journal of Experimental Education, 85(2), 175–196. doi: 10.1080/00220973.2015.1123667 [DOI] [Google Scholar]
  30. Howard D, Best W, & Nickels L (2015). Optimising the design of intervention studies: Critiques and ways forward. Aphasiology, 29(5), 526–562. doi: 10.1080/02687038.2014.985884 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Hox J (1998). Multilevel modeling: When and why. In Classification, data analysis, and data highways (pp. 147–154). Berlin, Heidelberg: Springer. [Google Scholar]
  32. Huber S, Klein E, Moeller K, & Willmes K (2015). Comparing a single case to a control group – applying linear mixed-effects models to repeated measures data. Cortex; a Journal Devoted to the Study of the Nervous System and Behavior, 71, 148–159. doi: 10.1016/j.cortex.2015.06.020 [DOI] [PubMed] [Google Scholar]
  33. Kay J, Lesser R, & Coltheart M (1992). PALPA: Psycholinguistic assessments of language processing in aphasia. Hove: Lawrence Erlbaum Associates Ltd. [Google Scholar]
  34. Kuffner TA, & Walker SG (2017). Why are P-Values Controversial? The American Statistician, 0–0. doi: 10.1080/00031305.2016.1277161 [DOI] [Google Scholar]
  35. Kuznetsova A, Brockhoff PB, & Christensen RHB (2017). lmerTest Package: Tests in linear mixed-effects models. Journal of Statistical Software, 82(13), 1–26. doi: 10.18637/jss.v082.i13 [DOI] [Google Scholar]
  36. Lane JD, & Gast DL (2014). Visual analysis in single case experimental design studies: Brief review and guidelines. Neuropsychological Rehabilitation, 24(3–4), 445–463. doi: 10.1080/09602011.2013.815636 [DOI] [PubMed] [Google Scholar]
  37. Lenz AS (2015). Using single-case research designs to demonstrate evidence for counseling practices. Journal of Counseling and Development, 93(4), 387–393. [Google Scholar]
  38. Maas CJ, & Hox JJ (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3), 86–92. doi: 10.1027/1614-2241.1.3.86 [DOI] [Google Scholar]
  39. Matuschek H, Kliegl R, Vasishth S, Baayen H, & Bates D (2017). Balancing type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315. [Google Scholar]
  40. Matyas TA, & Greenwood KM (1996). Serial dependency in single-case time series. In Franklin RD, Allison DB, & Gorman BS (Eds.), Design and Analysis of Single-Case Research (pp. 215–243). Lawrence Erlbaum. [Google Scholar]
  41. Middleton EL, & Schwartz MF (2012). Errorless learning in cognitive rehabilitation: A critical review. Neuropsychological Rehabilitation, 22(2), 138–168. doi: 10.1080/09602011.2011.639619 [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Middleton EL, Schwartz MF, Rawson KA, Traut H, & Verkuilen J (2016). Towards a theory of learning for naming rehabilitation: Retrieval practice and spacing effects. Journal of Speech Language and Hearing Research, 59(5), 1111. doi: 10.1044/2016_JSLHR-L-15-0303 [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Moeyaert M, Ferron JM, Beretvas SN, & Van den Noortgate W (2014). From a single-level analysis to a multilevel analysis of single-case experimental designs. Journal Of School Psychology, 52(2), 191–211. doi: 10.1016/j.jsp.2013.11.003 [DOI] [PubMed] [Google Scholar]
  44. Moeyaert M, Ugille M, Ferron JM, Beretvas SN, & Den Noortgate WV (2014). Three-level analysis of single-case experimental data: Empirical validation. The Journal of Experimental Education, 82(1), 1–21. doi: 10.1080/00220973.2012.745470 [DOI] [Google Scholar]
  45. Moeyaert M, Ugille M, Ferron JM, Beretvas SN, & Van den Noortgate W (2014). The influence of the design matrix on treatment effect estimates in the quantitative analyses of single-subject experimental design research. Behavior Modification, 38(5), 665–704. doi: 10.1177/0145445514535243 [DOI] [PubMed] [Google Scholar]
  46. Moeyaert M, Ugille M, Ferron JM, Onghena P, Heyvaert M, Beretvas SN, & Van den Noortgate W (2015). Estimating intervention effects across different types of single-subject experimental designs: empirical illustration. School Psychology Quarterly, 30(1), 50–63. doi: 10.1037/spq0000068 [DOI] [PubMed] [Google Scholar]
  47. Mycroft RH, Mitchell DC, & Kay J (2002). An evaluation of statistical procedures for comparing an individual’s performance with that of a group of controls. Cognitive Neuropsychology, 19(4), 291–299. doi: 10.1080/02643290143000150 [DOI] [PubMed] [Google Scholar]
  48. Owens CM, & Ferron JM (2012). Synthesizing single-case studies: A Monte Carlo examination of a three-level meta-analytic model. Behavior Research Methods, 44(3), 795–805. doi: 10.3758/s13428-011-0180-y [DOI] [PubMed] [Google Scholar]
  49. Parker RI, Vannest KJ, & Davis JL (2011). Effect size in single-case research: A review of nine nonoverlap techniques. Behavior Modification, 35(4), 303–322. doi: 10.1177/0145445511399147 [DOI] [PubMed] [Google Scholar]
  50. R Core Team. (2016). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. [Google Scholar]
  51. Rapp B (2005). The relationship between treatment outcomes and the underlying cognitive deficit: Evidence from the remediation of acquired dysgraphia. Aphasiology, 19(10–11), 994–1008. doi: 10.1080/02687030544000209 [DOI] [Google Scholar]
  52. Rapp B, & Kane A (2002). Remediation of deficits affecting different components of the spelling process. Aphasiology, 16(4–6), 439–454. doi: 10.1080/02687030244000301 [DOI] [Google Scholar]
  53. Raudenbush SW, & Bryk AS (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). Sage. [Google Scholar]
  54. Roediger HL, & Karpicke JD (2006). The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science, 1(3), 181–210. doi: 10.1111/j.1745-6916.2006.00012.x [DOI] [PubMed] [Google Scholar]
  55. Sage K, Snell C, & Lambon Ralph MA (2011). How intensive does anomia therapy for people with aphasia need to be? Neuropsychological Rehabilitation, 21(1), 26–41. doi: 10.1080/09602011.2010.528966 [DOI] [PubMed] [Google Scholar]
  56. Sakamoto Y, Ishiguro M, & Kitagawa G (1986). Akaike information criterion statistics. Dordrecht: D. Reidel. [Google Scholar]
  57. Shadish WR (2014). Analysis and meta-analysis of single-case designs [special issue]. Journal of School Psychology, 52(2), doi: 10.1016/j.jsp.2013.11.009 [DOI] [PubMed] [Google Scholar]
  58. Shadish WR, Kyse EN, & Rindskopf DM (2013). Analyzing data from single-case designs using multilevel models: New applications and some agenda items for future research. Psychological Methods, 18(3), 385–405. doi: 10.1037/a0032964 [DOI] [PubMed] [Google Scholar]
  59. Shadish WR, Zuur AF, & Sullivan KJ (2014). Using generalized additive (mixed) models to analyze single case designs. Journal of School Psychology, 52(2), 149–178. doi: 10.1016/j.jsp.2013.11.004 [DOI] [PubMed] [Google Scholar]
  60. Singer JD, & Willett JB (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York, NY: Oxford University Press. [Google Scholar]
  61. Swaminathan H, Rogers HJ, Horner RH, Sugai G, & Smolkowski K (2014). Regression models and effect size measures for single case designs. Neuropsychological Rehabilitation, 24(3–4), 554–571. doi: 10.1080/09602011.2014.887586 [DOI] [PubMed] [Google Scholar]
  62. Thompson CK, Lukic S, King MC, Mesulam MM, & Weintraub S (2012). Verb and noun deficits in stroke-induced and primary progressive aphasia: The Northwestern naming battery. Aphasiology, 26(5), 632–655. doi: 10.1080/02687038.2012.676852 [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Ugille M, Moeyaert M, Beretvas SN, Ferron J, & Van den Noortgate W (2012). Multilevel meta-analysis of single-subject experimental designs: a simulation study. Behavior Research Methods, 44(4), 1244–1254. doi: 10.3758/s13428-012-0213-1 [DOI] [PubMed] [Google Scholar]
  64. Van den Noortgate W, & Onghena P (2008). A multilevel meta-analysis of single-subject experimental design studies. Evidence-based Communication Assessment and Intervention, 2(3), 142–151. doi: 10.1080/17489530802505362 [DOI] [Google Scholar]
  65. Van Den Noortgate W, & Onghena P (2003). Hierarchical linear models for the quantitative integration of effect sizes in single-case research. Behavior Research Methods, Instruments, & Computers, 35(1), 1–10. doi: 10.3758/BF03195492 [DOI] [PubMed] [Google Scholar]
