1 Introduction

Statistics is a subject in higher education (HE) that students often have trouble learning (Förster et al., 2018; Schwerter, Wortha et al., 2022; Vaessen et al., 2017) and that consequently leaves many of them affected by statistics anxiety (Condron et al., 2018). This is of serious practical concern because statistics is part of the curriculum of many university subjects (Garfield & Ben-Zvi, 2007). Research also indicates that many beginning students have severe difficulties thinking statistically and hold several misconceptions about statistics (Förster et al., 2018). It is therefore important to improve the design of statistics instruction so that students can counteract these learning difficulties. One possibility to improve student learning is the use of e-learning tools with retrieval practice, video teaching, and similar formats, which have gained relevance in HE (Förster et al., 2018, 2022; Graham et al., 2013; Schwerter, Wortha et al., 2022; Velde et al., 2021). Research and academic literature evaluating this way of teaching have grown accordingly (Anthony et al., 2020; Castro & Tumibay, 2021). Recently, learning analytics has become a major trend in HE research (Hellings & Haelermans, 2020). Among the many possibilities, this study focuses on students’ retrieval practice, as it is one of the most robust and efficient methods in learning science (Yang et al., 2021).

As the literature reports, practicing helps people acquire and apply skills more confidently (Jonides, 2004). To make the most of students’ practice time, we focus on two of the most effective learning techniques: retrieval practice with corrective feedback and task variability. The retrieval practice effect has proven to be one of the most robust findings in memory research in cognitive psychology (Karpicke, 2017), both in the laboratory (e.g., Karpicke & Blunt, 2011; Lim et al., 2015) and in a few real educational settings (Förster et al., 2018; Roediger et al., 2011; Schwerter, Dimpfl et al., 2022). With increased retrieval practice during the semester, students can more easily reflect on whether they are achieving their study goals and monitor their learning progress. By reviewing their performance on these self-tests, students can reflect on their achievements and identify areas for improvement. This approach can thereby support students in developing their self-regulation skills and empowers them to take charge of their learning by making informed decisions about their study habits (Alexander et al., 2011; Azevedo, 2009; Butler & Winne, 1995). Thus, retrieval practice with direct feedback can serve as a powerful tool for promoting self-regulated learning (Ifenthaler et al., 2023).

However, evidence on the interplay of retrieval practice, spacing behavior, and task variability in real educational settings with problem-solving exercises is missing. Accordingly, in this study we analyzed additional retrieval practice through weekly voluntary online exercises. We examined whether participating in weekly voluntary e-learning exercises with different versions and free choice of when to work on them helps students achieve higher grades at the end of the semester. We observed N = 67 students participating in an e-learning environment accompanying an advanced statistics course (on inferential statistics) over a whole semester. This third-semester course is designed for undergraduate social science students at a large public university in Germany. To address the challenge of self-selection, we used important predictors of student achievement (such as prior achievement, motivation, personality, and time preferences) as control variables, applying a double selection method (Belloni et al., 2014) to avoid overfitting. This approach responds to the call for future research to include affective prerequisites (Förster et al., 2018). Our study thus aims to contribute to the literature on retrieval practice by taking a closer look at students’ usage of voluntary online exercises in a real-life setting while controlling for important prerequisites.

1.1 Literature on Retrieval Practice and Related Concepts

Retrieval practice (also practice testing or self-testing), i.e., retrieving the knowledge under study without any stakes, is one of the most efficient learning techniques for later retention (Donoghue & Hattie, 2021; Dunlosky et al., 2013; Yang et al., 2021). It is a study technique that requires the student to set aside the learning material and try to recall information from memory. It applies desirable difficulties (Bjork, 1994), i.e., it imposes challenging conditions on students and consequently requires higher cognitive engagement. Although this initially seems to slow down the learning process, it improves later retention and transfer (Roediger III & Karpicke, 2006; Yan et al., 2014). Both are of particular importance in real educational settings like university courses, as topics within one course and courses within a study program build upon each other. Accordingly, knowledge learned at the beginning of a program (such as statistics) is needed to understand the material taught later. Retrieval practice improves delayed retention compared to re-reading (Roediger III & Karpicke, 2006), note-taking (McDaniel et al., 2009), verbal and visual elaboration of material (Karpicke & Smith, 2012), and the use of concept maps (Karpicke & Blunt, 2011; Lechuga et al., 2015).

Moreover, several studies have highlighted how this retrieval effect can be enhanced. For example, retrieval practice can be made more effective by giving learners tasks of higher difficulty that require comprehension and application rather than just memorizing discrete facts (Jensen et al., 2014). Regarding the difficulty level, it is unclear whether students must perform well during retrieval practice. Higher success in practice phases improved the retrieval effect (Racsmány et al., 2020). However, others have shown that performance during retrieval practice is not essential (Butler et al., 2017; Schwerter, Dimpfl et al., 2022). Additionally, the feedback literature shows that making errors does not harm but rather helps learners (Butler et al., 2011; Hays et al., 2013; Kornell et al., 2009). For example, Butler and Roediger (2008) found that feedback enables learners to correct incorrectly stored information. With feedback, answers that could not be retrieved were not discarded from memory (Kornell et al., 2011; Mundt et al., 2020; Wong & Lim, 2022). Feedback can even correct mistakes made with high confidence, a phenomenon called hypercorrection (Butler et al., 2011). Thus, students’ practice performance might not be crucial if the retrieval practice is accompanied by corrective feedback. Only if the retrieval practice exercises are too difficult may retrieval practice harm students’ learning (Carpenter et al., 2016; Karpicke et al., 2014).

Another option to enhance the retrieval practice effect is spaced learning, i.e., repeated retrieval distributed over time (Rawson et al., 2015). Spacing learning out over a more extended period is more beneficial for students than cramming before deadlines (Cepeda et al., 2006; Dempster, 1989), because memory traces are reinforced through repetition, counteracting the forgetting curve (Murre & Dros, 2015). The positive impact of retrieval practice and spacing on learning has been shown in many studies (Baker et al., 2020; Rodriguez et al., 2021a, 2021b), even independent of prior performance (Rodriguez et al., 2021a, 2021b). The combination of both approaches is particularly helpful for students (Rodriguez et al., 2021a, 2021b; Roediger III & Karpicke, 2006).

Additionally, it seems advisable in retrieval practice not to use the same question repeatedly but to use different questions targeted at the same learning goal (Butler et al., 2017). In a study in the geological sciences, Butler et al. (2017) demonstrated that increasing variability improves student learning, as students can transfer their knowledge to new examples of the same concept more quickly. One reason for this might be that variability helps students distinguish the critical features from interchangeable information and thus better identify the concept being learned (Butler et al., 2017).

Although most of the literature on retrieval practice used rather simple test materials such as single words, word pairs, text passages, and academic facts (Carpenter, 2012; Su et al., 2020), more challenging outcomes involving understanding and comprehension of complex, educationally relevant learning content are now also investigated (Butler, 2010; Carpenter, 2014; Karpicke & Aue, 2015). Similarly, the literature has expanded from showing improved recognition, cued recall, and free recall (Su et al., 2020), as well as transfer of factual and conceptual knowledge (Butler, 2010; Chan et al., 2006), to promoting superior critical evaluation of research articles (Dobson et al., 2018), analogical problem-solving performance using hypothesis-testing examples (Wong et al., 2019), and deep conceptual learning of scientific experimentation skills (Tempel et al., 2020). However, in statistics, a subject in which solving exercises is a natural and widely used practice, the retrieval practice effect has seldom been analyzed. One notable exception is a field study using quizzing as retrieval practice in HE (Förster et al., 2018). The quizzes, used during the semester in a statistics class, included multiple-choice questions. If students participated in the quizzes, their exam performance at the end of the semester improved. Similarly, but for mathematics in HE and using (mostly) open-ended questions, Schwerter, Dimpfl et al. (2022) showed that more retrieval practice in mathematics led to more exam points at the end of the semester, depending on students’ motivation, personality, time preferences, and prior achievement. In these two studies, it is unclear whether the retrieval practice using multiple-choice or open-ended questions improved students’ knowledge or whether the (combination of) testing (and feedback on the testing) encouraged spaced learning during the semester. Therefore, more research is needed to clarify whether a retrieval effect is observed in these studies or whether retrieval practice led to more spaced-out learning.

1.2 Prediction of Student Achievement in Higher Education

The prediction of exam grades is a prevalent topic in empirical research. This study also contributes to this literature as it includes a variety of predictor variables. Based on conceptual considerations and relevant theoretical and empirical work on students’ performance in higher education, we focused on student information (Benden & Lauermann, 2022), self-set course goals (van Lent & Souverijn, 2020), expectancy-value beliefs (Eccles et al., 1983), achievement goals (Elliot & McGregor, 2001), the Big Five personality traits (Digman, 1990), and time preferences (Frederick & Loewenstein, 2002). For example, student information such as prior achievement, employment responsibilities, and gender are essential predictors of exam grades (McKenzie & Schweitzer, 2001; Paechter et al., 2010; Schwerter, Wortha et al., 2022).

Regarding students’ motivation, operationalized by students’ achievement goals, there are mixed results on the effect of students’ mastery-approach and performance-approach goals on exam performance (Elliot et al., 1999; Harackiewicz et al., 2002; Plante et al., 2013; Yperen et al., 2014). Exam performance seems to be negatively associated only with mastery-avoidance and performance-avoidance goals (Baranik et al., 2010; Hulleman et al., 2010; Payne et al., 2007). Moreover, the relationship between motivation and performance can be demonstrated using students’ expectancy, value, and cost beliefs (e.g., Bailey & Phillips, 2016; Krause et al., 2012; Macher et al., 2015; Marsh & Martin, 2011; Wigfield & Eccles, 2000). Even though achievement goals and expectancy-value theory are related measures of student motivation, Plante et al. (2013) show that explanatory power increases when variables from both frameworks are included. Particularly for the case of e-learning, Dunn and Kennedy (2019) have shown that intrinsically motivated learners are diligent in completing e-learning exercises, while extrinsically motivated learners complete them more frequently.

In addition to motivation, the literature has documented the high importance of the Big Five personality traits for academic success (Komarraju et al., 2009; Rimfeld et al., 2016; Sorić et al., 2017). Last, concerning students’ time preferences, i.e., their inclination to prioritize immediate or future benefits, Bisin and Hyndman (2020) have shown that risk-averse students outperform risk-taking students in exams. Further, similar to Plante et al. (2013) in the context of motivation, Becker et al. (2012) underscored that time preferences complement personality traits and that both contribute to a better explanation of educational achievement. Since these variables serve as control variables in our study, we refer the reader to the cited literature for further details on each concept.

1.3 Present Study & Research Questions

To address the research gap mentioned above, we gave students weekly retrieval practice exercises in a statistics class and measured their effect on students’ exam performance. In contrast to the two most similar studies, Förster et al. (2018) and Schwerter, Dimpfl et al. (2022), we let students decide when to use this additional online learning opportunity. In comparison, Förster et al. (2018) allocated students a whole week to solve 4 or 5 (depending on the semester of the data collection) weekly quizzes, while in Schwerter, Dimpfl et al. (2022), students were allocated a constrained 60-min window on a specific day to solve three practice tests. The key distinction in the present study lies in students’ autonomy in deciding when to work on the exercises, allowing us to observe varying spacing behavior and to examine whether the retrieval effect persists irrespective of spacing during the semester. This is a novel approach not previously explored. Furthermore, the students were offered multiple versions of the same exercises, enabling them to practice the same topic using different exercise versions. This should enhance the retrieval effect due to exercise variability (Butler et al., 2017). In contrast to Schwerter, Dimpfl et al. (2022) but in line with Förster et al. (2018), we refrained from providing any incentive for engaging in the retrieval practice exercises, primarily because retrieval practice is considered a low-stakes practice opportunity. Offering incentives such as extra credit points for the exam could have increased students’ pressure or even been an inducement to cheat. Additionally, such external incentives might undermine intrinsic motivation (Deci et al., 2001). Hence, with our study design, we further contribute to the literature on retrieval practice opportunities as part of a university course. Lastly, as we observe students in a statistics course in HE, we also contribute to the general retrieval practice literature on applying knowledge to solve novel (target) problems using complex educational materials. The educational material is complex because it is composed of highly interactive and interconnected information elements (Karpicke & Aue, 2015; Wong et al., 2019). Analogous problem-solving requires procedural knowledge and the successive execution of rules to apply an algorithm to a new task (Wong et al., 2019).

Our study responds to the call for future research (Carvalho et al., 2022; Förster et al., 2018; Reeves & Lin, 2020; Schwerter, Dimpfl et al., 2022; Wong et al., 2019; Yang et al., 2021) in four ways. (i) We assess problem-solving with exercises in which students do not need to recall the solution but learn the steps to arrive at it and calculate the answer rather than stating whether a hypothesis-testing decision is true or false, i.e., knowing how to solve a problem rather than knowing the solution. (ii) We examine the difference between spaced-out learning during an HE course and cramming before the exam with regard to students’ exam performance. (iii) We include affective preconditions. (iv) Lastly, we conduct a field analysis in an HE gateway statistics course to increase the ecological validity of laboratory research. We were particularly interested in a statistics class because abundant literature has shown that statistics is a course that many students find difficult to master in HE (Vaessen et al., 2017) and that consequently leaves them affected by statistics anxiety (Condron et al., 2018). The specific research questions are as follows.

  • RQ1: Do students use the e-learning exercises even though they are voluntary and no external rewards are given (RQ1a)? When students practice, do they space out or cram the exercises (RQ1b)? Do students self-test each weekly exercise only once, or do they make multiple tries per week to take advantage of the exercise variability (RQ1c)?

  • RQ2: Do weekly retrieval practice (RQ2a), spacing (RQ2b), and multiple tries per week (RQ2c) result in more exam points?

  • RQ3: Are the effects of retrieval practice, spacing, and multiple tries per week on exam points robust when controlling for demographic information, prior achievement, expectancy-value variables, achievement goals, personality traits, and time preferences? Or does the effect vanish once the additional controls are included, implying that the effect in RQ2 is driven only by selection?

While several studies highlight that students seldom use retrieval practice to study (Susser & McCabe, 2013), Förster et al. (2018) and Schwerter, Dimpfl et al. (2022) showed that students in statistics and mathematics courses in HE do use voluntary practice opportunities. Thus, we expect that at least some students will use our retrieval practice opportunities. Furthermore, given that students are likely to procrastinate (Baker et al., 2019), we expect that most students will cram rather than space out their learning. In line with previous research (e.g., Tullis & Maddox, 2020), we also expect that most students will make only one try per week rather than multiple tries. Next, following Förster et al. (2018), we expect an unconditional practice effect. Finally, following Schwerter, Dimpfl et al. (2022), we expect to find a reduced but still significant conditional retrieval practice effect. However, as this has not been studied before, the effects of the free choice to space or cram on students’ practice are unclear, which highlights the need for further evidence from authentic HE settings.

2 Methods

2.1 Course Information

The course, Social Science Statistics 2, covers inferential statistics. It builds on the course Social Science Statistics 1 from the preceding semester and spans 15 weeks, with 13 lectures. The lectures are accompanied by a weekly tutorial session with mandatory attendance in which tutors present solutions to the problem sheets. If students miss more than two sessions, they cannot take the exam at the end of the semester. The requirement is thus attendance only, not preparation or active participation. A general overview of the course topics and the respective dates during the semester is given in Appendix Table 8.

At the center of the research design, students can practice each week’s topic with the help of e-learning exercises. These cover one to three of the weekly tutorial exercises, with the same framing and wording as in the tutorial but with new examples, following the concept of variability (Butler et al., 2017). The number of exercises depends on their respective length and difficulty.

The official exam took place at the end of the semester. The exam was divided into a first and a second trial, with the first being the main trial. The first trial took place one week after the end of the lectures, and the second trial would have occurred one week before the new semester started. However, due to the COVID-19 pandemic, the second trial was postponed several weeks into the next semester. Because of this unique situation, we exclude it from the analysis.

2.2 Design of E-Learning Exercises

This study aimed to investigate whether participating in additional e-learning exercises enhanced students’ achievement in inferential statistics. The exercises were provided weekly on a voluntary basis with direct automatic corrective feedback in the university’s online management system. Additionally, students saw how many points they had earned at the end of each exercise. This direct feedback guided them on which topics required further attention.

Within the e-learning exercises, students mostly had to apply or transfer knowledge from the lectures by working through calculation exercises. Some multiple-choice questions were also used so that open-ended questions could be avoided.

The e-learning exercises were uploaded weekly, but it was up to the students to decide if, when, and in which order they worked on them. If students crammed, they could work on all e-learning exercises in the last week or even the final days before the exam. Students were further allowed to retake the exercises as often as they wanted to improve their performance or refresh their memory right before the exam.

Additionally, each e-learning exercise had five different versions, i.e., students who repeated exercises did not necessarily receive the same exercise. Participating in the e-learning exercises was not connected to an additional external reward. We refrained from using external incentives because they may have undermined intrinsic motivation (Deci et al., 2001).

The time students could work on each exercise was limited by a timer. Thereby, we wanted to ensure that students focused on the exercises; the timer also resembled the setting of the exam. However, compared to the exam, students had twice as much time and could work with their learning materials.

2.3 Participants

Data were collected in 2019 during the second of two mandatory statistics courses for social science students at a large German public university. Data collection was restricted to students who took the exam at the end of the semester. About 80 students had registered for the exam, but only 67 ultimately took it. Of these students, 53 answered the survey (at least in part); their characteristics are summarized in Table 1. More than half of the students were female (58%).

Table 1 General sample information

2.4 Data

We collected the survey information within the first week of the course via an online survey. Usage data were saved when students worked on the e-learning exercises on the online study platform ILIAS. When students solved the same weekly exercise again, only the best attempt was saved. Table 2 shows the descriptive statistics of the variables we employed.

Table 2 Descriptive statistics for the outcome, variables of interest, and demographic information

For the analysis, the outcome variable was the number of points earned in the final exam. The maximum number of points on the exam was 90. The best student earned 87 points, and the passing cut-off was 40. The retrieval practice variable was the number of weekly e-learning exercises in which students participated. The mean of 5.20 therefore shows that, on average, students worked on five to six of the 13 e-learning exercises. Performance was measured as the mean points per exercise across the sessions in which students self-tested (performance). The mean number of trials per week measured how often students repeated a specific weekly exercise. Although we designed five different versions of each weekly exercise, most students used only one version. For spacing, we counted the number of weekly e-learning exercises students worked on within the first two weeks of their respective publication. The mean of spacing below one shows that only a fraction of students spaced their learning and that most crammed it into the last week before the exam.
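To make the construction of these practice variables concrete, the following sketch shows how they could be computed from per-attempt log data. It is a minimal illustration only: the data frame attempts, its column names, and the exact two-week window are assumptions for exposition, not the actual ILIAS export format.

```r
# Sketch only: 'attempts' and its columns are assumed, not the real ILIAS log format.
library(dplyr)

attempts <- tibble::tribble(
  ~student, ~week, ~date,        ~points, ~published,
  1,        1,     "2019-01-08", 8,       "2019-01-04",
  1,        1,     "2019-03-01", 9,       "2019-01-04",
  1,        2,     "2019-03-01", 7,       "2019-01-11"
) %>%
  mutate(date = as.Date(date), published = as.Date(published))

practice_vars <- attempts %>%
  group_by(student, week) %>%
  summarise(
    best_points = max(points),                 # only the best attempt per weekly exercise counts
    n_trials    = n(),                         # how often this week's exercise was repeated
    spaced      = any(date <= published + 14), # attempted within two weeks of publication
    .groups     = "drop"
  ) %>%
  group_by(student) %>%
  summarise(
    retrieval_practice = n(),                  # number of weekly exercises attempted (0 to 13)
    performance        = mean(best_points),    # mean points per attempted exercise
    mean_trials        = mean(n_trials),       # mean number of trials per attempted week
    spacing            = sum(spaced)           # weeks practiced within the two-week window
  )
```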

Next, we also collected students’ self-reported preparation for the weekly face-to-face tutorials. At the beginning of these face-to-face tutorials, students had to sign a list to prove attendance. When students signed the list, we additionally asked them, on a scale from 1 to 4, to what extent they had prepared themselves for the tutorial (1 = not at all, 4 = fully prepared). The variable face-to-face tutorial preparation was thus the mean preparation of students (between zero and four) over the 13 tutorial weeks. The attendance measure (missed face-to-face tutorials) captured the number of tutorials students missed, which was two to three on average. Within the sample, some students were retaking the exam, and thus the mean of the dummy Retaking Statistics 2 was slightly above zero. The mean high school GPA was about 2.6 (in Germany, the GPA ranges from 1, best, to 4, worst). As a subject-specific ability measure, we used performance in the course Social Science Statistics 1, which students should have taken the semester before. We standardized the number of points for the specific exam date to make it comparable. Further, we included variables indicating whether an individual had not yet passed, or had not yet taken, this exam.

In addition to the abovementioned variables, we asked students about their self-set goals for the practice and the exam. We asked how many of the e-learning exercises they planned to solve, whether they planned to take them weekly, how well they wanted to perform on them, and what grade they aimed to earn on the final exam. At the beginning of the semester, students planned, on average, to complete seven to eight e-learning exercises. Lastly, students aimed, on average, for a grade of 2.3 on the exam.

Finally, we surveyed standard measures of expectancy-value theory (Gaspard et al., 2017, adapted to the university context and course), achievement goals (Elliot & Murayama, 2008, translated and adapted for the specific context), the Big Five personality traits (Schupp & Gerlitz, 2014, taken as is), and present-bias preferences (Frederick & Loewenstein, 2002, translated). Summary statistics and Cronbach’s α are presented in Table 3. The respective measures are further described in Tables 10 and 11. Only for some of the Big Five personality trait constructs was Cronbach’s α below 0.7, but it remained above 0.6.

Table 3 Descriptive statistics for additional control variables
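As a side note, scale reliabilities of this kind can be computed with the psych package. The sketch below assumes hypothetical item columns (utility_1 to utility_3) in a data frame survey_items; it is not the authors’ actual scoring script.

```r
# Sketch only: 'survey_items' and the item names are assumed for illustration.
library(psych)

utility_items <- survey_items[, c("utility_1", "utility_2", "utility_3")]
rel <- psych::alpha(utility_items)   # Cronbach's alpha for one scale
rel$total$raw_alpha                  # reliability coefficient as reported in Table 3
```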

3 Statistical Analysis

OLS regression was performed to estimate the relationship between practice and exam points:

$$points_{i} = \mu + \rho^{\prime } practice_{i} + \beta_{1}^{\prime } Char_{i} + \beta_{2}^{\prime } EVT_{i} + \beta_{3}^{\prime } AG_{i} + \beta_{4}^{\prime } BF_{i} + \beta_{5}^{\prime } PBP_{i} + \beta_{6}^{\prime } SG_{i} + \varepsilon_{i} ,$$
(1)

where the index \(i\) stands for the individual and \(\varepsilon_{i}\) is the idiosyncratic error term. The outcome variable \(points_{i}\) is the number of points in the exam. To capture students’ practice behavior, the vector \(practice_{i}\) includes a (sub)set of the practice variables introduced in Table 2. Confounders may have influenced the practice variables; for example, motivation might have increased both additional practice and exam points. Thus, the practice variables might not only have measured the practice effect but also reflected the underlying motivation. Therefore, we included student characteristics (\(Char_{i}\)), expectancy-value beliefs (\(EVT_{i}\)), achievement goals (\(AG_{i}\)), Big Five personality traits (\(BF_{i}\)), present-bias preferences (\(PBP_{i}\)), and self-set goals (\(SG_{i}\)) as presented in Tables 2 and 3.

However, we faced the problem of having too many variables per student. Hence, we used variable selection methods to obtain a sufficiently sparse set of control variables, following the double selection procedure introduced by Belloni et al. (2014). Their suggestion is a two-stage procedure: First, variables are selected that explain exam points and each of the practice variables. Thereby, we obtained a sparse set of essential variables from \(Char_{i}\), \(EVT_{i}\), \(AG_{i}\), \(BF_{i}\), \(PBP_{i}\), and \(SG_{i}\) explaining students’ exam points and practice behavior. Second, we ran an OLS regression of exam points on the pre-selected variables and the practice variables. Assuming that the most important variables were surveyed in the first place, we cautiously interpreted the estimated practice coefficients as causal.

Following Belloni et al. (2014), we used the machine learning method LASSO for the feature selection. LASSO selects variables by imposing the L1 penalty \(\lambda \sum_{j} |\beta_{j}|\) on the regression coefficients. This penalty sets some coefficients to exactly zero, effectively removing the corresponding predictors from the model. The amount of shrinkage applied to the coefficients is controlled by the tuning parameter \(\lambda\). We followed the standard rule and used cross-validation to choose the largest \(\lambda\) whose cross-validated error lies within one standard error of the minimum, to prevent the model from over-fitting (Friedman et al., 2001). By setting coefficients to zero, LASSO is useful in situations with many predictors of which only a subset is relevant. However, it can struggle with highly correlated predictors. The Elastic Net combines LASSO and Ridge regression to overcome this difficulty: it additionally incorporates the Ridge penalty \(\lambda \sum_{j} \beta_{j}^{2}\), also called the L2 penalty. Like LASSO, the Elastic Net can set coefficients to zero, resulting in sparse selection, while the Ridge component helps handle highly correlated variables (Hastie et al., 2009). The Elastic Net estimates the coefficients by minimizing the following penalized least-squares objective:

$$\min_{\mu ,\rho ,\beta } \sum_{i} \left( points_{i} - \mu - \rho^{\prime } practice_{i} - \beta_{1}^{\prime } Char_{i} - \beta_{2}^{\prime } EVT_{i} - \beta_{3}^{\prime } AG_{i} - \beta_{4}^{\prime } BF_{i} - \beta_{5}^{\prime } PBP_{i} - \beta_{6}^{\prime } SG_{i} \right)^{2} + \lambda \sum_{j} \left( \left( 1 - \alpha \right) \beta_{j}^{2} + \alpha \left| \beta_{j} \right| \right) ,$$
(2)

in which \(\lambda\) is the penalty weight and \(\alpha\) weights the Ridge (L2: \(\beta_{j}^{2}\)) against the LASSO penalization (L1: \(|\beta_{j}|\)). Hence, Eq. (2) reduces to LASSO with \(\alpha = 1\) and to Ridge with \(\alpha = 0\). For the selection, we chose \(\alpha \in \{1, 0.8, 0.6, 0.4, 0.2\}\), i.e., we used LASSO as well as Elastic Nets with varying mixtures of the L1 and L2 penalties. We did not choose \(\alpha = 0\) because Ridge regression does not set coefficients to zero and thus does not yield a sparse set of variables.
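For illustration, a minimal sketch of how such a post-double selection could be implemented in R with the glmnet package is given below. The data frame dat, the variable names, and the small set of candidate controls are hypothetical placeholders, not the study’s actual code or variable list.

```r
# Sketch only: 'dat' and all variable names below are assumed for illustration.
library(glmnet)

controls <- c("hs_gpa", "stats1_points", "self_concept", "utility_value",
              "conscientiousness", "present_bias")           # candidate control variables
practice <- c("retrieval_practice", "mean_ftf_preparation")  # practice variables of interest

X <- as.matrix(dat[, controls])

select_vars <- function(y, X, alpha = 1) {
  # Cross-validated (elastic net) regression; lambda.1se is the largest lambda
  # whose CV error lies within one standard error of the minimum.
  fit  <- cv.glmnet(X, y, alpha = alpha)
  beta <- coef(fit, s = "lambda.1se")[-1, 1]   # drop the intercept
  names(beta)[beta != 0]                       # names of the selected controls
}

# Step 1: select controls that predict the outcome and each practice variable.
selected <- unique(c(
  select_vars(dat$exam_points, X),
  unlist(lapply(practice, function(p) select_vars(dat[[p]], X)))
))

# Step 2: OLS of exam points on the practice variables plus the union of selected controls.
form <- reformulate(c(practice, selected), response = "exam_points")
summary(lm(form, data = dat))
```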

For the post-double selection regressions, we used multiple imputation to include all 67 observations, as some students did not respond to all questions in the survey. We used 100 imputations and then pooled the results. While the standard in the educational literature using R is the package mice (Lüdtke et al., 2017), we used a classification- and regression-tree-based method. Akande et al. (2017) and Murray (2018) argued against using mice with predictive mean matching (PMM) because it is too inflexible and recommended tree-based methods instead, especially for mixed data types and non-linear interactions between variables, as well as for high-dimensional data. Further, Madley-Dowd et al. (2019) showed that tree-based methods reduce bias even when the proportion of missing values is large. Therefore, we applied missForest (Stekhoven & Bühlmann, 2012) to all variables. MissForest is a random-forest-based imputation method used to handle missing values in data sets containing different types of variables. By averaging over unpruned classification or regression trees, missForest imputes the missing values. It estimates the imputation error using the out-of-bag error estimates of the random forest, eliminating the need for a test set (Stekhoven & Bühlmann, 2012). Descriptive statistics did not reveal any differences between the samples of complete and incomplete cases. Tables comparing both samples are available upon request.
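To illustrate the imputation and pooling step, a minimal sketch with the missForest package follows. The data frame dat, the outcome and practice variable names, and the simple averaging of coefficients across completed data sets are assumptions for illustration rather than the authors’ actual pipeline (a full analysis would also pool variances, e.g., via Rubin’s rules).

```r
# Sketch only: 'dat' and the variable names are assumed; pooling here simply
# averages coefficients across the completed data sets.
library(missForest)

set.seed(1)
completed <- lapply(1:100, function(m) {
  missForest(dat)$ximp              # random-forest imputation of all variables
})

fits <- lapply(completed, function(d) {
  lm(exam_points ~ retrieval_practice + mean_ftf_preparation, data = d)
})

pooled_coefs <- Reduce(`+`, lapply(fits, coef)) / length(fits)
pooled_coefs
```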

4 Results

4.1 Participation in Exercises and Correlates

First, we report the results of the correlation analyses. Figure 1 illustrates the relationships between the five practice variables and exam points. The correlation between the number of retrieval practice attempts and the mean performance in these attempts was high \((r = 0.73)\) because students who never participated in the e-learning exercises also had no performance, and we did not observe students with high participation but low performance in these exercises. Similarly, the number of trials correlated with retrieval practice attempts \((r = 0.75)\) and with the respective performance \((r = 0.67)\). The correlation between spaced learning and retrieval practice was \(r = 0.59\), between spaced learning and performance \(r = 0.40\), and between spaced learning and the mean number of trials per week \(r = 0.45\).

Fig. 1 Correlation plot. Note: The diagonal shows the respective one-dimensional distributions, the lower half shows the two-dimensional scatterplots, and the upper half shows the correlations. \(^{*} p < 0.05, ^{**} p < 0.01, ^{***} p < 0.001\)
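A pairwise plot of this kind (densities on the diagonal, scatterplots below, correlations above) can be produced, for example, with the GGally package. The sketch assumes the practice variables and exam points sit in a data frame practice_df; it is not the authors’ plotting code.

```r
# Sketch only: 'practice_df' with the plotted variables is assumed.
library(GGally)

ggpairs(
  practice_df[, c("retrieval_practice", "performance", "mean_trials",
                  "spacing", "ftf_preparation", "exam_points")],
  lower = list(continuous = "points"),      # scatterplots in the lower triangle
  upper = list(continuous = "cor"),         # correlation coefficients in the upper triangle
  diag  = list(continuous = "densityDiag")  # one-dimensional distributions on the diagonal
)
```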

The correlations between face-to-face tutorial preparation and retrieval practice \((r = 0.08)\), performance \((r = 0.16)\), and spacing \((r = 0.29)\) were low and insignificant, and even negative for the mean number of trials per week \((r = -0.03)\). All practice variables were positively correlated with exam points; only the correlation with the mean number of trials per week was statistically insignificant.

Regarding RQ1, we found that about two-thirds of the students who took the exam used the e-learning exercises to practice for it (RQ1a). Most students, however, used the e-learning exercises to repeat the topics right before the exam and did not space out their learning (RQ1b). Figure 1 also provides initial evidence of a positive relationship between more retrieval practice and spaced learning on the one hand and exam performance on the other (RQ2). In the following, we focus on the practice variables retrieval practice attempts, spaced learning, and mean FTF tutorial preparation, as these variables are the most important predictors in multivariate regressions (see Table 12). Additionally, Table 13 shows that the results are robust when we control for selection into using the e-learning exercises at least once. Since the regression results are robust, we exclude this variable from the subsequent analyses.

4.2 Effects of Retrieval Practice Variables on Exam Performance

Table 4 presents the regression results for the practice variables on final exam points without any additional control variables. The first column includes only the variable retrieval practice attempts to show whether retrieval practice with several e-learning exercises predicts more points in the final exam. The coefficient is 1.917 and highly significant. Thus, students who practiced one additional e-learning session increased their points, on average, by around 2. Since there were 13 sessions, students with full participation improved their score by 24.92 points, which corresponds to more than one entire grade. In column (2), we include the mean FTF tutorial preparation as a proxy for students’ offline practice behavior, which reduces the retrieval practice attempts coefficient to 1.695. Next, including the mean number of trials per participated e-learning session in column (3) does not substantively change the regression, and the coefficient itself is statistically insignificant. This missing significance is likely due to the very low variation already reported in Table 2. Lastly, once we include students’ spacing during the semester in column (4), the retrieval practice attempts coefficient decreases again, slightly, to 1.239. Additionally, the coefficient for spaced learning is 2.866 and significant at the 10 percent level: working on one additional e-learning exercise within two weeks of its publication would increase the exam score by almost 3 points. Since the adjusted \(R^2\) is highest in column (4), which includes retrieval practice attempts, mean FTF tutorial preparation, and spaced learning, we focus on these practice variables from now on. The estimated coefficients should nevertheless be interpreted cautiously because they could be biased due to omitted variables; therefore, we add additional control variables in the following subsection. Since spaced learning is additional information on how students self-tested, we first look at the post-double selection regression without spacing in Table 5 and then include spacing in Table 6.

Table 4 Main practice regression—sequential inclusion of practice variables
Table 5 Post-double selection regression results without spacing
Table 6 Post-double selection regression results with spacing

4.3 Post-double Selection Regression Results

In the second step, we report the results of the post-double selection regression analyses. The selected control variables in each column are a subset of the selected variables of the column to its left. This means that LASSO selected the most variables, and the subsequent Elastic Nets picked subsets of these variables. Introducing additional control variables in Table 5, columns (1) to (5), yielded a reduced but stable coefficient between 0.99 and 1.25 for retrieval practice attempts. Thus, the retrieval practice effect was almost halved but remained robustly statistically significant and substantial, even after including a rich set of control variables. Furthermore, the adjusted \(R^2\) increased from 0.285 in Table 4, column (2), without covariates, to 0.610 in Table 5, column (2). Thus, the covariates explained a slightly larger share of the variance in the dependent variable than the practice variables alone. This high adjusted \(R^2\) in Table 5 is further reassurance that we captured important variables explaining exam points, making it less likely that the estimated effect was driven solely by unobserved selection.

For the FTF tutorial preparation effect, including the control variables led to a meaningful change. The coefficient decreased to 2.917 in column (5) and was no longer significant. In contrast to retrieval practice, this implies that the estimate in Table 4, column (2), was upward-biased and partly explained by our rich set of control variables. The preparation might still be beneficial, but the estimated effect was driven by selection into more preparation.

Next, we re-ran the post-double selection regression, including the spacing variable in the post-selection regression. The results are presented in Table 6. First, the coefficient for retrieval practice attempts decreased to between 0.624 in column (4) and 0.809 in column (2) and was only significant at the 10% level (except in column 3). Thus, one additional weekly e-learning self-test would only yield an increase of around 0.7 or 0.8 points. However, the reduction in the retrieval practice coefficient due to the spacing variable might have occurred for two reasons. The first is that retrieval practice is necessary for spacing; thus, the spacing variable captures part of the retrieval practice effect, as depicted by the high correlation in Fig. 1. Without retrieval practice, there would not have been any spacing in our model. Second, the number of observations might have been too small to include both the practice variables and the additional control variables. Additionally, the spacing coefficient was between 2.046 (column 5) and 3.598 (column 3) and significant at the 5% or 10% level. Therefore, we conclude that retrieval practice with the help of our weekly e-learning exercises is helpful, and even more so if students’ practice is spaced out during the semester.

Table 7 shows which variables were selected by the respective Elastic Nets, the direction of their coefficients, and their significance levels. Prior achievement, measured by the standardized grade for Statistics 1, as well as self-concept, utility value, performance avoidance, conscientiousness, and neuroticism are selected in all specifications. The standardized grade for Statistics 1, utility value and mastery approach, retaking Statistics 2, present bias, and openness also have particularly high predictive power, shown by a robustly significant effect in the specifications in which they are selected. The results support the view that these variables are complements rather than substitutes, as each of them is selected. This is also in line with Plante et al. (2013) for EVT and achievement goals, as well as Becker et al. (2012) for personality and time preferences. More specifically, prior achievement in Statistics 1 was always selected and always had a positive, statistically significant relation with exam performance. Retaking Statistics 2 and openness, if selected, also had a positive, statistically significant relation. While students’ utility value was always selected, it demonstrated a positive, statistically significant relation with exam performance in only three out of five instances. Students’ mastery approach and present-bias preferences exhibited a negative, statistically significant relation with exam performance in all post-double selection regressions except for α = 0.2. Additionally, the negative, statistically significant relation of the aspired exam grade in two specifications means that students who aspired to a better (i.e., lower) grade had a better exam performance.

Table 7 Feature selection results

It is important to note that deviations from the literature could be driven by the high number of control variables. Although not the primary scope of this analysis, it is possible to conduct regressions per variable group to examine whether all variables point in the expected direction. For example, the negative coefficient of mastery approach when students’ self-concept and prior achievement are also included may arise because students with high self-concept and high prior achievement also tend to report a high mastery approach. In that case, self-concept and prior achievement might account for most of the explanatory power, leaving the remaining explanatory power of mastery approach to exhibit a negative effect. This could mean that students with lower self-concept and lower prior achievement who are still determined to master every topic may earn lower exam points.

5 Discussion, Limitations and Conclusion

This study analyzed the effect of voluntary, unrewarded e-learning exercises on students’ exam points at the end of the semester. In a university course on inferential statistics, additional exercises were offered to undergraduate social science students to practice the topics of the lectures and tutorials. Students’ practice behavior was analyzed with regard to the frequency and spacing of the usage of these exercises. Our results highlight the potential of e-learning tools in higher education teaching. In particular, we found that taking part in additional e-learning exercises improves students’ achievement. In contrast to most studies in this area, which were based solely on surveys (O’Brien & Verma, 2019), we add to the few examples of Förster et al. (2018) and Schwerter, Dimpfl et al. (2022), who additionally collected data within e-learning environments. Thus, we made use of the new possibilities of learning analytics to improve teaching and learning (Hellings & Haelermans, 2020).

Our study provides some support for the saying ‘practice makes perfect’ in a natural educational environment, to the extent that students benefited from more retrieval practice attempts in the additional e-learning exercises. Most students used our e-learning practice (RQ1a); however, in line with the literature on procrastination (Bisin & Hyndman, 2020), only a fraction of students spaced out their use of the exercises (RQ1b). Students also rarely used different versions of the same exercise but practiced a self-test only once per weekly topic (if at all), which is in line with current literature showing that additional practice opportunities are rarely used (Tullis & Maddox, 2020) (RQ1c). For future research, it would be worthwhile to investigate whether this selective usage of retrieval practice depends on students’ self-estimated ability to remember the information. Earlier research showed that learners do not continue to study the best-learned information but instead restudy the worst-learned content; however, they test the learned content selectively (Karpicke, 2009).

The positive effect of retrieval practice attempts on exam performance (RQ2a) confirms the general e-learning practice literature for lower-order learning using quizzes (Collins et al., 2018; Landrum, 2007; Panus et al., 2014) and for higher-order learning (Förster et al., 2018; Schwerter, Dimpfl et al., 2022), as well as the general retrieval practice literature (Baker et al., 2020; Hartwig & Dunlosky, 2012; Park et al., 2018; Rodriguez et al., 2021a, 2021b). More specifically, retrieval practice with one additional weekly e-learning exercise increased students’ final exam points in our study by 1 to 2 points. Our results not only confirm Förster et al.’s (2018) study but also extend it by including various important predictors of student achievement (such as motivation, personality traits, time preferences, and goals). After including these control variables, the effects of additional practice were reduced but remained statistically significant and substantial (RQ3). Overall, our results confirm that with the help of digital technology, in particular online quizzes, students can learn more efficiently and effectively (Morrison & Anglin, 2005) and most likely retrieve more information (Roediger III & Karpicke, 2006). The results are particularly interesting as social science students are known to have trouble with statistics (Vaessen et al., 2017). They could, thus, also be used to design interventions to counteract statistics anxiety.

Several factors are likely to drive the positive effect of retrieval practice on exam performance. First, practice leads to more efficient encoding of the information to be retrieved, stored, and/or recalled (Jonides, 2004; Roediger III & Karpicke, 2006). Second, experiencing knowledge gaps can lead to additional learning to fill these gaps (Karpicke, 2009). This additional learning results in the potentiation effect (Hays et al., 2013). Third, even if students failed to recall how to solve a problem correctly, they might learn from the error they made when solving the respective exercise (Kornell et al., 2009). Fourth, in this study, students also received knowledge-of-correct-response feedback immediately after each self-test. This most likely added to the positive effect of the retrieval practice, since research on feedback has demonstrated its effect in correcting errors or misconceptions (Hattie & Timperley, 2007; Wisniewski et al., 2020) and its general effect on student achievement (Attali, 2015; Attali & van der Kleij, 2017). Thus, the feedback likely amplified the error generation effect (Kornell et al., 2009), as students learned about their mistakes immediately after each self-test. Additionally, feedback may also have helped guide students in their learning (Kirschner et al., 2006).

Finally, our results showed that students who spaced out the self-tests had an additional benefit in their learning, i.e., one additional exercise done within the first two weeks of its publication yielded around three additional exam points (RQ2b). This can be explained by students’ deeper processing of the content, particularly if their learning was spread out over the whole semester (Collins et al., 2018; Jonides, 2004). Especially at the beginning of the semester, doing the additional exercises might help students follow the upcoming weeks’ topics, explaining this large effect. This also relates to the error generation effect (Kornell et al., 2009) mentioned above. Students who spaced out their learning might have benefited more from the following lectures, as these built on the past topics. Also, because the weekly topics built on each other, practicing during the semester meant some repetition of topics from earlier weeks. In this regard, it is noteworthy that “using retrieval practice selectively for well-learned information, rather than for all information, may be the most effective use of retrieval practice because benefits of testing occur only when students successfully retrieve information” (Tullis & Maddox, 2020, p. 140). Students thus benefited from counteracting the forgetting curve, i.e., the repetition of earlier study topics helped reinforce memory traces. However, the interpretation of the spacing effect is limited, since only one student managed to work on 6 of the 13 exercises within the first two weeks after publication. The relatively low spacing realizations, in turn, align with students’ well-known procrastination behavior in HE (Denny et al., 2018). Altogether, our results showed that the survey and intervention findings on retrieval practice and spacing can be transferred to natural educational settings in which additional e-learning exercises indirectly promote spacing.

Our results for the selected covariates are mostly in line with the literature: Prior achievement (Förster et al., 2018; Rodriguez et al., 2021a, 2021b), utility value (Brisson et al., 2017; Gaspard et al., 2017; Wigfield & Eccles, 2020), openness (Ziegler et al., 2018), and higher exam goals (van Lent & Souverijn, 2020) are important positive predictors of students’ exam achievement. Additionally, as higher present-bias preferences are known to explain students’ probability of procrastinating (Bisin & Hyndman, 2020), we find a negative relation to exam performance. Lastly, we add to the mixed results on the direction of the mastery approach effect: Contrary to Elliot et al. (1999) and Harackiewicz et al. (2002), but in line with Plante et al. (2013), we find a negative effect of mastery approach.

This study is limited by its relatively small sample. Also, we observed only one cohort of social science students at one university, which, at the same time, meant that no teacher or institutional effects needed to be controlled for. Nevertheless, although the external validity was already enhanced by the natural setting of the study, more research with larger samples is needed to replicate the results. A larger sample would also enable us to better estimate the effects of retrieval practice attempts, the number of trials per week, the respective performance and its development, and spacing. Furthermore, we only measured students’ preparation for the weekly face-to-face tutorial, which was supposed to capture students’ non-digital learning behavior. However, this variable was only self-reported and did not capture additional learning outside the e-learning environment. The literature shows that measurement error is a potential problem in student self-report measures. However, it is more likely to occur when students provide sensitive information such as GPAs (Wilson & Zietz, 2004) and when responding to items that address the main topic of the survey (Brenner & DeLamater, 2017). We would argue that (a) the question of whether students prepared for the face-to-face tutorial is not sensitive, and (b) it was not the primary focus of the survey. Therefore, we expect self-reported measurement error to be low in this context. In our setting, there was no possibility of assessing it in any other way.

There are two concerns when interpreting the positive effects of retrieval practice. Given our design and the ethical requirements, we could not conduct an RCT that would have given some students access to the retrieval practice exercises while denying it to others. Further, we could not separate the test-enhancing effect from the effect of the accompanying feedback. Future studies could use an RCT to determine the importance of feedback when students self-test. Altogether, we add to the literature on retrieval practice, which thus far has mainly surveyed or promoted study techniques, by showing that e-learning exercises that promote retrieval practice and include feedback help students learn and result in higher achievement. Although limited by contextual constraints, such as the lack of a traditional control group due to ethical concerns, the study demonstrated the added value of additional e-learning exercises in a natural, i.e., ‘noisy’, setting. Given the problems of replicating experimental research, it is important to show the robustness of the effects in different contexts.

Since only about two-thirds of our students used the additional practice opportunities, it is worthwhile to reflect upon the practical implications of our research. We showed, in line with previous research, that additional practice improves students’ exam performance and that spacing out the participation further enhances students’ learning. Thus, future (statistics) courses could be designed in such a way that students are motivated to (a) use the e-learning exercises and (b) space out their learning to benefit from the potentiation effect. In our study, we refrained from providing further incentives for students to participate in the additional practice to avoid crowding out intrinsic motivation. Nevertheless, providing students with reasons why practice is important might already support their participation and reduce procrastination. Ideally, such e-learning exercises can also support faculty (or teaching assistants in higher education) and help them support students in their learning.