Introduction

Gliomas are the most frequently occurring primary tumors of the brain [1]. Accurate segmentation of gliomas on clinical magnetic resonance imaging (MRI) scans plays an important role in making diagnosis, treatment decisions, and prognosis quantifiable and objective [2,3,4]. In current clinical practice, T1-weighted, post-contrast T1-weighted, T2-weighted, and T2-fluid attenuated inversion recovery (FLAIR) sequences are required to characterize the different components and to assess the infiltration of the surrounding brain parenchyma [5, 6]. Glioma segmentation requires the radiologist to distinguish tumor tissue from the surrounding healthy tissue [7], and the segmented region of interest or volume of interest can be used to compute radiomic features and quantitative measurements [8, 9]. However, segmentation is a time-consuming task with high inter-observer variability [10, 11]. Automatic segmentation methods have therefore been sought, as these could yield consistent measurements while reducing the time radiologists spend on the task in daily practice. These developments have been driven by the annual multimodal Brain Tumor Segmentation (BraTS) challenge (http://braintumorsegmentation.org/). For the BraTS challenges, the organizing committee released multimodal scan volumes of a relatively large number of patients with glioma, after which different research groups aimed to construct machine learning algorithms (MLAs) to segment the gliomas automatically. The BraTS data were accompanied by corresponding segmentations, which served as the ground truth [11]. Recent developments in MLA-based automatic segmentation have helped to achieve higher precision [12]. Within the BraTS challenges, the MLAs that yielded the most accurate results included different 2D and 3D convolutional neural networks (CNNs) [13,14,15,16,17], including 3D U-Nets [18, 19].

Despite the large body of scientific literature covering this topic, a comprehensive overview and meta-analysis of the accuracy of MLAs in glioma segmentation is still lacking [20, 21]. Therefore, factors which enable the further development of MLAs for glioma segmentation remain partially elusive. The aim of the current study therefore was to provide a systematic review and meta-analysis of the accuracy of MLA-based glioma segmentation tools on multimodal MRI volumes. By providing this overview, the strengths and limitations of this field of research were highlighted and recommendations for future research were made.

Methods

The systematic review and meta-analysis was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [22]. Prior to initiation of the research, the study protocol was registered in the international open-access Prospective Register of Systematic Reviews (PROSPERO) under number CRD42020191033.

Papers that developed or validated MLAs for the segmentation of gliomas were reviewed. Literature was searched in MEDLINE (accessed through PubMed), Embase, and The Cochrane Library, between April 1, 2020, and June 19, 2020. No language restrictions were applied. The full search strings, including keywords and restrictions, are available in the Appendix. Studies describing MLA-based segmentation methodologies on MR images in glioma patients were included. Additional predefined inclusion criteria were as follows: (1) mean results were reported as dice similarity coefficient (DSC) score; (2) study results needed to be validated internally and/or externally. Letters, preprints, scientific reports, and narrative reviews were included. Studies based on animals or non-human samples or that presented non-original data were excluded.

Two researchers independently screened the papers on title, abstract, and full text. Disagreements about non-consensus papers were resolved through discussion between both researchers. The investigators independently extracted relevant data from the included papers using a predefined data extraction sheet, after which the data were cross-checked. Data extracted from the included studies comprised the following: (a) first author and year of publication; (b) size of training set; (c) mean age of participants in the training set; (d) gender of participants in the training set; (e) size of internal test set; (f) whether there was an external validation; (g) study design, including the MRI sequences used and the segmentations which formed the ground truth; (h) architecture of the AI-algorithm(s); (i) target condition; (j) performance of the algorithm(s) in terms of DSC score, sensitivity, and specificity for both the training and the internal and/or external test sets. When studies performed external validation of the described AI-system(s), externally validated data were included in the data extraction tables. Data from the internal validation were used when studies solely carried out internal validation of the reported MLAs.

The quality of the included studies was not formally assessed, as a formal quality assessment is a well-known challenge in this area of research [23,24,25]. Nevertheless, Collins and Moons (2019) announced their initiative to develop a version of the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement tailored to machine learning methods [26]. Pinto dos Santos suggested on the European Society of Radiology website various items to take into consideration when reviewing literature regarding machine learning [27]. These items were included in this review.

Statistical assessment

An independent statistician was consulted to discuss the statistical analyses and approaches with regard to the meta-analysis. To estimate the overall accuracy of the current MLAs, a random effects model meta-analysis was conducted. To be included in the meta-analysis, studies needed to have reported the outcome of interest (i.e., DSC score), in combination with a standard deviation (SD), standard error (SE), and/or the 95% confidence interval (95% CI). For studies reporting the SE and/or the 95% CI, the SD was derived from these values [28]. Meta-analysis was performed on aggregated data of all studies providing suitable outcomes. Then, subgroup analyses were conducted on two separate target conditions, for studies describing the segmentation of either HGGs or LGGs.
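The conversion from an SE or a 95% CI to an SD can be sketched as follows. This is a minimal illustration and not the authors' actual procedure, which is not described beyond the citation to [28]. It assumes a symmetric, normal-theory confidence interval around a mean and a known sample size n; for small samples a t-quantile would normally replace the normal quantile. The function names are hypothetical.

```python
import math

# Two-sided 95% quantile of the standard normal distribution.
Z_95 = 1.959964


def se_from_ci(lower, upper, z=Z_95):
    """Standard error implied by a symmetric normal-theory confidence interval."""
    return (upper - lower) / (2.0 * z)


def sd_from_se(se, n):
    """Standard deviation of the observations, given the SE of their mean and n."""
    return se * math.sqrt(n)


def sd_from_ci(lower, upper, n, z=Z_95):
    """Standard deviation recovered from a 95% CI around a mean of n observations."""
    return sd_from_se(se_from_ci(lower, upper, z), n)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any included study.
    print(sd_from_se(0.02, 25))
    print(sd_from_ci(0.80, 0.88, 25))
```

The sample size matters here: the same CI width implies a larger SD when it comes from a larger test set.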

Statistical analyses were carried out using IBM SPSS Statistics (IBM Corp. Released 2017. IBM SPSS Statistics for Windows, Version 25.0. IBM Corp.). Variables and outcomes of the statistical assessment were presented as mean ± SD when normally distributed. When data were not normally distributed, they were presented as the median with range (minimum–maximum). Statistical tests were two-sided and significance was assumed when p < 0.05.

The DSC score represents an overlap index and is the most used metric in validating segmentation images. In addition to the direct comparison between automated and ground truth segmentations, the DSC score is a common measure of reproducibility [29, 30]. The DSC score ranges from 0.0 (no overlap) to 1.0 (complete overlap). In this meta-analysis, a DSC score of ≥ 0.8 was considered good overlap. A DSC score of ≤ 0.5 was considered poor.
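For two binary masks A (automated) and B (ground truth), the DSC is conventionally defined as 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flattened 0/1 voxel masks and applies the thresholds stated above. The function names, the "intermediate" label for scores between 0.5 and 0.8, and the choice to return 1.0 for two empty masks are assumptions made for illustration; the text does not specify them.

```python
def dice_score(pred, truth):
    """Dice similarity coefficient of two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled as tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total number of tumor voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumption: two empty masks count as complete agreement
    return 2.0 * intersection / total


def rate_overlap(dsc):
    """Apply the thresholds used in this meta-analysis (>= 0.8 good, <= 0.5 poor)."""
    if dsc >= 0.8:
        return "good"
    if dsc <= 0.5:
        return "poor"
    return "intermediate"  # hypothetical label; the text does not name this range


if __name__ == "__main__":
    automated = [1, 1, 1, 0, 0, 0]
    reference = [0, 1, 1, 1, 0, 0]
    score = dice_score(automated, reference)  # 2 * 2 / (3 + 3)
    print(round(score, 3), rate_overlap(score))
```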

The quantitative meta-analysis was partially carried out using OpenMeta[Analyst] software, which is the visual front-end for the R package (www.r-project.org; Metafor) [31]. Forest plots were created to depict the estimated DSC scores from the included studies, along with the overall DSC score performance. When the 95% CI of the different subgroup analyses overlapped, no further statistical analysis was carried out.

The heterogeneity of the included studies was tested with the Higgins I2-test. The Higgins I2-test quantifies inconsistency between included studies, where a value > 75% indicates considerable heterogeneity between groups. A low heterogeneity corresponds with a Higgins I2 between 0 and 40% [28]. Both the meta-analyses of the aggregated groups and the meta-analyses of the subgroups were performed using a random effects model, due to an observed high heterogeneity (Higgins I2 > 75%) between included studies [32].
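The random effects pooling and the Higgins I2 statistic can be illustrated with a short sketch. It assumes the DerSimonian-Laird estimator of between-study variance, which is a common default; the text does not state which estimator OpenMeta[Analyst] or Metafor was configured to use. The function name and the input values are hypothetical and are not data from the included studies.

```python
import math


def random_effects_meta(means, ses, z=1.959964):
    """Pool study means with DerSimonian-Laird random effects; also report I2 (%)."""
    k = len(means)
    if k < 2 or k != len(ses):
        raise ValueError("need at least two studies with matching means and SEs")
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / se ** 2 for se in ses]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, means)) / sum_w
    # Cochran's Q and its degrees of freedom.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, means))
    df = k - 1
    # Between-study variance tau^2, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Higgins I2: share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_re, means)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return {
        "pooled": pooled,
        "ci_low": pooled - z * se_pooled,
        "ci_high": pooled + z * se_pooled,
        "tau2": tau2,
        "i2": i2,
    }


if __name__ == "__main__":
    # Two hypothetical studies reporting a mean DSC with its standard error.
    print(random_effects_meta([0.80, 0.90], [0.01, 0.01]))
```

When I2 is high, tau2 dominates the within-study variances, so the random-effects weights become nearly equal across studies and the confidence interval widens relative to a fixed-effect analysis.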

To assess possible publication bias, a funnel plot was created by means of Stata (StataCorp. 2019. Stata Statistical Software: Release 16. StataCorp LLC.).

Results

Initially, 1094 publications were retrieved through database searching. An additional ten publications were identified through cross-referencing. After removing duplicates, the remaining 734 publications were screened. Based on the title and abstract, 509 papers were excluded. A total of 225 full-text articles were assessed for eligibility and 42 studies were included in the systematic review. Ten studies were eligible for inclusion in the meta-analysis as they provided sufficient quantitative data (i.e., only these studies provided the DSC score along with the SD for the performance of the MLA) (Fig. 1). Publications describing the use of (automated) segmentations to apply MLAs to classify molecular characteristics of gliomas (n = 135) were excluded. Fourteen papers were excluded as they described the use of MLAs on gliomas to perform texture analyses. Eleven papers did not report the DSC score and another 11 studies reported their data unclearly. Contacting the authors of these papers did not result in the acquisition of the needed data. Five studies did not report results of internal or external validation steps, whereas an additional three studies did not report data from the training group. Three studies described separate combined features, instead of a coherent MLA methodology. One study was excluded due to the inclusion of other brain tumors next to gliomas (e.g., metastases) (Fig. 1).

Fig. 1

PRISMA flowchart of systematic literature search
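The counts reported for the literature search can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text; the number of duplicates removed is not reported and is derived here as an assumption from the other counts. The function and variable names are hypothetical.

```python
def check_prisma_flow():
    """Verify that the reported search, screening, and exclusion counts add up."""
    retrieved = 1094            # database searching
    cross_referenced = 10       # identified through cross-referencing
    screened = 734              # remaining after duplicate removal
    excluded_on_title_abstract = 509
    full_text_assessed = 225
    included = 42

    # Reasons for full-text exclusion as listed in the text.
    full_text_exclusions = {
        "molecular classification": 135,
        "texture analysis": 14,
        "no DSC score reported": 11,
        "unclear data reporting": 11,
        "no validation results": 5,
        "no training-group data": 3,
        "separate combined features": 3,
        "other brain tumors included": 1,
    }

    # Derived, not stated in the text: records removed as duplicates.
    duplicates_removed = retrieved + cross_referenced - screened
    full_text_excluded = sum(full_text_exclusions.values())

    assert screened - excluded_on_title_abstract == full_text_assessed
    assert full_text_assessed - full_text_excluded == included
    return duplicates_removed, full_text_excluded


if __name__ == "__main__":
    print(check_prisma_flow())
```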

Review of the included studies

Based on the full-text analysis, 42 segmentation studies [13, 17, 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72] were included in the systematic review; their participant demographics and study characteristics are presented in Table 1. The MLAs used are also presented in Table 1 and comprised different types of CNNs [13, 17, 34, 35, 37,38,39,40,41,42,43, 45,46,47, 49,50,51,52,53, 55,56,57, 60, 61, 63,64,65, 67], random forest models [68,69,70], multiple classifier systems [33, 44], and an adaptive superpixel generation algorithm [60]. In addition, one study used semi-automatic constrained Markov random field pixel labeling [64], one study used an end-to-end adversarial neural network [71], and one study used a 3D supervoxel-based learning method [56].

Table 1 Participant demographics, study characteristics, and outcomes of the included studies and performance evaluation of MLAs of the included studies

Thirty-eight studies used different combinations of MRI sequences for brain tumor segmentation (Table 1) [13, 17, 33,34,35,36,37,38,39,40,41,42, 44, 45, 47,48,49,50,51,52,53,54,55,56,57, 59,60,61,62,63,64,65,66,67,68,69,70,71,72]. Only three studies used a single MRI sequence as input for segmentation [43, 46, 58]. One conference paper did not report which MRI sequences were used [56]. Four studies reported that they did not use (any part of) the BraTS datasets [36, 46, 50, 51]. Two of these papers used original data [46, 51]. The other two papers used either data from the Cancer Imaging Archive (TCIA) [50] or a combination of TCIA data and original data [36].

In 36 studies, the ground truth (i.e., segmentations) was derived from the BraTS dataset [13, 17, 33,34,35,36, 38,39,40,41,42,43,44,45, 47,48,49, 52,53,54,55, 57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72]. In two of these studies, the researchers added segmentations of additional original data; these were manually annotated by two experienced professionals working independently and following the BraTS segmentation protocol [54, 64]. In one paper, only original data with corresponding segmentations were used. These segmentations were made independently by two experienced professionals following the BraTS segmentation protocol [51]. Three papers used segmentations which were obtained without adhering to the BraTS segmentation protocol [36, 46, 50]. In one conference paper, the segmentation methodology was not described [56]. Please note that the ground truth segmentations of BraTS 2015 were first produced by algorithms and then verified by annotators, whereas the ground truth of BraTS 2013 fused multiple manual annotations.

The performance of the MLAs, in terms of sensitivity, specificity, and DSC score, is displayed in Table 1. All studies used retrospectively collected data. Nine studies focused specifically on the segmentation of HGGs, whereas seven studies focused on the segmentation of LGGs. The remaining studies (n = 31) described the segmentation of gliomas in general without the subdivision of LGG and HGG. Five of the included studies [33, 35, 38, 62, 65] described segmentation of multiple target conditions (i.e., segmentation of both HGG and LGG). For these studies, the results of each different target are displayed in Table 1 as well. All of the included studies conducted some version of cross-validation on the MLAs; however, only four studies [35, 36, 51, 64] performed an external validation of performance.

Nine studies [33, 35, 36, 38, 51, 62, 64, 65, 72] described the segmentation of HGGs in particular, with four studies [35, 36, 51, 64] externally validating the performance of the reported MLAs. Performance evaluation of the included studies in terms of the validated DSC score ranged from 0.78 to 0.90. MLA sensitivity ranged from 84 to 85% (n = 3) [33, 51, 64]. Only one study [33] presented the specificity rate (i.e., 98%).

Seven studies [33, 35, 38, 46, 50, 62, 65] described the segmentation of LGGs. External validation of the MLA was performed by one study [35]. The validated DSC score for the included studies ranged from 0.68 to 0.85. Sensitivity was 89% (n = 2) [33, 46], whereas specificity was 98% (n = 1) [33].

Meta-analysis of the included studies

The aggregated meta-analysis comprised twelve MLAs, described in ten individual studies [33, 36, 44, 47, 51, 54, 58, 62, 66, 72], and showed an overall DSC score of 0.84 (95% CI: 0.82–0.86) (Fig. 2). Heterogeneity was high (I2 = 80.4%), indicating that the studies differed significantly (p < 0.001).

Fig. 2

Forest plot of the included studies that assessed the accuracy of segmentation of glioma. Legend: DSC, dice similarity coefficient; CI, confidence interval. The forest plot shows that the performance of the MLAs in segmenting gliomas is centered around a DSC of 0.837 with a 95% CI ranging from 0.820 to 0.855

The results of the subgroup analysis of segmentation studies focusing on HGGs are depicted in Fig. 3. The overall DSC score for the five included studies [33, 36, 51, 62, 72] was 0.83 (95% CI: 0.80–0.87). The estimated I2 heterogeneity between groups was 81.9% (p = 0.001). Two studies [33, 62] focusing on the segmentation of LGGs were included in another subgroup meta-analysis. The overall DSC score was 0.82 (95% CI: 0.78–0.87) (Fig. 4). The estimated heterogeneity of the included groups was 83.62% (p = 0.013). Hence, the heterogeneity was high for both subgroup meta-analyses.

Fig. 3

Forest plot of the included studies that assessed the accuracy of segmentation of high-grade glioma. Legend: DSC, dice similarity coefficient; CI, confidence interval. The forest plot shows that the performance of the MLAs in segmenting HGGs is centered around a DSC of 0.834 with a 95% CI ranging from 0.802 to 0.867

Fig. 4

Forest plot of the included studies that assessed the accuracy of segmentation of low-grade glioma. Legend: DSC, dice similarity coefficient; CI, confidence interval. The forest plot shows that the performance of the MLAs in segmenting LGGs is centered around a DSC of 0.823 with a 95% CI ranging from 0.776 to 0.870

Publication bias

The funnel plot included the ten studies that were meta-analyzed (Fig. 5). The funnel plot showed an asymmetrical shape, suggesting publication bias among the included studies. In addition, not all studies were plotted within the pseudo-95% confidence limits, supporting the indication of possible publication bias [28].

Fig. 5

Funnel plot of the included studies. Legend: DSC, dice similarity coefficient; CI, confidence interval. DSC score was displayed on the horizontal axis as the effect size; SE was plotted on the vertical axis of the funnel plot

Discussion

Various MLAs for the automated segmentation of gliomas were reviewed. Although the studies were heterogeneous, the MLAs showed a good DSC score, with no differences between the segmentation of LGGs and HGGs. However, there were some indications of publication bias within this field of research.

Currently, segmentation of tumor lesions is a subjective and time-consuming task [58]. Replacing the current manual methods with an automated computer-aided approach could improve glioma quantification and, subsequently, radiomics. However, automated segmentation of gliomas is a challenging task, due to the large variety of morphological tumor characteristics among patients [11]. As HGGs usually show more heterogeneous MRI characteristics, their automated segmentation could be expected to be more challenging compared to LGGs. Furthermore, the low proliferative state of LGGs likely results in lower perfusion and higher diffusion values in affected tissue [73, 74]. Given these differences between HGGs and LGGs, a significant difference in automatic segmentation performance was expected; however, no performance difference was observed between the segmentation of HGGs and LGGs. It should be kept in mind that the ground truth segmentations were based on manual delineation by a (neuro)radiologist, so the performance of automatic segmentation could only be as good as the ground truth segmentations. In addition, the ground truth of BraTS 2015 was first produced by algorithms and then verified by annotators, whereas the ground truth of BraTS 2013 fused multiple manual annotations.

Although MLAs performing automated segmentation show quite promising results (overall DSC score of 0.84; 95% CI: 0.82–0.86), there is still no wide acceptance and implementation of these methodologies in daily clinical practice. One of the explanations for this can be found in the different MLA methodologies; different MLA approaches and their exact details have a significant impact on the outcomes, even when applied to the same dataset. For example, in the BraTS 2019 challenge, the top three with regard to the segmentation task comprised a two-stage cascaded U-Net [75], a deep convolution neural network [76], and an ensemble of 3D-to-2D CNNs [77].

Another reason may be the absence of standardized procedures on how to properly use these segmentation systems. There are substantial differences between advanced systems that offer computer-aided segmentation and the current standards for neuroradiologists, which impedes the integration of MLA methods. The limited availability of CE-certified software in clinical practice is one of the reasons for this impediment. Also, the purpose for which MLAs are used varies; where radiologists mainly use these techniques for follow-up, neurosurgeons mostly use MLAs for therapeutic planning. In addition, direct integration into the neuroradiologist’s daily practice without extra time spent on the task will be needed to make automatic glioma segmentation feasible. Moreover, the current automated segmentations still need to be supervised by trained observers. It seems more likely that implementation of MLAs in neuroradiology will lead to an interaction between doctor and computer, so that neuroradiologists will utilize more advanced technologies in establishing diagnoses [78]. The future implementation of MLAs in the diagnosis of glioma is of great clinical relevance, as these algorithms can support the non-invasive analysis of tumor characteristics without the need for histopathological tissue assessment. More specifically, automatic segmentations form the basis of further sophisticated analyses to clarify meaningful and reliable associations between neuroimaging features and survival rate [79, 80]. In conclusion, as automated segmentation of glioma is considered to be the first step in this process, the implementation of MLAs holds great potential for the future of neuroradiology.

Various publications were found on the automated segmentation of gliomas in the post-operative setting [81,82,83,84]. Quantitative metrics are believed to be needed for therapy guidance, risk stratification, and outcome prognostication in the post-operative setting. MLAs could also represent a potential solution for automated quantitative measurements of the burden of disease in the post-operative setting. As shown in Table 2, however, the DSC scores of these studies are lower than the DSC scores of the pre-operative MLA-based segmentations [81,82,83,84]. An explanation for these differences in performance could be the post-surgical changes of the brain parenchyma and the presence of air and blood products in the post-operative setting. Together these factors have been reported to affect the performance of MLAs [81].

Table 2 Overview of the studies on post-operative glioma segmentation

Several methodological shortcomings of the present meta-analysis should be considered. First, various studies were excluded from the quantitative synthesis due to missing data. Second, heterogeneity was considerably high in all analyses, probably caused by technical differences between the MLA methodologies for segmentation. Third, only four out of 42 studies performed an out-of-sample external validation, which emphasizes the importance of external validation for assessing robustness. Fourth, publication bias was probably present, as there is little incentive to publish poorly performing MLAs. In addition, differences in MR sequence input, ground truth, and other variables could play a role with regard to the outcomes, although this was considered a minor limitation as the source data were similar across most studies.

Future gains of research on this topic may include an ensemble approach, as this might significantly boost the performance of segmentation. Thus, in addition to focusing current research on training individual segmentation systems, it may be interesting to investigate the fusion of multiple systems as well (i.e., segmentation of different imaging features in order to obtain different imaging biomarkers) [11]. Lastly, all included studies used retrospectively collected data, and most of them used data from the BraTS databases. In order to further validate the performance of segmentation systems in clinical practice, larger-scale and externally validated studies are preferred. In addition, making data available and providing online tools or downloadable scripts of the MLAs used could significantly enhance future developments within this field of research.

Conclusion

This systematic review and meta-analysis shows that MLAs for glioma segmentation perform well. However, external validation is often not carried out, which should be regarded as a significant limitation in this field of research. Therefore, further verification of the accuracy of these models is recommended. It is crucial that quality guidelines are followed when reporting on MLAs, which includes validation on an external test set.