1 Introduction

Various studies have shown that doses to some cardio-vascular substructures may be critical factors in the observed heart toxicity and early mortality following radiotherapy (RT) for non-small cell lung cancer (NSCLC) [10, 14,15,16]. This may be attributed to irradiation of particular constituents of the cardio-pulmonary system. Currently, segmentation of cardio-pulmonary organs other than the whole heart and lung has been overlooked, and only these two organs are routinely defined as part of the treatment planning process. RT planning requires robust and accurate segmentation of organs-at-risk in order to maximize radiation to the disease location and to spare normal tissue as much as possible. Introducing a new set of organs places demands on both segmentation accuracy and segmentation time, as manual segmentation and contour refinement of these organs would add several hours of work in the clinic.

We built and validated a multi-label Deep Learning Segmentation (DLS) framework for accurate auto-segmentation of cardio-pulmonary substructures. The DLS framework utilized a deep convolutional neural network architecture to segment 12 cardio-pulmonary substructures [4] from Computed Tomography (CT) scans of 217 patients previously treated with thoracic RT. The segmented substructures are: Heart, Pericardium, Atria, Ventricles, Aorta, Left Atrium (LA), Right Atrium (RA), Left Ventricle (LV), Right Ventricle (RV), Inferior Vena Cava (IVC), Superior Vena Cava (SVC) and Pulmonary Artery (PA). We evaluated our framework on a hold-out dataset of 24 CT scans by calculating quantitative as well as qualitative validation metrics. A radiation oncologist qualitatively evaluated auto-generated contours for an additional set of 10 CT scans, finding that, on average, 85% of the non-overlapping substructure contours required no modifications and were acceptable for clinical use.

2 Methodology

Our approach utilizes a deep neural network for 2D segmentation of contrast as well as non-contrast enhanced thoracic CT images. The network auto-crops input CT scans around the lungs to extract the region of interest. The network is trained to perform multi-label prediction of eight non-overlapping, contiguous substructures: aorta, LA, LV, RA, RV, IVC, SVC and PA. Additionally, it is trained individually to segment the overlapping structures: the heart, pericardium, atria and ventricles. Output label predictions for the multi-label segmentation network and the overlapping structures were combined for each input scan, resulting in auto-segmentation of 12 cardio-pulmonary substructures.
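The merging step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class indices, structure names and dictionary layout are assumptions made for the example.

```python
import numpy as np

def combine_predictions(multilabel_map, overlap_masks):
    """Merge the 8-class argmax map with the separately predicted
    overlapping-structure masks into one dict of binary masks.

    multilabel_map : (H, W) int array; 0 = background,
                     1..8 = aorta, LA, LV, RA, RV, IVC, SVC, PA
                     (index assignment is hypothetical here)
    overlap_masks  : dict of (H, W) bool arrays for the overlapping
                     structures (heart, pericardium, atria, ventricles)
    """
    names = ["aorta", "LA", "LV", "RA", "RV", "IVC", "SVC", "PA"]
    combined = {name: multilabel_map == i + 1
                for i, name in enumerate(names)}
    combined.update(overlap_masks)  # overlapping structures kept as-is
    return combined

# Toy 2x2 slice: pixel (0,0) is aorta, pixel (1,1) is PA.
seg = np.array([[1, 0], [0, 8]])
masks = combine_predictions(seg, {"heart": np.ones((2, 2), bool)})
```

Because the overlapping structures come from separately trained models, their masks can legitimately intersect the eight exclusive labels (e.g. the heart mask covers the chambers), which is why they are stored as independent binary masks rather than folded into the argmax map.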

2.1 Experimental Datasets

Experimental data consisted of CT scans of 241 patients obtained from our institutional clinic. This data consisted of contrast as well as non-contrast enhanced images of varying imaging quality and resolution across different scanners. Manual expert segmentation of the 12 cardio-pulmonary organ-at-risk structures was considered ground truth and used for model training, testing and validation. 192 CT scans were utilized for model training, 24 CT scans were used for model testing and the remaining 24 CT scans were used for hold-out validation. These scans were auto-cropped around the lungs to extract the volume of interest around the heart substructures. 2D axial slices pertaining to each patient image volume were resized to \(512\times 512\) and normalized, resulting in a total of 10,284 training images. Network input data was augmented per batch; augmentation consisted of random cropping, random horizontal and vertical flipping, and rotation by ten degrees. Resulting auto-segmented 2D axial images were stacked back together to generate 3D segmentations without further post-processing.
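A per-batch augmentation step like the one described could be sketched as below. This is an illustrative NumPy version under stated assumptions (crop size, flip probabilities and the paired image/label handling are guesses, and the ten-degree rotation is omitted since it would need an interpolation routine such as scipy.ndimage.rotate):

```python
import numpy as np

def augment(image, label, rng, crop=480):
    """Random crop plus random horizontal/vertical flips, applied
    identically to the image and its label map. The paper also
    rotates by ten degrees; that step is omitted in this sketch."""
    h, w = image.shape
    top = rng.integers(0, h - crop + 1)    # random crop origin
    left = rng.integers(0, w - crop + 1)
    image = image[top:top + crop, left:left + crop]
    label = label[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                 # horizontal flip
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                 # vertical flip
        image, label = image[::-1, :], label[::-1, :]
    return image, label

rng = np.random.default_rng(0)
img, lab = augment(np.zeros((512, 512)), np.zeros((512, 512), int), rng)
```

The essential point is that every geometric transform must be applied jointly to the image and its label map, or the ground truth no longer aligns with the pixels.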

An additional dataset of 10 RT planning thoracic CT scans, for which no expert contours were available, was used for qualitative contour evaluation by a radiation oncologist to determine auto-generated contour acceptability for clinical use.

2.2 Network Architecture

Our approach, as depicted in Fig. 1, leveraged the deep neural network architecture of [1]. Convolutional neural networks (CNNs) and encoder-decoder neural networks have been successfully employed for medical image segmentation tasks [6, 7, 12, 13]. The DeepLab encoder-decoder network architecture with atrous separable convolutions uses spatial pyramid pooling to encode multi-scale contextual information and capture the spatial anatomical relationships of contiguous structures. Dense feature maps extracted at the end of the encoder path carry detailed semantic information. The decoder network robustly recovers structure boundaries through bilinear upsampling by a factor of 4, while applying atrous convolutions to reduce features before semantic labeling. We trained the network using ResNet-101 [5] as the encoder backbone with learning rate = 0.01 under the "poly" learning rate schedule [8], crop size = \(513\times 513\), batch size = 8, cross-entropy loss and output stride = 16 for 50 epochs of dense label prediction. Our approach was implemented using the PyTorch DL framework.

We also investigated the performance of various network loss functions and their influence on correct multi-label prediction. We trained our network with various segmentation losses on the same architecture backbone to account for varying structure sizes and class imbalance during training, and to determine the efficacy of modifying label prediction probabilities during back propagation for multi-label segmentation. The network was trained using cross entropy (CE), Multi-class Dice Loss (M-DSC), Generalized Dice Loss (G-DSC) [11] and a weighted combination (0.5 G-DSC + 0.5 CE), with pixel-wise CE resulting in superior segmentation performance. Cross entropy loss can be described as

$$\begin{aligned} L(\chi ;\theta ) = - \sum \limits _{x_i\in \chi } \log p(t_i|x_i;\theta ), \end{aligned}$$
(1)

where \(\chi \) denotes the set of input images, \(p(t_i|x_i;\theta )\) is the predicted probability of the target class \(t_i\) for pixel \(x_i\in \chi \), and \(\theta \) are the network parameters. A quantitative comparison of auto-segmentation results using the aforementioned network losses can be found in Table 2.
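For contrast with pixel-wise CE, the Generalized Dice Loss weights each class by the inverse square of its reference volume, so small structures contribute as much as large ones. A minimal NumPy sketch following Sudre et al. [11] (the flattened (classes, pixels) layout is an assumption of this example):

```python
import numpy as np

def generalized_dice_loss(probs, onehot, eps=1e-7):
    """Generalized Dice loss [Sudre et al.].

    probs  : (C, N) predicted class probabilities per pixel
    onehot : (C, N) one-hot ground-truth labels
    """
    # Per-class weight: inverse square of the reference volume,
    # so rare classes are not drowned out by large ones.
    w = 1.0 / (onehot.sum(axis=1) ** 2 + eps)
    intersect = (w * (probs * onehot).sum(axis=1)).sum()
    union = (w * (probs + onehot).sum(axis=1)).sum()
    return 1.0 - 2.0 * intersect / (union + eps)

# Toy check: two classes, four pixels, perfect prediction.
truth = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
perfect = generalized_dice_loss(truth, truth)   # near 0
swapped = generalized_dice_loss(truth[::-1].copy(), truth)  # near 1
```

The trade-off reported in Table 2 is consistent with this weighting: G-DSC protects small classes in aggregate but can destabilize boundaries of thin tubular structures relative to pixel-wise CE.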

Fig. 1.

Schematic overview of the proposed deep learning multi-label segmentation scheme. The network is trained on 2D CT images that are auto-cropped around the lung region of interest, augmented and batch normalized for dense label prediction.

Table 1. Criteria for scoring each cardiac substructure contour in clinical scoring, showing, for each grade, the number of CT image slices and the percentage of the structure's extent that need to be unapproved for minor modifications. NOA: Need of Adjustments.

2.3 Model Evaluation

We quantitatively evaluated the auto-generated segmentations by comparing the Dice Coefficient (DSC) and 95th Percentile Hausdorff Distance (HD95 (mm)) of 24 patients against expert clinical segmentations. Additionally, an expert qualitatively evaluated the auto-generated multi-label segmentations for an additional cohort of 10 thoracic CT scans to validate the clinical usability of the auto-contours. No expert contours were available for this additional validation dataset. The expert reviewed substructure contours on axial slices of the CT images and rated them on a four-grade scale: Good (requiring no adjustments), Acceptable (acceptable auto-contour deviations), Need of Adjustments (NOA) and Poor (requiring a larger number of slice adjustments). Rating was performed by counting the number of slices requiring contour adjustments relative to the average number of slices spanning each substructure. The criteria for clinical contour scoring are presented in Table 1.
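The two quantitative metrics can be computed from binary masks as sketched below. This is an illustrative 2D NumPy version (evaluation in the paper is presumably on 3D volumes with physical voxel spacing; the brute-force boundary distance here is for clarity, not efficiency):

```python
import numpy as np

def dice(a, b):
    """Dice Similarity Coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def boundary(mask):
    """Coordinates of mask pixels whose 4-neighbourhood leaves the mask."""
    p = np.pad(mask, 1)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return np.argwhere(mask & ~interior)

def hd95(a, b, spacing=1.0):
    """95th-percentile symmetric Hausdorff distance (brute force)."""
    pa, pb = boundary(a), boundary(b)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1) * spacing
    return np.percentile(np.concatenate([d.min(1), d.min(0)]), 95)

# Toy check: a 2x2 block vs. the same block shifted by one pixel.
a = np.zeros((6, 6), bool); a[2:4, 2:4] = True
b = np.zeros((6, 6), bool); b[2:4, 3:5] = True
```

Taking the 95th percentile instead of the maximum makes the surface-distance metric robust to a few outlier boundary pixels, which is why HD95 is preferred over the plain Hausdorff distance for contour evaluation.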

3 Results and Discussion

Table 2 compares the DSC evaluation metric for segmentations trained with cross-entropy (CE) loss against the other network training losses. Our experiments demonstrated that pixel-wise target class loss calculation using CE resulted in improved multi-label segmentation predictions when compared against Multi-class Dice Loss (M-DSC), Generalized Dice Loss (G-DSC) and a weighted combination (0.5 CE + 0.5 G-DSC). Although the DSC score for the Aorta, the largest substructure in the multi-label segmentation, improved as expected under the G-DSC loss, the accuracy of the smaller, tubular substructures was reduced.

Table 2. Multi-label segmentation comparison of eight cardio-pulmonary substructures between various network training loss configurations using the DSC evaluation metric. All training losses were implemented using the same network architecture and hyperparameters. Highest achieved accuracies are highlighted in bold.

Figure 2 displays the DSC score results for the 24 hold-out validation CT images for all 12 substructures segmented using the CE loss. Our achieved DSC accuracies are comparable to the state-of-the-art multi-atlas [9] and deep learning methods [3] for segmenting cardio-pulmonary substructures from CT images. The highest segmentation accuracy was observed for the heart (median DSC = 0.96, median HD95 = 3.48 mm), while the remaining structures achieved median accuracies of (0.81 \(\le \) DSC \(\le \) 0.94) and (3 mm \(\le \) HD95 \(\le \) 6 mm), with the highest HD95 surface distance accuracy observed for the Aorta. Figure 3 displays the qualitative contour results comparing the DLS contours against expert contours.

Fig. 2.

Dice Similarity Coefficient (DSC) Score results of the 24 thoracic RT CT images comparing the auto-generated DLS contours against the manually segmented expert contours for 12 cardio-pulmonary substructures.

Fig. 3.

Comparison of the auto-generated DLS contours (depicted in blue) against the expert delineations (depicted in green) for two patients (a) and (b) in axial, sagittal and coronal plane views. The Aorta, PA and SVC are visible in Axial Slice 1, whereas the four chambers: LA, LV, RA and RV, and the Aorta are visible in Axial Slice 2. A: aorta. (Color figure online)

Table 3 displays the clinical contour evaluation scores of the auto-generated contours for 10 thoracic RT CT scans using the grading criteria described in Table 1. The expert identified all needed adjustments as minor modifications, with contours in acceptable ranges for the IVC, SVC, PA, LA and LV (median adjustments ranging from 5% to 15%). The most required adjustments were observed in the RV, with a median of 24% of contours requiring modifications. Most of the minor adjustments were observed near the superior portion of the structures at transitions between CT image contours. This may be attributed to image artifacts introduced by heart motion and image acquisition.

Table 3. Qualitative evaluation of auto-generated segmentations of 10 thoracic RT patient CT scans. A radiation oncologist expert determined the percentage of each auto-generated structure segmentation in Need of Adjustments (NOA). Expert identified all required changes as minor modifications. Least and most adjustments were required for SVC and RV structures, respectively, for clinical acceptance and use.

The qualitative scoring and quantitative evaluation for the aorta and IVC are lower than expected because both substructure segmentations were extended onto image slices beyond the clinical contouring protocol. According to the clinical contour guidelines, these two substructures should not be contoured beyond two slices below the last contoured image slice of the heart in the axial plane. However, because the network saw few background CT images inferior to the heart contour during training, our model continued to segment the aorta and IVC wherever their edges were visible beyond the heart contour. This highlights the need for additional spatial coverage in the training data in order to generate clinically acceptable auto-segmentations as input to radiation treatment planning.

4 Conclusion

We propose a model for auto-segmentation of cardio-pulmonary substructures from contrast and non-contrast enhanced CT images. The proposed model reduced substructure segmentation time for a new patient from about one hour of manual segmentation to approximately 10 s. We demonstrated that the model is robust against variability in image quality characteristics, including the presence or absence of contrast. We validated our approach by quantitatively comparing the resulting contours against expert delineations. An expert concluded that overall 85% of the auto-generated contours are acceptable for clinical use without requiring adjustments. The resulting segmentations can effectively be utilized to study heart toxicity and clinical outcomes, as well as serve as input to radiation therapy treatment planning. We have applied our approach to auto-segment an additional 283 treatment planning CT scans to study heart toxicity outcomes for lung cancer. The developed cardio-pulmonary segmentation models have been integrated into deep learning tools within the open-source CERR [2] platform.