1 Introduction

Precision medical treatments, including image-guided radiotherapy, require accurate segmentation of the target tumor [1]. Computed tomography (CT), the standard-of-care imaging modality, lacks sufficient soft-tissue contrast, which makes tumor boundaries difficult to visualize, especially for tumors adjacent to soft-tissue structures. With the advent of MRI simulator technologies, radiation oncologists can delineate target structures on MRI acquired in simulation position; these contours must then be transferred by image registration to planning CTs acquired at a different time in treatment position for radiation therapy planning [2]. Image registration is itself prone to errors, so accurate segmentation on CT is more desirable for improving the accuracy of clinical radiation treatment margins. Moreover, because simultaneously acquired CT and MR scans are rarely available, current methods are restricted to CT alone. We therefore developed a novel approach, called cross-modality educed deep learning (CMEDL), that uses unpaired cross-domain adaptation between unrelated CT and MR datasets to hallucinate MR-like, or pseudo MR (pMR), images from CT scans. The pMR image is combined with the CT image to regularize the training of a CT segmentation network, which is accomplished by aligning the CT features with the pMR features during training (Fig. 1).

Ours is not a method for data augmentation using cross-domain adaptation [3,4,5]. Our work is also unlike methods that seek to reduce dataset-shift differences between scans of the same imaging modality [6,7,8]. Instead, our goal is to maximize segmentation performance in a single, less informative imaging modality (CT) by using learned information that models the latent tissue relationships with a more informative modality (MRI). The key insight is that features dismissed as uninterpretable on CT can provide inference information when learning is guided by a more informative modality such as MRI.

Our approach is most similar in its goal, computing shared representations to improve segmentation, to the work of [9], in which several shared representations between CT and MRI were constructed using fully convolutional networks. Our approach, which is based on GANs for cross-modality learning, also shares some similarities with [10], which used a GAN as a backbone framework and implemented dual networks to perform segmentation on both CT and MRI. However, our approach differs substantially from prior work in its use of cross-modality tissue relations as priors to improve inference in the less informative source (CT) domain. Although applied here to segmenting lung tumors, the method is generally applicable to other structures and imaging modalities.

Our contributions in this work are as follows: (i) we developed a novel approach to generate segmentations on CT by leveraging more informative MRI through cross-modality priors; (ii) we implemented this approach on two different segmentation networks to study the feasibility of segmenting lung tumors located in the mediastinum, an area with diminished contrast between tumor and the surrounding soft tissue; and (iii) we evaluated our approach on a large dataset of 637 tumors.

2 Methods

We use a supervised cross-modality and CT segmentation approach with a reasonably large number of expert-segmented CT scans (\(\{X_{CT}, y_{CT}\}\)) and a few expert-segmented MR scans (\(\{X_{MR}, y_{MR}\}\), where \(N_{X_{MR}} \ll N_{X_{CT}}\)). The cross-modality educed deep learning (CMEDL) segmentation consists of two sub-networks that are optimized alternately. The first sub-network (Fig. 1A) generates a pMR image from a CT image. The second sub-network (Fig. 1B) trains the CT segmentation network constrained by features from a companion network trained on the pMR images. The alternating optimization regularizes both the segmentation and the pMR generation, so that the pMR is specifically tuned to increase segmentation accuracy. In other words, the pMR acts as an informative regularizer for CT segmentation, while the gradients of the segmentation errors constrain the generated pMR images.

Fig. 1.

Overview of the approach. \(x_{c}\) is the CT image; \(x_{m}\) is the MRI image; \(G_{C \rightarrow M}\) and \(G_{M \rightarrow C}\) are the CT-to-MRI and MRI-to-CT transfer networks; \(x_{m}^{'}\) is the pseudo MR image translated from \(x_{c}\); \(x_{c}^{'}\) is the pseudo CT image translated from \(x_{m}\).

2.1 Cross-domain Adaptation for Hallucinating Pseudo MR Images

A pair of conditional GANs [11] is trained with unpaired CT and T2-weighted (T2w) MR images arising from different sets of patients. The first GAN transforms a CT image into a pseudo MR (pMR) image (\(G_{C \rightarrow M}\)), while the second transforms an MR image into its corresponding pseudo CT (pCT) image (\(G_{M \rightarrow C}\)). The GANs are optimized using the standard adversarial losses (\(L_{adv} = L_{adv}^{CT}+L_{adv}^{MR}\)) and cycle-consistency losses (\(L_{cyc} = L_{cyc}^{CT} + L_{cyc}^{MR}\)). In addition, we employed the contextual loss introduced for real-world images [12] to handle learning from image sets lacking spatial correspondence. The contextual loss facilitates such transformations by treating images as collections of features and computing a global similarity between all pairs of features (\(\{g_{j \in N}, m_{i \in M}\}\)) from the two images used in the domain adaptation. The contextual similarity is expressed as:

$$\begin{aligned} CX(g,m) = \frac{1}{N}\sum _{j} \max _{i} CX(g_{j},m_{i}), \end{aligned}$$
(1)

where N corresponds to the number of features. The pairwise contextual similarity \(CX(g_{j},m_{i})\) is computed by normalizing the inverse of the cosine distances between the features of the two images, as described in [12]. The contextual loss is then computed as:

$$\begin{aligned} L_{cx} = -\log \big (CX(f(G(X_{CT})), f(X_{MR}))\big ). \end{aligned}$$
(2)

The total loss for the cross-modality adaptation is then expressed as the summation of all the aforementioned losses. The pMR generated from this step is passed as an additional input for training the CT segmentation network.
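
For concreteness, a minimal PyTorch sketch of Eqs. (1)-(2) is given below; the function name, the bandwidth parameter h, and the assumption that the features are flattened to (N, C) and (M, C) matrices are illustrative choices rather than the exact formulation in [12].

```python
import torch
import torch.nn.functional as F

def contextual_loss(feat_g, feat_m, h=0.5, eps=1e-5):
    # feat_g: (N, C) features from the translated image, feat_m: (M, C)
    # features from the target-modality image (layout assumed for illustration).
    mu = feat_m.mean(dim=0, keepdim=True)
    g = F.normalize(feat_g - mu, dim=1)
    m = F.normalize(feat_m - mu, dim=1)

    # Pairwise cosine distances d_ij, normalized row-wise as in [12]
    d = 1.0 - g @ m.t()                                   # (N, M)
    d_norm = d / (d.min(dim=1, keepdim=True).values + eps)

    # Convert distances to similarities and normalize to obtain CX_ij
    w = torch.exp((1.0 - d_norm) / h)
    cx_ij = w / w.sum(dim=1, keepdim=True)

    # Eq. (1): CX(g, m) = (1/N) sum_j max_i CX_ij; Eq. (2): loss = -log CX
    cx = cx_ij.max(dim=1).values.mean()
    return -torch.log(cx + eps)
```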

2.2 Segmentation Combining CT with pMR Images

Our approach for combining the CT with the pMR images matches only information that is highly predictable from each other. This information typically corresponds to the features closest to the output, since the two images should produce identical segmentations. Therefore, the features computed from the last two layers of the CT and pMR segmentation networks are matched by minimizing the squared difference, or L2 loss, between them. This is expressed as:

$$\begin{aligned} \begin{aligned} L_{seg}&= \mathbb {E}_{x_{c}\sim X_{CT}}[-\log P(S_{MR}(G_{CT\rightarrow MR}(x_{c}))) -\log P(S_{CT}(x_{c}))] \\&+ \Vert \phi _{CT}(x_{c})-\phi _{MR}(G_{CT\rightarrow MR}(x_{c}))\Vert ^{2}_{F}, \end{aligned} \end{aligned}$$
(3)

where \(S_{CT}\) and \(S_{MR}\) are the segmentation networks trained using the CT and pMR images, \(\phi _{CT}\) and \(\phi _{MR}\) are the features computed from these networks, \(G_{CT \rightarrow MR}\) is the cross-modality network used to compute the pMR image, and F stands for the Frobenius norm.
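
A minimal sketch of Eq. (3) is shown below, assuming each segmentation network returns its logits together with the last-layer features being matched; the names (seg_ct, seg_mr, g_c2m) and the (logits, features) return convention are illustrative, not taken from the actual implementation.

```python
import torch
import torch.nn.functional as F

def cmedl_seg_loss(seg_ct, seg_mr, g_c2m, x_ct, y_ct):
    # Hallucinate the pseudo MR image from the CT input
    x_pmr = g_c2m(x_ct)

    # Each network is assumed to return (logits, last-layer features)
    logits_ct, feat_ct = seg_ct(x_ct)
    logits_mr, feat_mr = seg_mr(x_pmr)

    # Negative log-likelihood terms of Eq. (3) for both networks
    # (y_ct: integer label map of shape (B, H, W))
    nll = F.cross_entropy(logits_ct, y_ct) + F.cross_entropy(logits_mr, y_ct)

    # Squared feature-matching term (a mean-scaled Frobenius norm)
    feat_match = ((feat_ct - feat_mr) ** 2).mean()

    return nll + feat_match
```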

The total loss computed from the cross-modality adaptation and the segmentation networks is expressed as:

$$\begin{aligned} \text {Loss} = L_{adv} + \lambda _{cyc} L_{cyc} + \lambda _{cx} L_{cx} + \lambda _{seg} L_{seg}, \end{aligned}$$
(4)

where \(\lambda _{cyc}\), \(\lambda _{cx}\), and \(\lambda _{seg}\) are the weighting coefficients for each loss. During training, we alternately update the cross-domain adaptation network and the segmentation network with the following gradients: \(-\varDelta _{\theta _{G}}(L_{adv}+ \lambda _{cyc}L_{cyc}+ \lambda _{cx}L_{cx})\), \(-\varDelta _{\theta _{D}}(L_{adv})\), and \(-\varDelta _{\theta _{seg}}L_{seg}\). More concretely, the segmentation network is held fixed when updating the cross-modality translation network, and vice versa, in each iteration.
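
A minimal sketch of this alternating scheme is given below; the networks (g_c2m, g_m2c, d_ct, d_mr, seg_ct, seg_mr), the unpaired data loader, and the helpers generator_losses and discriminator_losses are assumed to be defined elsewhere, and all names are illustrative rather than those of the actual implementation.

```python
import itertools
import torch

# Illustrative optimizer setup; network and helper names are assumptions
opt_g = torch.optim.Adam(itertools.chain(g_c2m.parameters(),
                                         g_m2c.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(itertools.chain(d_ct.parameters(),
                                         d_mr.parameters()), lr=1e-4)
opt_s = torch.optim.Adam(itertools.chain(seg_ct.parameters(),
                                         seg_mr.parameters()), lr=2e-4)
lam_cyc, lam_cx = 1.0, 1.0

for x_ct, y_ct, x_mr in loader:   # unpaired CT (with labels) and MR batches
    # (1) Update the translation networks; segmentation networks are not stepped.
    l_adv_g, l_cyc, l_cx = generator_losses(g_c2m, g_m2c, d_ct, d_mr, x_ct, x_mr)
    opt_g.zero_grad()
    (l_adv_g + lam_cyc * l_cyc + lam_cx * l_cx).backward()
    opt_g.step()

    # (2) Update the discriminators on real images and detached fake images.
    l_adv_d = discriminator_losses(d_ct, d_mr, g_c2m, g_m2c, x_ct, x_mr)
    opt_d.zero_grad()
    l_adv_d.backward()
    opt_d.step()

    # (3) Update the segmentation networks; the translation networks are held
    # fixed (gradients reaching g_c2m are cleared at the next opt_g.zero_grad()).
    loss_seg = cmedl_seg_loss(seg_ct, seg_mr, g_c2m, x_ct, y_ct)
    opt_s.zero_grad()
    loss_seg.backward()
    opt_s.step()
```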

2.3 Segmentation Architecture

We implemented the U-net [13] and a dense fully convolutional network (denseFCN) [14] to evaluate the feasibility of combining hallucinated MR images for improving CT segmentation accuracy. These networks are briefly described below.

1. U-net: the network was modified with batch normalization after each convolution filter in order to standardize the features computed at the different layers.

2. Fully convolutional DenseNet (Dense-FCN): based on [14], this network computes dense feature maps through a sequence of dense feature blocks, whose outputs are concatenated with feature maps from earlier computations via skip connections. Within a dense block, each layer receives the concatenation of all previous feature maps in that block (see the sketch below). Because features are concatenated at every resolution, from the full image resolution down to the lowest, features at all levels are utilized, which provides an implicit deep supervision that stabilizes training.
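
The dense-block construction referenced above can be sketched as follows; the growth rate, number of layers, and layer ordering are example choices, not necessarily the configuration used in [14].

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Illustrative dense block in the spirit of Dense-FCN [14]: each layer
    receives the concatenation of all previously computed feature maps."""

    def __init__(self, in_channels, growth_rate=16, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            new_map = layer(torch.cat(features, dim=1))  # reuse all earlier maps
            features.append(new_map)
        # The block output is the concatenation of the newly produced maps
        return torch.cat(features[1:], dim=1)
```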

2.4 Implementation and Training

All networks were implemented using the PyTorch [15] library and trained end-to-end on a Tesla V100 GPU with 16 GB memory and a batch size of 2. The ADAM algorithm [16] with an initial learning rate of 1e-4 was used during training; the segmentation networks were trained with a learning rate of 2e-4. We set \(\lambda _{adv}=10\), \(\lambda _{cx}=1\), \(\lambda _{cyc}=1\) and \(\lambda _{seg}=5\). For the contextual loss, we used the convolution filters after Conv7, Conv8, and Conv9 due to memory limitations.

3 Datasets and Evaluation

We used patients from three different cohorts: (a) 377 patients with non-small cell lung cancer (NSCLC) [18] from The Cancer Imaging Archive (TCIA) [17], used for training; (b) 81 longitudinal T2-weighted MR scans (acquired on a Philips 3T Ingenia) from 21 patients treated with radiation therapy, also used for training; and (c) 637 tumors from contrast-enhanced scans of patients treated with immunotherapy at our institution, split into validation (N = 304) and test (N = 333) sets such that different patients were used for validation and testing. Early stopping was used during training to prevent overfitting, and the best model selected on the validation set was used for testing. Identical CT datasets were used for the CT-only and CMEDL approaches for equitable comparison. Expert segmentations were available on all scans.

Segmentation accuracy was evaluated using the Dice similarity coefficient (DSC) and the Hausdorff distance at the \(95^{th}\) percentile (HD95), as recommended in [19]. In addition, we computed the detection rate, whereby tumors with at least 50% DSC overlap with the expert segmentation were considered detected.
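
For reference, the DSC and detection-rate computations on binary masks can be sketched as follows (the HD95 computation is omitted; the function names are illustrative):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice similarity coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

def detection_rate(pred_masks, gt_masks, dsc_threshold=0.5):
    """Fraction of tumors whose predicted mask overlaps the expert
    segmentation with DSC >= 50%, i.e. the detection rate."""
    detected = [dice(p, g) >= dsc_threshold
                for p, g in zip(pred_masks, gt_masks)]
    return float(np.mean(detected))
```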

4 Results

4.1 Tumor Detection Rate

The CMEDL approach achieved the highest detection rates with both the U-net and DenseFCN networks on the validation and test sets. In comparison, the CT-only method resulted in much lower detection rates for both networks (Table 1).

Table 1. Detection and segmentation accuracy using the two networks.
Fig. 2.

Box plots comparing CT-only and CMEDL-based networks.

4.2 Segmentation Accuracies

The CMEDL approach resulted in more accurate segmentations than CT-only segmentation (Table 1). Both the U-net and denseFCN networks trained with the CMEDL approach were significantly more accurate than their CT-only counterparts on both the DSC (\(P < 0.001\)) and HD95 (\(P < 0.001\)) metrics. Figure 2 shows box plots for the validation and test sets for the two metrics and the two networks; P-values computed using paired, two-sided Wilcoxon tests are also shown.

4.3 Visual Comparisons

Figure 3 shows segmentation results produced by the different networks for representative cases when trained using CT only and with the CMEDL approach. For both networks, the CMEDL method closely follows the expert segmentation in regions that are missed by the CT-only networks. Figure 4 shows the feature map activations produced by the U-net trained with CT only and with CMEDL. The feature activations are minimal when using CT only but show a clear preferential boundary activation when the MR information is incorporated. Figure 4(b) also shows a pseudo MR image produced from a CT image (Fig. 4(a)).

5 Discussion

We developed a novel approach for segmenting lung tumors located in areas with low soft-tissue contrast by leveraging learned prior information from the more informative MR modality. These cross-modality priors are learned from unrelated patients and are used to hallucinate MR images that inform CT segmentation. Through extensive experiments with two different network architectures, we showed that leveraging a more informative modality (MRI) to guide inference in a less informative modality (CT) improves segmentation. Our work is limited by the lack of a sufficiently large MR dataset, which could potentially improve the accuracy of the cross-domain adaptation models. Nevertheless, to our knowledge, this is the first approach to use cross-modality information in this way to generate CT segmentations.

Fig. 3.

Representative segmentations produced using the CT-only and CMEDL-based versions of the U-net and DenseFCN networks. The Dice similarity coefficient (DSC) is shown for each method. Red corresponds to the algorithm, green to the expert, and yellow to the overlap between algorithm and expert. (Color figure online)

Fig. 4.

Feature map activations from the 21st channel of the last layer of the U-net: (a) the original CT, (b) the translated pMR image, (c) activation from the CT-only network, (d) activation from the pMR network, (e) activation from the CMEDL network.

6 Conclusions

We introduced a novel approach for segmentation on CT that leverages the more informative MR modality through cross-modality learning. Our approach, implemented on two different segmentation architectures, shows improved performance over CT-only methods.