Abstract
Convolutional neural networks have advanced the state of the art in medical image segmentation. However, 3D deep learning segmentation networks face two challenges. First, the segmentation masks produced by deep learning networks lack shape constraints, often creating the need for post-processing. Second, training and deploying 3D networks require substantial memory resources. The memory requirement becomes an issue especially when the target organs cover a large footprint; commonly, down-sampling and up-sampling operations are needed before and after the network. To address the post-processing requirement, we present a new loss function that incorporates a level set based smoothing loss together with multi Dice loss, avoiding an additional post-processing step. The formulation is general and can accommodate other deformable shape models. Further, we propose a way to integrate down- and up-sampling into the network so that the deep learning network can work directly on the original image without a significant increase in memory usage. The 3D segmentation network with the proposed loss and sampling approach shows promising results on a dataset of 48 chest CT angiography images with 16 target anatomies, achieving an average Dice score of 79.5% in 4-fold cross-validation.
1 Introduction
Advances in deep learning segmentation methods have enabled faster 2D and 3D segmentation [1,2,3]. Compared to traditional methods, these networks use high-level deeply learned features from a receptive field. Compared to stacked 2D slice segmentation, 3D segmentation has a better chance of producing consistent and continuous object shapes. However, learning a 3D volume neural network segmentation faces two challenges. First, each voxel is classified using content from a receptive field of a certain size, but the overall shape of the object is not taken into account. Therefore, a post-processing step to further refine the segmentation is usually needed. To address this problem, Kamnitsas et al. [3] used fully connected conditional random fields (CRF) to refine brain lesion segmentation in a post-processing step. Lu et al. [4] used graph cut in post-processing. Level set methods are also often used as a post-processing step to refine the segmentation output of deep learning networks [2, 5, 6]. In the deployment stage, the deep learning step takes milliseconds while the post-processing step usually takes much longer. Thus, integrating the post-processing step into the learning of the network weights can further speed up and simplify segmentation at deployment. Tang et al. [7] proposed a deep level set method for liver CT and left ventricle MRI segmentation. They use level set to refine an initial segmentation from a network trained with limited data, and then backpropagate the loss between the refined segmentation and the deep learning output. However, their method does not have an explicit mathematical formulation of the integration. Hu et al. [8] proposed a framework that lets the network learn a level set function for objects instead of a probability map. However, their level set function is obtained by directly subtracting 0.5 from the probability map, which differs from the commonly used signed distance transform.
Second, 3D volume segmentation requires significant memory because of the huge number of weights to be learned. Constrained by the memory limit, usually small volumes, obtained either by downsampling the original image or by cropping smaller regions, are fed into a deep learning network. Milletari et al. [1] and Çiçek et al. [9] downsampled the original image before feeding it into the network and upsampled the output back. However, this downsampling discards information. Besides downsampling, Gibson et al. [2] also used batch-wise spatial dropout and Monte Carlo inference to reduce memory cost without affecting performance. Memory usage can also be reduced by using fewer kernels per layer or fewer layers. However, reducing the number of kernels reduces the number of learned latent features and increases the risk of a biased network, while reducing the number of layers shortens the network depth, resulting in a smaller receptive field and the loss of neighborhood information.
In this paper, our contribution focuses on addressing these two challenges. For the first challenge, we propose a novel way to integrate a level set energy function into the Dice based loss. We show that the proposed loss can drive the learning of the network weights such that the segmentation output has the smoothness property defined by a level set energy function. This smoothing energy is propagated back into the network to train a set of weights that outputs a smoother segmentation. For the second challenge, our strategy is to integrate downsampling and upsampling into the network so that it can process a larger volume. We evaluate the proposed method on 48 chest CTA datasets in which 16 anatomies are manually segmented. We demonstrate the efficiency of integrating post-processing into the deep learning network while reducing the processing time per volume to milliseconds.
2 Method
We first present the framework for integrating surface smoothing into deep learning training, followed by a modified segmentation network that handles large volumes while adding very few parameters to the network.
2.1 Integrating Level Set Energy into Network Loss Function
The softmax output of a segmentation network is bounded between 0 and 1. As such, the \(n_{th}\) output can be treated as a Heaviside function \(H^n(\mathbf {x})\) of a latent surface S and its corresponding level set embedding function \(\phi ^n(\mathbf {x})\) can be obtained using signed distance transform. From a given \(\phi ^n(\mathbf {x})\), the corresponding Heaviside function is approximated as \(H^n(\mathbf {x})=\frac{1}{2}(1+\frac{2}{\pi }arctan(\frac{\phi ^n(\mathbf {x})}{\epsilon }))\).
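As a concrete illustration, the smoothed Heaviside can be evaluated on an analytic signed distance function (a minimal NumPy sketch; the positive-inside sign convention and the sphere phantom are our assumptions for illustration, not part of the paper):

```python
import numpy as np

def heaviside(phi, eps=1.0):
    """Smoothed Heaviside H(phi) = 0.5 * (1 + (2/pi) * arctan(phi/eps))."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

# Analytic signed distance function of a sphere phantom
# (positive inside the surface -- an assumed sign convention).
grid = np.indices((32, 32, 32)).astype(float)
r = np.sqrt(((grid - 15.5) ** 2).sum(axis=0))
phi = 10.0 - r  # sphere of radius 10 centered in the volume
H = heaviside(phi)  # close to 1 inside, close to 0 outside, 0.5 on the surface
```

As \(\epsilon \rightarrow 0\) the function approaches a sharp 0/1 mask, while larger \(\epsilon \) gives a softer transition band around the surface.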
In the level set representation, smoothing a surface is equivalent to evolving its corresponding embedding function. Thus, the level set loss used for smoothing the surface of the \(n_{th}\) anatomy is defined as:

$$E_2^n(H^n(\mathbf {x})) = \int _{\varOmega } \delta ^n(\mathbf {x})\,|\triangledown \phi ^n(\mathbf {x})|\,d\mathbf {x}$$

where \(\varOmega \) is the image domain and \(\mathbf x \) is the voxel index. \(\delta ^n(\mathbf {x})\) is the derivative of \(H^n(\mathbf {x})\) with respect to \(\phi ^n(\mathbf {x})\), and is equal to \(\frac{\epsilon }{\pi (\epsilon ^2+\phi ^n(\mathbf {x})^2)}\).
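This surface-area term can be discretized directly, e.g. with central differences and unit voxel spacing (a sketch under those assumptions; the paper does not specify its discretization):

```python
import numpy as np

def dirac(phi, eps=1.0):
    # Smoothed Dirac delta: derivative of the arctan Heaviside w.r.t. phi.
    return eps / (np.pi * (eps ** 2 + phi ** 2))

def surface_area_loss(phi, eps=1.0):
    # Co-area approximation of the surface area:
    # sum over voxels of delta(phi) * |grad phi|, unit voxel spacing assumed.
    grads = np.gradient(phi)
    grad_mag = np.sqrt(sum(g ** 2 for g in grads))
    return float(np.sum(dirac(phi, eps) * grad_mag))
```

For a signed distance function \(|\triangledown \phi | \approx 1\), so the loss reduces to the integral of the smoothed delta across the interface.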
Different types of loss, such as cross entropy loss, Dice based loss for binary segmentation [1], or probabilistic Dice scores [2], have been proposed to train segmentation networks. In this paper we take the multi Dice loss, the sum of Dice scores over the different organs, as the example to integrate with the level set based surface energy.
Using \(H(\mathbf {x})\) to denote the group of \(H^n(\mathbf {x})\) for all anatomies, the overall loss to minimize can be written as:

$$E(H(\mathbf {x})) = E_1(H(\mathbf {x})) + E_2(H(\mathbf {x})), \quad E_2(H(\mathbf {x})) = \sum _{n=1}^{N} w_n \int _{\varOmega } \delta ^n(\mathbf {x})\,|\triangledown \phi ^n(\mathbf {x})|\,d\mathbf {x}$$

where \(E_1\) is the multi Dice based loss,

$$E_1(H(\mathbf {x})) = -\sum _{n=1}^{N} \frac{2\sum _{\mathbf {x}} H^n(\mathbf {x})\,g^n(\mathbf {x})}{\sum _{\mathbf {x}} H^n(\mathbf {x}) + \sum _{\mathbf {x}} g^n(\mathbf {x})},$$

and \(E_2\) is the level set based loss, defined as the weighted overall area of the segmentation surface of each anatomy. \(g^n(\mathbf {x})\) is the ground truth binary mask of the \(n_{th}\) anatomy, \(w_n\) is the weight used for different anatomies, and N is the number of anatomies.
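The multi Dice part of the loss can be sketched as follows (our own minimal NumPy version; the paper's exact implementation details, such as any smoothing constant, are not specified):

```python
import numpy as np

def multi_dice_loss(probs, gts, smooth=1e-6):
    """Negative sum of soft Dice scores over anatomies.
    probs: list of soft predictions H^n in [0, 1]; gts: list of binary masks g^n.
    The smooth constant (assumed here) guards against empty structures."""
    loss = 0.0
    for h, g in zip(probs, gts):
        dice = (2.0 * np.sum(h * g) + smooth) / (np.sum(h) + np.sum(g) + smooth)
        loss -= dice
    return loss
```

A perfect prediction of all N anatomies yields a loss of \(-N\); no overlap yields a loss near 0.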
For back propagation, we need to compute the gradient of the loss with respect to the network prediction \(H^n(\mathbf {x})\):

$$\frac{\partial E}{\partial H^n(\mathbf {x})} = \frac{\partial E_1}{\partial H^n(\mathbf {x})} + \frac{\partial E_2}{\partial H^n(\mathbf {x})}$$

in which the first part can be calculated as:

$$\frac{\partial E_1}{\partial H^n(\mathbf {x}_j)} = -2\,\frac{g^n(\mathbf {x}_j)\sum _i \left[ H^n(\mathbf {x}_i)+g^n(\mathbf {x}_i)\right] - \sum _i H^n(\mathbf {x}_i)\,g^n(\mathbf {x}_i)}{\left( \sum _i \left[ H^n(\mathbf {x}_i)+g^n(\mathbf {x}_i)\right] \right) ^2}$$

where i and j are voxel indices. The second term can be calculated with the chain rule as:

$$\frac{\partial E_2}{\partial H^n(\mathbf {x})} = \frac{\partial E_2}{\partial \phi ^n(\mathbf {x})} \cdot \frac{\partial \phi ^n(\mathbf {x})}{\partial H^n(\mathbf {x})}$$

where \(\frac{\partial \phi ^n(\mathbf {x})}{\partial H^n(\mathbf {x})}\) is difficult to solve analytically, so we approximate it as \( \frac{\triangle \phi ^n(\mathbf {x})}{ H^n(\phi ^n(\mathbf {x})+\triangle \phi ^n(\mathbf {x}))-H^n(\phi ^n(\mathbf {x}))}\). The gradient of \(E_2(H(\mathbf {x}))\) with respect to \(\phi ^n(\mathbf {x})\) is given as [10]:

$$\frac{\partial E_2}{\partial \phi ^n(\mathbf {x})} = -w_n\,\delta ^n(\mathbf {x})\,div\left( \frac{\triangledown \phi ^n(\mathbf {x})}{|\triangledown \phi ^n(\mathbf {x})|}\right) $$
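The finite-difference approximation of \(\partial \phi /\partial H\) can be checked against its analytic value \(1/\delta (\phi )\) (a sketch; the step size \(\triangle \phi \) used here is our choice, not the paper's):

```python
import numpy as np

def heaviside(phi, eps=1.0):
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def dphi_dH(phi, dphi=1e-4, eps=1.0):
    # Finite-difference approximation from the text:
    # delta_phi / (H(phi + delta_phi) - H(phi)).
    return dphi / (heaviside(phi + dphi, eps) - heaviside(phi, eps))
```

Since \(\partial H/\partial \phi = \delta (\phi )\), the approximation should approach \(\pi (\epsilon ^2 + \phi ^2)/\epsilon \) as \(\triangle \phi \rightarrow 0\).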
\(div(\frac{\triangledown \phi ^n(\mathbf {x})}{|\triangledown \phi ^n(\mathbf {x})|})\) is the mean curvature of the surface. Gradient descent on this term evolves \(\phi ^n(\mathbf {x})\) by the surface curvature in the direction of the surface normal, which results in a smoother surface. The sign of the curvature determines whether a point on the surface moves inward or outward along the surface normal.
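This curvature flow can be sketched numerically (central differences and an explicit Euler step; the step size, the small stabilizer eta, and the positive-inside sign convention are our assumptions):

```python
import numpy as np

def mean_curvature(phi, eta=1e-8):
    # div(grad phi / |grad phi|) via central differences.
    grads = np.gradient(phi)
    mag = np.sqrt(sum(g ** 2 for g in grads)) + eta  # eta avoids division by zero
    return sum(np.gradient(grads[d] / mag, axis=d) for d in range(phi.ndim))

def smoothing_step(phi, dt=0.1, eps=1.0):
    # One explicit gradient-descent step on the surface-area energy:
    # phi <- phi + dt * delta(phi) * curvature.
    delta = eps / (np.pi * (eps ** 2 + phi ** 2))
    return phi + dt * delta * mean_curvature(phi)
```

For a flat interface the curvature vanishes and the step leaves \(\phi \) unchanged; for a sphere (positive inside) the curvature is negative and the surface shrinks, which is what removes high-curvature detail.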
2.2 Segmentation Network Architecture
Inspired by guided image filtering [11], which uses an additional image to guide the filtering of a target image, we propose the architecture in Fig. 1, which uses the raw image to guide the upsampling of the low resolution segmentation maps.
The raw image is first downsampled by a single-kernel downsampling convolution (this can be replaced by average pooling), and the downsampled image is fed to VNet [1], in which we replace the last softmax layer with a PReLU layer. The VNet output is then upsampled by a deconvolution layer that preserves the number of channels, equal to the number of anatomies plus background. The deconvolution layer can be replaced by a bilinear resampling layer and a convolution layer. The deconvolution output is then concatenated with the raw image along the channel dimension, followed by two convolution and activation layers. The downsampling and upsampling add only 31698 weights when the number of anatomies n equals 16 (3 \(\times \) 3 \(\times \) 3 for the downsampling convolution layer, (n+1) \(\times \) 3 \(\times \) 3 \(\times \) 3 \(\times \) 2 \(\times \) (n+1) for the deconvolution layer and the following convolution layer, as well as (n+2) \(\times \) 3 \(\times \) 3 \(\times \) 3 \(\times \) (n+1) + (n+1) \(\times \) 3 \(\times \) 3 \(\times \) 3 \(\times \) (n+1) for the last two convolution layers); this number reduces to 513 for binary segmentation. Thus, most of the computation stays inside the VNet architecture, whose input size is half the original in each dimension. This allows processing a large image without adding much memory cost.
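The quoted weight counts can be verified from the stated layer shapes (a pure-arithmetic sketch; bias terms are excluded, matching the paper's count):

```python
def added_weights(n):
    """Weights added by the sampling layers for n anatomies (+1 background channel):
    a single 3x3x3 downsampling kernel, the (n+1)-channel deconvolution plus the
    following convolution, and the two convolutions after concatenating the raw
    image (input channels n+2, then n+1)."""
    k = 3 * 3 * 3
    down = k                                     # single-kernel downsampling conv
    deconv_and_conv = (n + 1) * k * 2 * (n + 1)  # deconvolution + following conv
    last_two = (n + 2) * k * (n + 1) + (n + 1) * k * (n + 1)
    return down + deconv_and_conv + last_two
```

With n = 16 this gives 31698 weights, and with n = 1 (binary segmentation) it gives 513, matching the figures in the text.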
2.3 Implementation Details
The method is implemented in Caffe and runs on one TITAN X GPU with 12 GB of memory. The proposed architecture is first trained using the multi Dice loss for 300 epochs until it converges. We then continue training with the proposed loss, which integrates the level set smoothing energy, for 15 epochs. Since the anatomies have naturally different surface curvatures, we set different smoothing weights for different anatomies: for the vertebrae, the myocardium and the left ventricle, we set the weight to 1e–5, while for the others we use a weight of 1e–4.
3 Experiments and Results
3.1 Data
We collected 48 cardiac CTA images annotated for 16 anatomical structures by one annotator. The 16 anatomies, similar to [12], are: sternum, ascending aorta, descending aorta, aortic arch, aortic root, left pulmonary artery, right pulmonary artery, trunk pulmonary artery, vertebrae, left atrium, right atrium, left ventricle, right ventricle, left ventricular myocardium, superior vena cava, and inferior vena cava. The cardiac CT studies were acquired on a Siemens CT scanner. All images have a voxel size of 1.5 mm in all directions.
3.2 Results
For the first stage of training, which does not include the level set integrated loss, we obtain an average Dice of 79.3% in 4-fold cross validation. After continued training with the level set based smoothing energy, we obtain a Dice of 79.5%. Note that the single expert who provided the ground truth segmentation was not asked to target smoothness, so adding the smoothing term does not necessarily improve the Dice coefficient. However, we can visually observe a smoother segmentation. Figure 2 shows two superior vena cava segmentation outputs generated from models trained with and without the level set smoothing energy. Some false positives due to image noise and lack of shape information are removed because of their high curvature.
As a qualitative way of understanding the effect of the new loss function on smoothing the structures, consider the case of the spine as illustrated in Fig. 3. The progressively smoother volumes after epochs 8 and 15 are visible. To better visualize the smoothing effect, we use a large weight (1e–4) in this example, for demonstration only. When applying this method to other applications, the number of training epochs and the weight \(w_n\) should be tuned as hyperparameters.
3.3 Performance Comparison
We compared our results with the state of the art multi atlas based segmentation method followed by corrective learning as post-processing [12]. Figure 4 shows a bar plot comparing Dice per anatomy for five methods: the multi atlas based method, the standard VNet on resampled volumes, the standard VNet followed by level set smoothing as post-processing, the proposed modified VNet architecture trained with multi Dice loss, and the proposed modified VNet architecture trained with our proposed loss. For the standard VNet, due to the memory limit, the input was downsampled to a voxel size of 2 mm \(\times \) 2 mm \(\times \) 3.5 mm and a volume size of 128 \(\times \) 192 \(\times \) 64.
From Fig. 4, we can see that the deep learning methods are comparable to the state of the art multi atlas based segmentation. The proposed modified VNet architecture trained with the proposed loss performs best among the deep learning methods. For small anatomies such as the aortic root, left pulmonary artery and superior vena cava, we obtain a larger performance boost than for large anatomies. A summary of the comparisons is shown in Table 1. An example of a segmented volume compared to the ground truth is shown in Fig. 5.
4 Conclusion
In this paper, we propose a new loss function that integrates the level set smoothing energy into the multi Dice loss, eliminating an additional post-processing step. We also propose a strategy for designing segmentation architectures that process large volumes while adding very few parameters. The method produces accurate and fast anatomy segmentation in CTA images; the new network boosts the average Dice from 0.766 to 0.795. The proposed framework for integrating level set energies with network training is general and can be extended to other types of level set energy functions.
References
1. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. IEEE (2016)
2. Gibson, E., et al.: Automatic multi-organ segmentation on abdominal CT with dense v-networks. IEEE Trans. Med. Imaging 37, 1822–1834 (2018)
3. Kamnitsas, K., et al.: Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 36, 61–78 (2017)
4. Lu, F., Wu, F., Hu, P., Peng, Z., Kong, D.: Automatic 3D liver location and segmentation via convolutional neural network and graph cut. Int. J. Comput. Assist. Radiol. Surg. 12(2), 171–182 (2017)
5. Hu, P., Wu, F., Peng, J., Bao, Y., Chen, F., Kong, D.: Automatic abdominal multi-organ segmentation using deep convolutional neural network and time-implicit level sets. Int. J. Comput. Assist. Radiol. Surg. 12(3), 399–411 (2017)
6. Veni, G., Moradi, M., Bulu, H., Narayan, G., Syeda-Mahmood, T.: Echocardiography segmentation based on a shape-guided deformable model driven by a fully convolutional network prior. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 898–902, April 2018
7. Tang, M., Valipour, S., Zhang, Z., Cobzas, D., Jagersand, M.: A deep level set method for image segmentation. In: Cardoso, M.J., et al. (eds.) DLMIA/ML-CDS 2017. LNCS, vol. 10553, pp. 126–134. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67558-9_15
8. Hu, P., Shuai, B., Liu, J., Wang, G.: Deep level sets for salient object detection. In: IEEE CVPR (2017)
9. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_49
10. Osher, S., Sethian, J.A.: Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79(1), 12–49 (1988)
11. He, K., Sun, J., Tang, X.: Guided image filtering. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 1–14. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15549-9_1
12. Wang, H., Prasanna, P., Syeda-Mahmood, T.: Fast anatomy segmentation by combining low resolution multi-atlas label fusion with high resolution corrective learning: an experimental study. In: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 223–226. IEEE (2017)
© 2018 Springer Nature Switzerland AG
Tang, H., Moradi, M., Wong, K.C.L., Wang, H., Harouni, A.E., Syeda-Mahmood, T. (2018). Integrating Deformable Modeling with 3D Deep Neural Network Segmentation. In: Stoyanov, D., et al. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. DLMIA ML-CDS 2018 2018. Lecture Notes in Computer Science(), vol 11045. Springer, Cham. https://doi.org/10.1007/978-3-030-00889-5_43