Keywords

1 Introduction

Driven by deep learning, artificial intelligence (AI) has attracted a widespread interest towards solving many challenging clinical problems. In medical AI applications, image segmentation is one of the mostly affected field from deep learning as it is often the first step for many image analysis tasks (shape analysis, volume measurements, and computer aided diagnosis). Since manual measurements are very expensive, time consuming, and prone to inter- and intra-observer variations, having an automated, accurate, and efficient segmentation tool is the ultimate goal in many medical systems. In the deep learning era, numerous works have been published, showing feasibility of deep learning in segmentation of radiology images. However, most of these works focus on new network architectures adopted to the medical problem, and they rarely consider the fundamental challenge of the medical AI: availability of precisely annotated data.

In this work, we propose a weakly-supervised segmentation method coupled with a deep geodesic prior to solve 3D medical image segmentation problem in a robust manner. This prior is mainly introduced to improve the performance of segmentation networks, more specifically when the annotations are noisy (i.e., not excellent). We argue that our proposed method is a significant step toward using inexpert and noisy annotations to train deep models for image segmentation without sacrificing the accuracy. The deep geodesic prior is specifically designed to put more attention in constructing accurate edges from weak labels. Although the proposed strategy is generic and can theoretically be applied to any medical image segmentation problem, herein we focus on cardiac MRI analysis due to its clinical significance and challenging nature of the MR imaging [1].

Related Works: One of the most successful deep-learning based segmentation methods is based on an encoder-decoder style architecture, called SegNet, proposed by [2]. Among many other architectures, U-Net [10] has became the most popular due to its properties such as efficient flow of low-level features through skip connections from decoder to encoder. To decrease the highly occupied parameter space of U-Net architecture(s), new network architectures were also presented: for instance, segmentation capsules (SegCaps) [7], densely-connected network [4] and Tiramisu [5] have shown drastic decrease in parameter space, while maintaining relatively good accuracy compared to baseline U-Nets.

The literature for integrating shape priors into image segmentation is vast, mostly from pre-deep learning era. A mainstream approach is to construct a shape prior from a set of training samples represented implicitly by signed distance functions [6, 8]. In the deep learning era, Zotti et al. [12] used image registration to align shape priors and created atlas(es) to guide segmentation. Simply, authors have used this atlas for adding an extra loss term to the segmentation network. Modeling a prior (in shape or appearance) from medical images is still a challenging task due to highly diverse appearance, shape, and size of the anatomical objects. The first attempt to model shape prior with deep features was done by training an auto-encoder (AE) for creating features from the binary labels [9]. The AE was trained to reconstruct the binary input images in its output with a fully-connected layer as a bottleneck to capture the shape features. Then, these shape features were integrated into the segmentation network through an appropriate loss term. While the work in [9] is promising, it is not entirely clear whether the local anatomical variations are captured in detail.

We hypothesize that, if modeled correctly, prior information can lead to a more robust segmentation even when the labels are noisy (i.e., labels annotated by non-experts). To test this hypothesis, we propose a novel method for learning the prior from the geodesic maps of multiple objects. Then, an AE-like network is used to generate the original binary images from their corresponding geodesic maps. Finally, the features from the trained AE are used as a prior to be integrated into the segmentor for better guidance and performance improvement.

2 Method

Our framework consists of two main components: (1) the segmentation network (or segmentor in short), and (2) the geodesic prior learning network. While our segmentation network assigns a class label to each pixel of the input image, our second network (AE) learns a prior from geodesic maps, generated for each object of interest (for multiple objects). We anticipate (and show later in the results section) that incorporating a well-designed prior into the segmentation network improves its performance, especially in the presence of inaccurate labels.

The overview of our approach is illustrated in Fig. 1. The segmentation network (\(Net_{seg}\)) (Fig. 1(a)) is an encoder-decoder architecture with 3D kernel convolutions. We utilize skip connections in the form of dense connection throughout this network [4]. For the prior learning network (AE), noisy annotations (or binary ground truths) are used to generate geodesic maps (Fig. 1(c)). Then, the geodesic auto-encoder (GAE) is designed to generate binary ground truths from geodesic maps. Once trained, GAE can be used to calculate two sets of bottleneck features: features resulted from feeding the geodesic map to GAE, and features resulted from feeding the corresponding \(Net_{seg}\)’s output probability map to GAE. Finally, the distance between these two feature vectors are used to form an extra term in the loss function of the \(Net_{seg}\) (Fig. 1).

Fig. 1.
figure 1

The proposed framework has two main components: (1) segmentation network: assigns a class label to each pixel, and, (2) AE which learns a prior from geodesic maps. Green arrows show the flow of training of the segmentor and red arrows show the flow of training of GAE. Note that in test phase only segmentor is used. Also, the in our method the noisy annotations are used for whole training process. (Color figure online)

We define the segmentation network as a function, mapping a gray-scale 3D input image \(I_i(I_i \in R^3)\) into a probability map \(P_i\), (\(i\in \{1, 2, \dots , N\}\)), N being number of 3D images:

$$\begin{aligned} P_{i} = Net_{seg}(I_i, \theta _{seg}), \end{aligned}$$
(1)

where \(\theta _{seg}\) are the parameters of \(Net_{Seg}\), trained to minimize \(\mathcal {L}_{tot}\) defined in Eq. 3. A geodesic map, on the other hand, is generated from ground truth binary images \(B_i\) as \(G_i = F_{geo}(B_i)\) and then a GAE network (\(Net_{gae}\)) is trained to generate binary image from this corresponding geodesic map (explained in Sect. 2.2). The GAE consists of an encoder, a fully connected (FC) layer, and a decoder. The encoder and FC layers are the feature extraction (\(Enc_{gae}\)) parts, mapping the input geodesic map to the feature vector \(Feat_{gae}=Enc_{gae}(G_i)\) of length \(L_{feat}\). The decoder (\(Dec_{gae}\)) reconstructs the corresponding binary image(s) from the \(Feat_{gae}\). Hence, the geodesic network can be formulated as:

$$\begin{aligned} \hat{B}_i = Net_{gae}(G_i, \theta _{gae}) = Dec_{gae}(Enc_{gae}(G_i, \theta _{enc}), \theta _{dec}), \end{aligned}$$
(2)

where \(\theta _{gae}\) are the parameters of \(Net_{gae}\), trained to minimize the binary map reconstruction loss \(\mathcal {L}_{recon}\) in \(Net_{gae}\). \(\mathcal {L}_{recon}\) is a cross-entropy loss between ground-truth and \(Net_{gae}\)’s output and \(\theta _{gae}=\theta _{enc}\,\cup \,\theta _{dec}\). Since \(Net_{gae}\) is designed to learn the relation between geodesic maps and their corresponding binary maps, \(Feat_{gae}\) contains high-level features inferred from shapes and texture of the objects of interests. This encoded knowledge can be used as an extra term of supervision for better training of \(Net_{seg}\). For each training sample i, we calculate a loss function \(\mathcal {L}_{gae}(Enc_{gae}(P_i), Enc_{gae}(G_i))\) to be back-propagated into the segmentation network along with loss of the segmentor itself. The total loss function of the segmentator is then represented as:

$$\begin{aligned} \mathcal {L}_{tot}=\sum _i \mathcal {L}_{seg}(P_i, B_i)+\mathcal {L}_{gae}(Enc_{gae}(P_i), Enc_{gae}(G_i)). \end{aligned}$$
(3)

We first train \(\theta _{gae}\) with \(\mathcal {L}_{recon}\). Once trained, \(\theta _{seg}\) is updated with the loss function \(\mathcal {L}_{tot}\), while \(\theta _{gae}\) are fixed.

Fig. 2.
figure 2

Dense Block (DB) content. C: concatenation operation, BN: Batch normalization, LReLu: Leaky ReLu activation, and 3D conv: 3D convolution with a \(3 \times 3 \times 3\) filter.

2.1 Network Architecture for Segmentation

We extend fully convolutional dense nets, called Tiramisu [5], from 2D to fully 3D. The details of the adapted network are shown in Fig. 1(a). The encoder and decoder include four dense blocks, each. Within dense blocks, there are four 3D convolution layers followed by a Batch Normalization (BN) layer with Leaky ReLu nonlinear activation. The size of the convolution kernels are set to \(3 \times 3 \times 3\) in all convolution layers (Fig. 2). The number of filters in the first convolution layer and the growth rate are set to 16 for an optimal performance after extensive explorations. The number of output filters in a dense block is \(N_f(\mathbf {X}_{ {l}})=N_f(\displaystyle \underset{{ {l}'=0}}{\overset{{ {l}'= {l}-1}}{\Vert }} N_f(\mathbf {X}_{ {l}'}))\), where \( {l}=\{1,2,\dots , {L}\}\) and \(\Vert \) is the concatenation.

The encoder includes four 3D max pooling operations as transition layers after each dense block. The pooling operation is set to downsample the input size by 2 in x-y plane. Downsampling is not applied to z direction due to its low resolution (but could be applied for other settings). Similarly, the decoder includes four up-samplers (using bilinear interpolation) as transition layers. Each up-sampler is designed to double the size of its input. Finally, the last layer contains a convolution layer following by a softmax function to introduce a notion of probability map in the output. We use Adam optimizer to minimize the \(\mathcal {L}_{tot}\) in \(Net_{seg}\). \(\mathcal {L}_{seg}\) and \(\mathcal {L}_{gae}\) are designed with Cross Entropy and Mean Square Error functions, respectively.

2.2 Learning a Geodesic Prior

Most existing literature related to prior incorporation into segmentation utilize accurate binary labels for extracting shape information [9, 12]. Unlike the mainstream studies, we propose to use geodesic maps to increase robustness of the priors when dealing with noisy labels, which has never been done before. This approach can particularly be beneficial when the object has complex boundary information to be delineated. In this study, geodesic maps are generated from labels, regardless of being noisy or clean. We expect the proposed geodesic map to capture more information then conventional shape priors.

For each object in our images, we compute an independent geodesic distance map from its binary map \(B_i\) by using the Fast Marching (FM) approach [11]. FM is a numerical method to solve boundary value problems of the Eikonal as:

$$\begin{aligned} {\left\{ \begin{array}{ll} F(x) |\varDelta T(x)| = 1 ,&{} \forall x \notin S,\\ F(x)=0, &{} \forall x \in S, \end{array}\right. } \end{aligned}$$
(4)

describing the evolution of a contour as a function of time, T(x), with the speed of F(x) in the normal direction at a point x on the propagating surface starting from the zero-level S. With a specified speed, F(x), the time when the contour crosses point x can be computed by solving Eq. 4. In this setting, the special case of \(F(x)=1\) gives the signed distance of every point x from S. In our case, since we have multiple objects (i.e., 3 objects: LV, RV, Myocardium), we defined S as the center of mass of the all closed objects. In our experimental setup, we have also an object with non-Jordan surface (i.e., Myocardium, having donuts shape). In order to include such objects in the geodesic computation, we simply define the the skeleton of the shape and the distances of all the points within each object are computed from their zero line contours S. These maps (obtained from each object) are combined in n-channels (i.e., 3 in our experiments: LV, RV, Mayo) and fed into the auto-encoder as described below.

The proposed AE architecture (\(Net_{gae}\)), for learning prior information, is illustrated in Fig. 1(b). This architecture is very similar to the segmentor with a FC layer in the middle (instead of a convolution layer) to generate deep geodesic features. Also, in order to increase the robustness of these features, there is no skip connections from encoder to decoder. Both encoder and decoders include four DBs and the filters size in each convolution layer was set to \(3 \times 3 \times 3\). The growth rate for the encoder part is set to 16 (empirically) as in the segmentor and Adam optimizer is used to minimize \(\mathcal {L}_{recons}\) (Cross Entropy loss).

3 Experiments and Results

To show the robustness of our algorithm and its performance on noisy labels, we ran all the experiments on both expert as well as two levels of noise in the labels (L1 and L2). We reported Dice Index (DI) and Hausdorff distance (HD) in Table 1. Also, in cases where we were dealing with noisy labels, there was an upper bound for the performance of the networks. This upper bound was due to lack of information in the presence of noise. To have a sense of this upper bound, for the sake of a more extensive and fair comparison, DI and HD of generated noisy labels with respect to clean ground truths are reported in this table (Upper boundary columns). Higher DI and lower HD indicate a superior segmentation performance. While training of the networks were done using weak/noisy labels, validation was performed on expert/accurate labels.

Data set: We used the cine MR cardiac data set from Automated Cardiac Diagnosis Challenge (ACDC) MICCAI challenge 2017 [3]. The images in this data set were obtained from two MRI scanners of different magnetic strengths (1.5 T and 3.0 T). Cine MR images were acquired in breath-hold with a retrospective or prospective gating and with a SSFP sequence which LV was covered by series of short axis slices with a thickness of 5 mm. The spatial resolution of the images goes from 1.37 to 1.68 mm\(^2\)/pixel and 28 to 40 images cover completely or partially the cardiac cycle. Out of 150 cine MR images, we used 100 for training and validation (including expert annotation for LV, Myo, and RV at ES and ED), and the remaining 50 for testing (with online evaluation). Subject categories (5) are the following: 30 normal subjects, 30 with myo infarction, 30 with dilated cardiomyopathy, 30 with hypertrophic cardiomyopathy, and 30 with abnormal right ventricle which make the segmentation task challenging over different cases.

Fig. 3.
figure 3

Generating noisy labels. Binary images went through a two-step process of adding pepper noise and filling inside object which makes the boundaries inaccurate.

Generating Noisy Labels: The current annotations at ES and ED in ACDC dataset are considered as expert annotations (as clearly defined by the challenge organizers). Usually, the inexpert annotations include some under-segmentation and/or over-segmentation. This is due to lack of naive annotator’s knowledge in finding the edges. Thus, in order to mimic such inexpert annotations (weak labels), we manipulated the ground truths as follows: first we obtained the outer shell of each object by applying erosion to the binary image and then calculate the difference between the original binary image and eroded one. Then, salt-pepper noises were added to each object’s binary shell randomly and filling was applied to the shell. This process effected only edges of the objects without changing the background, resulting in a shape with distorted boundaries. Finally, eroded binary edges was added to the distorted shell. A sample of weak labels vs. expert labels is shown in Fig. 3. Participating radiologists confirmed the weak labels through visual evaluations.

Baseline Models for Comparisons: We have conducted several baseline architectures in order to show the strength of our proposed method. First, the segmentor was trained only with \(\mathcal {L}_{seg}\) and without using prior information with both of the inexpert and expert annotations. Also, in order to illustrate the advantage of using the geodesic maps instead of binary maps in modeling shape information, the segmentation network was trained with the prior information obtained from binary AE (segmentation + binary labels shape). In this baseline, the AE was trained to reconstruct binary map in its output from its corresponding binary input map (instead of geodesic in our method). The results of this baseline (Binary Prior) and our proposed method (Geodesic Prior) for inexpert and expert annotations are reported in Table 1.

Table 1. DI and HD are reported for both expert and inexpert labels.

Implementation Details: As a pre-processing step, we applied the anisotropic filtering to reduce noise from MRI, histogram matching to standardize MRI intensities, and all images were resized to \(200\times 200\times 10\) by using B-spline interpolation. First the GAE was trained (early-stopping) and then the deep geodesic features (\(Feat_{gae}\)) were extracted from training data. Then, during training of \(Net_{seg}\) the output probability maps of the \(Net_{seg}\) were passed though \(Enc_{gae}\) and then the loss (\(\mathcal {L}_{gae}\)) between two feature vector was calculated and back-propagated though the \(Net_{seg}\). Finally, Conditional Random Field is used for post-processing. We used 80 MR images for training, the 20 images were used as validation, and the 50 images were used for test (with online evaluation). Also, we have used NVIDIA Quadro P6000 GPU for training the networks.

4 Conclusion

In this study, we propose a novel framework incorporating a deep geodesic prior information into the segmentation framework. Our AE network is capable of learning high-level features from generated geodesic maps for multiple objects. We show that our proposed approach outperforms the state-of-the-art methods both on clean and noisy labels with several key advantages. First, incorporating prior information improves segmentation results even with imperfect ground truths. Second, more specifically, shape prior is shown to be useful both in Euclidean and Geodesic distance based evaluations, and geodesic priors are shown to be more accurate than the former. Furthermore, the proposed network is capable of performing fully 3D image segmentation unlike most 2D methods in the literature, and can handle multiple objects too. One may argue to obtain inexpert annotations directly from inexpert annotators for more realistic evaluations, but the regulations for medical imaging data sharing in general (and ACDC challenge in particular) didn’t allow to get such annotations. Our future work will comprehensively test our hypothesis for different components of our present study such as different clinical imaging problem, large number of inexpert annotators, and inspecting the results based on object type and size.