Keywords

1 Introduction

Segmentation of brain magnetic resonance (MR) image is an important task for early diagnosis of brain diseases and for identifying the cause of diseases. Accordingly, many classical machine learning algorithms have been proposed [5], but the variability between subjects and the presence of abnormalities make these methods unreliable in large scale studies.

Recently, deep learning approaches significantly outperform the previous learning methods [1], and thus are used for the segmentation of overall brain structures, tumors, or other abnormalities. Shreyas et al. and Cui et al. extract 2D axial view slices [4, 12] and then apply modified versions of U-net [11] or FCN [10] to segment brain tumors. Chen et al. [2] propose a 3D convolutional neural network (CNN) to perform volumetric brain segmentation. To address the memory limitation, they sample 3D patches that include relatively large area allowed by memory, and then input them to the CNN consisting of multiple convolution and pooling layers to take advantage of a large receptive field on their filters. On the other hand, Dolz et al. [6] extract smaller 3D patches to reduce memory requirements, maintain the volume resolution by avoiding pooling layers and to substantially increase the number of training samples. To use contextual information in both small and large scales, Kamnitsas et al. [8] sample two volume patches of the same region at two different resolutions and then train a multi scale CNN model.

In this paper, we propose a patch wise 3D convolutional neural network to perform segmentation of brain tissues. The network consist of encoding layers and decoding layers similar to conventional U-net [11], but use relatively small 3D volumes as inputs to learn various local patterns as in Dolz et al.’s method [6]. Unlike the U-net, we add transition layers between the encoding layers and the decoding layers to emphasize the impact of feature maps in the decoding layers. We also include batch normalization on every convolution layer to prevent over-fitting and make a well generalized model. Finally, we train our network with a normalized categorical cross entropy to balance the loss from small abnormalities such as white matter hyperintensity and that of large brain structures like white matter and gray matter.

The proposed model using the transition layer and the batch normalization can distinguish multiple brain tissues by effectively generalizing various local patterns. Moreover, the normalized loss function helps to accurately segment small abnormalities which are opt to be misclassified (Fig. 1).

Fig. 1.
figure 1

MRBrainS18 challenge dataset and our result. Multi-modal MR images are shown on the top row (FLAIR image, T1 weighted image and Inverse Recovery T1 weighted image). The ground truth (left) and the prediction (right) generated by our method are shown on the bottom row.

2 Dataset

The data used in MRBrainS18 challenge was acquired from 30 subjects on a 3T scanner at the UMC Utrecht. For each subject, fully annotated multi-sequences (T1-weighted, T1-weighted inversion recovery and T2-FLAIR) were provided. The 30 subjects included patients with diabetes, dementia and Alzheimers, and matched controls (with increased cardiovascular risk) with varying degrees of atrophy and white matter lesions (age > 50). For training, 7 data samples were provided with a file containing the manual reference standard with following 11 labels: (0) Background, (1) Cortical gray matter, (2) Basal ganglia, (3) White matter, (4) White matter lesions, (5) Cerebrospinal fluid in the extracerebral space, (6) Ventricles, (7) Cerebellum, (8) Brain stem, (9) Infarction and (10) Other. The objective was to segment the 8 labels excluding the background, infarctions and other lesions on the remaining 23 data samples.

3 Method

We propose a patch wise segmentation method using 3D fully convolutional neural networks. The network takes as input a 3D volume and produces a 3D output prediction map of the same size as the input. The volume could be the entire MRI scan, but the size is too large to fit it in a modern GPU and the model is not generalized efficiently. Therefore, we use the CNN to make predictions on small 3D sections of the scan and finally use them to reconstruct a full size prediction map.

We use an encoder-decoder CNN structure since the pooling layers increase the receptive field of the filters and they allow the model to learn higher level features and detect larger volumes more accurately. We put skip connections from encoder to decoder to allow the model to use geometrical information from lower level features, but unlike U-Net [11], we use a convolutional transition layer from encoder to decoder to reduce the size of the low level feature maps in order to give more weight to the higher level features learned through deeper layers in the network. We also normalized our loss function to ensure every class, regardless of its size, has a similar impact during the training process.

3.1 Proposed Network

The proposed network consists of 17 convolutional layers divided into 3 stages; encoder, transition, and decoder (see Fig. 2). All convolutional layers have \(3\times 3\times 3\) kernels, ReLU activation and a batch normalization [7] layer after each convolutional layer. The CNN works with 3D patches heuristically defined as \(8\times 24\times 24\) voxels on the z, y and x coordinates by considering the spacing information.

Fig. 2.
figure 2

Description of the proposed CNN model. The blue blocks represent feature maps at the encoding stage. They downsize through the pooling layers, but also connect with the decoding stage through the transition layers. Green feature maps are the results of up-convolution operation. They are concatenated with reduced channel maps from the transition layers. Green and yellow feature maps go through convolution operations twice before the next up-convolution operation. (Color figure online)

The encoder takes 3D input patches from the set of input images, one for each MRI modality. The patches from the input images are concatenated (i.e., the block size is \(8\times 24\times 24\times 3\), where 3 is the number of input modalities) and then pass through a series of three convolutional layers with 64 output filters in cascade (Level 1). The output feature maps are reduced to the half of original size (i.e., \(4\times 12\times 12\)) by an average pooling and then pass through four convolutional layers with 128 filters in cascade (Level 2). In the same manner, the feature maps are reduced once more to the size of \(4\times 6\times 6\) and pass through five convolutional layers with 256 filters (Level 3).

The decoder stage starts from the feature maps generated in Level 3 and applies a deconvolution layer with a kernel \(1\times 2\times 2\) and stride \(1\times 2\times 2\) to match the size of the features maps in Level 2. The 2 features maps are concatenated followed by two convolutional layers and the process is repeated to get features maps in Level 1 as it is suggested in the U-Net [11] architecture. With a final output of size \(8\times 24\times 24\times 64\), the last convolution operation is applied to reduce the number of channels to the number of labels defined by the dataset, followed by a softmax activation function for more than one output label or sigmoid if there is a single label.

Moreover, we add transition layers [3] between the encoder and the decoder. Specifically, the feature maps generated in the last layer at Level 1 on the encoder, pass through a convolutional layer with 16 filters and then are connected to the decoder. Similarly, the feature maps at Level 2 pass through a convolutional layer with 32 filters and then are connected to the decoder.

Fig. 3.
figure 3

2D visualization of the patch extraction and prediction image reconstruction. FLAIR, T1, and IR patches are shown on the left and a partially reconstructed prediction image is shown on the right.

3.2 Loss Function

The loss function was defined as a normalized categorical cross entropy loss (1) [13], in other words, for each class the logarithmic errors are added and then divided by the number of voxels belonging to the class which is given by the ground truth. In this way, the gradients of smaller objects have a bigger impact on the optimization process. The loss is define as follows

$$\begin{aligned} \varPsi = -\sum _{c=1} ^{C} \frac{\sum _{n=1} ^{N} y _{c,n} log( \,f( \,\varTheta ,x _{c,n})) \,}{\sum _{n=1} ^{N} y _{c,n}}, \end{aligned}$$
(1)

where C is the number of clases, N is the number of voxels in the c class, \(C\times N\) is equal to the number of voxels in the minibatch, f it the mapping function from inputs to outputs and \(\varTheta \) are the trainable weights in each convolutional layer.

3.3 Implementation Details

In training stage, we randomly sampled 3D patches from the input MRI images in the training set to get a new minibatch each step and then learn the 3D CNN which can predict the labels in the patch. We trained the model with Adam optimizer, learning rate 1e-4, learning rate decay of 10% after 40 thousand steps for 500 thousand steps. Our model was implemented in TensorFlow and trained on Nvidia Titan XP GPUs.

In the inference stage, we extracted 3D image patches from each MRI modality, applied the intensity normalization and concatenated these 3D volumes to run CNN model and get a prediction map (see Fig. 3). 3D patches are extracted from the testing images with a stride of \(4\times 12\times 12\). This configuration ensures that every voxel is predicted 8 times with a score assigned to every class. The class that scores the highest is selected as a final prediction for that specific voxel.

4 Experimental Results

The proposed method was evaluated on the MRBrainS18 challenge dataset [9]. Due to the limited training data, we first searched reasonable hyper-parameters by performing a leave-one-out cross validation on 7 training data samples. Thereafter, we trained a network with all 7 training data samples using the hyper-parameters. Including our method, the remaining challenge methods were evaluated on the hidden 23 test data samples. Most methods were based on the deep neural networks such as variations of 2D and 3D U-Net [11], DeepMedic network [8], multiple networks ensemble models, and only one method was based on a traditional generative model using Bayes theory on two proirs, a probabilistic tissue atlas and their intensity distribution.

Table 1. Results of our method on the challenge dataset.

The segmentation accuracy was evaluated by multiple metrics such as mean Dice coefficient, mean volume similarity and mean 95% Hausdorff distance. Every metric was calculated for each label and then averaged across the test dataset. Table 1 shows the average scores for eight labels and Fig. 4 shows box plots representing the distributions of accuracy scores on 23 testing data.

The cerebellum was well segmented across the all testing subjects; the highest Dice coefficient and volume similarity were achieved without outliers in this structure. Although the basal ganglia, ventricles and brain stem were the structures with most number of outliers in all metrics, the average scores were relatively high. The gray matter, white matter and CSF achieved good performance in terms of Hausdorff distance and volume similarity metrics, but relatively lower Dice coefficient than other structures. Due to the large size of this structures, the volume similarity does not give much information, but the distance metric indicates that the misclassified voxels are close to the boundary. According to the Hausdorff distance, there are several structures with a score close to 3 mm, which happens to be the spacing at the z axis on the dataset. Those structures are more likely to have misclassify voxels on the z axis rather than on the x or y axis.

The WMH class was the most challenging among all 8 structures. These are abnormalities in the white matter which means they do not have any regular shape, size or location. In general, our proposed model predictions were similar to the ground truth, volume similarity show that they were around 85% accurate, but a mean Hausdorff distance around 10 mm shows the presence of small False positive or False negative on entire WMH regions which also explains the lower Dice coefficient result. There are 2 outliers for the WMH category (see Fig. 4), the first one is a healty subject with a False positive region, or a subject with False negative regions, both examples will produce a 0 score in Dice coefficient and volume similarity. The second outlier seems to have predicted WMH regions outside the brain given its high Hausdorff distance and the outlier presence on the volume similarity.

Fig. 4.
figure 4

Challenge metrics results on our method

Table 2. Competition results.

To compare overall scores of submitted methods, a weighted version of each metric was calculated by a method described on the challenge website [9]. Specifically, all metrics were weighted, per segmentation label, to the 95% range of the original 22 participants in the challenge. Then, the weighted results were summed up to get a final score in the range of 8.016 to 9.971, where a higher score means a better performance.

Table 2 shows the performances of comparison methods. As shown in Table 2, our proposed method, i.e. MISPL, obtained the highest overall score on the challenge. Most methods achieved good results on relatively large regions such as cerebellum and ventricles, but achieved low scores on the WMH abnormalities because the WMH regions are usually much smaller than other tissues in the brain. Thus, it decreased the overall score significantly. The conventional categorical cross entropy gives more weight to bigger objects, resulting in an under-segmentation or completely miss of those small objects. By normalizing the loss function, our proposed method benefited from having fewer outliers than other competitors and a better segmentation of the WMH regions, even though it was also the lowest scored class among all classes in our result.

In summary, the proposed model ranked first in all three metrics, which means it learned the most general features and also, it regularized well from the very limited training data available. This points out the advantage offered by the use of small 3D patches, a normalized loss function and pooling layers, even if the 3D volume seemed to be already small.

5 Conclusion

We proposed a 3D patchwise convolutional neural network method to address the brain MRI segmentation task. Our network consists of encoder-decoder with transition layers, batch normalization and is trained with a normalized cross entropy. The transition layer and batch normalization regularized the model during training, which was vital given the very small amount of training data. The transition layer gives more weights to higher level features, and thus the model can learn the features required to segment large and small regions from a small 3D patch. We have also demonstrated the impact of the new loss function through the experiments. By analyzing three metrics on the MRbrainS18 challenge, we conclude that our method is effective to segment the whole brain region of interests.