1 Introduction

Early detection of lung cancer from clinically acquired computed tomography (CT) scans is essential for lung cancer diagnosis [1]. From the machine learning perspective, lung cancer detection is a binary classification task (cancer or non-cancer). Convolutional neural network (CNN) methods have been widely used in lung cancer detection and typically consist of two steps: nodule detection and classification. Nodule detection locates the pulmonary nodules in a CT scan, reporting their coordinates and regions of interest (e.g., [2]), while classification assigns each nodule to either the benign or the malignant category [3]; the whole CT scan is classified as cancer when it contains at least one malignant nodule. One prevalent method was proposed by Liao et al. [3], which won the Kaggle DSB2017 challenge. In this method, the five nodule regions with the highest detection confidence are used to classify the whole CT scan. The Liao et al. network focuses on a single CT scan, rather than multiple longitudinal scans.

In clinical practice, longitudinal CT scans may contain temporally relevant diagnostic information. To learn from longitudinal scans, recurrent neural networks (RNN) have been introduced to medical image analysis when longitudinal (sequential) imaging data are available (e.g., [4]). Long Short-Term Memory (LSTM) [5] is one of the most prevalent variants of RNN and is capable of learning both long-term and short-term dependencies between features using three gates (i.e., forget, input, and output gates). Many variants of LSTM have been proposed [6,7,8]. For instance, convolutional LSTM [6] is designed to deal with spatiotemporal variations in images [9, 10].

Canonical LSTM assumes that the temporal intervals between consecutive scans are equal. However, this rarely occurs in clinical practice. Temporal intervals have been modeled in LSTM for recommendation systems in finance [8] and abnormality detection on 2D chest X-ray [11]. However, these methods [8, 11] model the relative local time intervals between consecutive scans, and no previous studies have modeled global temporal variations. For lung cancer detection, the last scan is typically the most informative. Therefore, we propose a new Temporal Emphasis Model (TEM) to model the global time interval from each previous time point to the last scan as a global multiplicative function applied to the input gate and the forget gate, rather than as a new gate as in [8] or an additive term as in [11].

Our contributions are: (1) this is the first study that models the time distance from the last time point for LSTM in lung cancer detection; (2) a novel Distanced LSTM (DLSTM) framework is proposed to model the temporal distance with adaptive forget and input gates; (3) a toy dataset called “Tumor-CIFAR” is released to simulate dummy benign and malignant nodules on natural images. In total, 1794 subjects from the widely used National Lung Screening Trial (NLST) [12] and 1420 subjects from two institutional cohorts are used to evaluate the methods.

2 Theory and Method

Distanced LSTM - LSTM is the most widely used RNN for classification or prediction on sequential data. Standard LSTM employs three gates (i.e., forget gate \( f_{t} \), input gate \( i_{t} \), and output gate \( o_{t} \)) to maintain internal states (i.e., hidden state \( H_{t} \) and cell state \( C_{t} \)). The forget gate controls the amount of information from the previous time steps that is used for the current state. To incorporate the “distance attribute” into LSTM, we multiply the forget gate and the input gate by a Temporal Emphasis Model (TEM) \( D\left( {d_{t} ,a,c} \right) \) with learnable parameters (Fig. 1).

Fig. 1. The framework of DLSTM (three “steps” in the example). \( x_{t} \) is the input data at time point t, and \( d_{t} \) is the time distance from the time point t to the latest time point. “F” represents the learnable DLSTM component (convolutional version in this paper). \( H_{t} \) and \( C_{t} \) are the hidden state and cell state, respectively. The input data, \( x_{t} \), could be 1D, 2D, or 3D.

Briefly, our DLSTM is defined by following the terms and variables in [6]:

$$ \begin{array}{*{20}l} {i_{t} = D\left( {d_{t} ,a,c} \right) \cdot \sigma \left( {W_{xi} * X_{t} + W_{hi} * H_{t - 1} + W_{ci} \circ C_{t - 1} + b_{i} } \right)} \hfill \\ {f_{t} = D\left( {d_{t - 1} ,a,c} \right) \cdot \sigma \left( {W_{xf} * X_{t} + W_{hf} * H_{t - 1} + W_{cf} \circ C_{t - 1} + b_{f} } \right)} \hfill \\ {C_{t} = f_{t} \circ C_{t - 1} + i_{t} \circ \tanh \left( {W_{xc} * X_{t} + W_{hc} * H_{t - 1} + b_{c} } \right)} \hfill \\ {o_{t} = \sigma \left( {W_{xo} * X_{t} + W_{ho} * H_{t - 1} + W_{co} \circ C_{t} + b_{o} } \right)} \hfill \\ {H_{t} = o_{t} \circ \tanh \left( {C_{t} } \right)} \hfill \\ \end{array} $$
(1)

where \( X_{t} \) (shown as \( x_{t} \) in Fig. 1) is the input data at time point \( t \), \( d_{t} \) is the global time distance from time point \( t \) to the latest scan, \( W \) and \( b \) are the learnable parameters, and “*” and “\( \circ \)” denote the convolution operator and the Hadamard product, respectively. Different from canonical LSTM, the TEM function is introduced in the proposed DLSTM as

$$ D\left( {d_{t} ,a,c} \right) = a \cdot e^{{ - c \cdot d_{t} }} $$
(2)

where \( a \) and \( c \) are positive learnable parameters. Different from tLSTM [11], which introduces an additive term to model the local relative time interval between scans, the proposed DLSTM introduces the TEM function as a global multiplicative function to model the time interval (distance) from each scan to the last scan. With TEM in Eq. (1), both the forget gate \( f_{t} \) and the input gate \( i_{t} \) are weakened when the input scan is far from the last scan, since \( D \) decays exponentially with the distance. Note that “LSTM” refers to the convolutional version throughout this paper.
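To make the effect of the TEM concrete, the following is a minimal scalar sketch of Eqs. (1) and (2), not the authors' convolutional implementation. Scalar products stand in for the convolutions, and the function names, the `weights` dictionary, and the numeric values of `a`, `c`, and the distances are all assumptions chosen for illustration.

```python
import math

def tem(d, a=1.0, c=0.1):
    """Temporal Emphasis Model, Eq. (2): D(d, a, c) = a * exp(-c * d)."""
    return a * math.exp(-c * d)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dlstm_step(x, h_prev, c_prev, d_t, d_prev, w, a=1.0, c=0.1):
    """One scalar DLSTM step in the spirit of Eq. (1).

    `w` is a hypothetical dict of scalar weights and biases; `d_prev` plays
    the role of d_{t-1} in the forget gate.
    """
    # Input and forget gates are scaled by the TEM value.
    i_t = tem(d_t, a, c) * sigmoid(
        w["xi"] * x + w["hi"] * h_prev + w["ci"] * c_prev + w["bi"])
    f_t = tem(d_prev, a, c) * sigmoid(
        w["xf"] * x + w["hf"] * h_prev + w["cf"] * c_prev + w["bf"])
    # Cell update: keep part of the old state, add the gated candidate.
    c_t = f_t * c_prev + i_t * math.tanh(
        w["xc"] * x + w["hc"] * h_prev + w["bc"])
    # The output gate is not scaled by the TEM.
    o_t = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["co"] * c_t + w["bo"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Hypothetical weights, all set to 0.5 for illustration only.
weights = {k: 0.5 for k in ("xi", "hi", "ci", "bi", "xf", "hf", "cf", "bf",
                            "xc", "hc", "bc", "xo", "ho", "co", "bo")}

# Same input and previous state, evaluated close to and far from the last scan.
near = dlstm_step(1.0, 0.2, 0.3, d_t=0.0, d_prev=1.0, w=weights)
far = dlstm_step(1.0, 0.2, 0.3, d_t=5.0, d_prev=6.0, w=weights)
```

With identical inputs, the step evaluated far from the last scan writes a smaller cell state than the step evaluated near it, which is the weakening behavior described above.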

3 Experiment Design and Results

We include both simulation (Tumor-CIFAR) and empirical validations (NLST and clinical data from two in-house projects, see Table 1) to evaluate the baseline methods and the proposed method. First, to test whether our algorithm can handle time-interval distances effectively, we introduce a synthetic dataset: Tumor-CIFAR.

Table 1. Demographic distribution in our experiments

In Tumor-CIFAR, we show the test results with a training/validation/test split (Fig. 2). We perform three different validations on lung datasets: (1) cross-validation on NLST with longitudinal data (Table 2); (2) cross-validation on clinical data with both cross-sectional and longitudinal scans (Table 3); and (3) external-validation on longitudinal scans (train and validation on NLST and test result on clinical data, Table 4).

Fig. 2. The receiver operating characteristic (ROC) curves of Tumor-CIFAR. The left panels simulate the situation in which the images are sampled with the same interval distribution, while the right panels are sampled with the same size distribution. The upper panels show examples of images in Tumor-CIFAR. Noise (white and black dots) is added, while the dummy nodules are shown as white blobs (some are indicated by red arrows). The lower panels show the Area Under the Curve of ROC (AUC) values of different methods. (Color figure online)

Table 2. Experimental results on NLST dataset (%, average (std) of cross-validation)
Table 3. Experimental results on clinical datasets (%, average (std) of cross-validation)
Table 4. Experimental results on cross-dataset test (external-validation)

3.1 Simulation: Tumor-CIFAR

Data.

Based on [13], the growth speed of malignant nodules is approximately three times that of benign ones. To incorporate temporal variations in the simulation, we add dummy nodules to CIFAR10 [14] images with different growth rates for benign and malignant nodules (malignant nodules grow three times faster than benign ones).

Two cases are simulated: image samples with the same “interval distribution” (Fig. 2a), or with the same nodule “size distribution” (Fig. 2b). The same interval distribution means that the intervals follow the same Gaussian distribution. The same nodule size distribution means that the growth rates of the nodules follow the same Gaussian distribution (simulation code, detailed descriptions, and more image examples are publicly available at https://github.com/MASILab/tumor-cifar).

There are 50,000 samples in the training set and 10,000 samples in the testing set, and cancer prevalence is 50% in each set. Each sample is simulated with five different time points. The training set is further divided so that the training/validation/test split is 40k/10k/10k.
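The two sampling cases can be illustrated with a small sketch. It is not the released simulation code (see the repository above for that), and it only tracks nodule sizes and scan times rather than drawing blobs on images. The linear growth model, the Gaussian parameters, and all names are assumptions.

```python
import random

BENIGN_RATE = 1.0                    # hypothetical growth per unit time
MALIGNANT_RATE = 3.0 * BENIGN_RATE   # three times faster, as stated in the text

def simulate_sample(malignant, mode, rng, n_points=5):
    """Return (sizes, distances) for one simulated subject.

    mode="same_interval": scan intervals come from one Gaussian for both classes.
    mode="same_size": per-interval growth comes from one Gaussian for both
    classes, so the intervals, not the sizes, differ between classes.
    """
    rate = MALIGNANT_RATE if malignant else BENIGN_RATE
    t, times, sizes = 0.0, [], []
    for _ in range(n_points):
        if mode == "same_interval":
            dt = abs(rng.gauss(1.0, 0.2))
        elif mode == "same_size":
            dt = abs(rng.gauss(3.0, 0.6)) / rate
        else:
            raise ValueError(f"unknown mode: {mode}")
        t += dt
        times.append(t)
        sizes.append(rate * t)       # linear growth, a simplifying assumption
    distances = [times[-1] - ti for ti in times]   # d_t: time to the last scan
    return sizes, distances

# With the same random draws, the two classes have matching sizes in
# "same_size" mode and can only be separated through the time distances.
benign = simulate_sample(False, "same_size", random.Random(0))
malignant = simulate_sample(True, "same_size", random.Random(0))
```

In this sketch the "same_size" case removes size as a cue, so only a model that sees the time distances can separate the classes.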

Experimental Design.

The base network structure (CNN in Fig. 1) is adopted from the official PyTorch 0.4.1 [15] example for MNIST (we call it “ToyNet”). The ToyNet is composed of two convolutional layers (the second with a 2D dropout) followed by two fully connected layers with a 1D dropout in between. “LSTM” and “DLSTM” in Fig. 2 denote networks in which a 2D convolutional LSTM component or a 2D convolutional version of our proposed DLSTM component, respectively, is stacked at the beginning of the “ToyNet”. The maximum number of training epochs is 100. The initial learning rate is set to 0.01 and is multiplied by 0.4 at the 50th, 70th, and 80th epochs.
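The step schedule described above can be written out as a small function. This sketch assumes epochs are counted from 1 and that each decay takes effect at the start of the milestone epoch; the text does not specify either detail, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, gamma=0.4, milestones=(50, 70, 80)):
    """Learning rate at a given epoch under a multi-step decay schedule."""
    # Count how many milestones have been reached by this epoch.
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays

# Rates before the first milestone and at each milestone.
schedule = {e: learning_rate(e) for e in (1, 50, 70, 80)}
```

Under these assumptions the rate takes the values 0.01, 0.004, 0.0016, and 0.00064 over the four phases of training. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[50, 70, 80]` and `gamma=0.4` expresses a similar schedule; whether the authors used it is not stated.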

Results.

For the same time interval distribution (Fig. 2a), the LSTM achieves higher performance than the baseline CNN method, while the DLSTM works even better. This task is relatively easy since the malignant nodules clearly grow faster than the benign nodules. However, if we control the sampling strategy to guarantee the same nodule size for corresponding samples (Fig. 2b), the task becomes challenging when the time intervals are not modeled in the network design, since the nodules now have the same size. In this case, the CNN and LSTM only achieve AUC values of 0.5, while our DLSTM is able to almost perfectly capture the temporal variations with an AUC value of 0.995.

3.2 Empirical Validation on CT

Data.

The National Lung Screening Trial (NLST) [12] is a large-scale randomized controlled trial for early diagnosis of lung cancer, with low-dose CT screening exams publicly available. We obtain a subset (1794 subjects) from NLST, which contains all longitudinal scans with “follow-up confirmed lung cancer”, as well as a random subset of all “follow-up confirmed not lung cancer” scans (Table 1). We also evaluate our algorithm on an in-house dataset that combines two clinical lung cohorts: Molecular Characterization Laboratories (MCL, https://mcl.nci.nih.gov) and the Vanderbilt Lung Screening Program (VLSP, https://www.vumc.org/radiology/lung). These data are used in de-identified form under internal review board supervision.

Experimental Design.

The DLSTM can be trained as part of an end-to-end network (simulation experiments in Sect. 3.1) or used in a lightweight post-processing manner. In this section, we evaluate the proposed DLSTM as a post-processing network for the imaging features extracted by Liao et al. [3]. We compare the DLSTM with a recently proposed benchmark, tLSTM [11], which models the relative time interval as an additive term. The five highest-risk regions (possible nodules) of each scan are detected by [3], and the feature dimension for each region is 64; the scan-level feature is then obtained by concatenating the region features into a 5 × 64 input. For a fair comparison, the same features are provided to the Multi-channel CNN (MC-CNN), LSTM, tLSTM, and DLSTM networks, each with a 1D convolutional layer of kernel size 5. MC-CNN concatenates multi-scan features in the “channel” dimension. The maximum number of training epochs is 100; the initial learning rate is set to 0.01 and is multiplied by 0.4 at the 50th, 70th, and 80th epochs. Since most of the longitudinal lung CT scans contain two time points, we evaluate the MC-CNN, LSTM, tLSTM, and DLSTM with two time points (“2 steps”) in this study (the last two time points are picked if the patient has more than two scans).
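The selection of the “2 steps” input and the corresponding time distances can be sketched as follows. The data layout (a list of `(acquisition_time, features)` pairs per subject) and the function name are assumptions made for illustration; the features stand for the 5 × 64 scan-level inputs described above.

```python
def pick_last_two(scans):
    """Keep the two most recent scans and compute d_t for each.

    `scans` is a list of (acquisition_time, features) tuples in any order.
    Returns the features in chronological order together with the time
    distance from each kept scan to the last scan.
    """
    ordered = sorted(scans, key=lambda s: s[0])[-2:]
    t_last = ordered[-1][0]
    features = [f for _, f in ordered]
    distances = [t_last - t for t, _ in ordered]
    return features, distances

# Three scans at years 0, 1, and 2; placeholder strings stand in for features.
features, distances = pick_last_two([(2.0, "scan_c"), (0.0, "scan_a"), (1.0, "scan_b")])
```

The last scan always has distance zero, so under Eq. (2) its gates are scaled only by \( a \).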

The “Ori CNN” in Tables 2, 3 and 4 represents the results obtained with the open-source code and trained model of [3]. Unless otherwise stated, our results are reported at the subject level rather than the scan level, and the “Ori CNN” reports the performance on the latest scan of each patient.

Preprocessing.

Our preprocessing follows Liao et al. [3]. We resample the 3D volume to \( 1 \times 1 \times 1 \) mm isotropic resolution. The lungs are segmented from the original CT volume using the code at https://github.com/lfz/DSB2017, and the non-lung regions are padded with a Hounsfield unit score of 170. Then, the 3D volumes are resized to 128 × 128 × 128 so that the pre-trained model can be used to extract image features.
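As a worked example of the resampling step, the grid size after resampling to 1 mm isotropic resolution follows from the original shape and voxel spacing. The helper below only computes the target shape (the interpolation itself is not shown); the function name, the axis order, and the example spacing are hypothetical.

```python
def resampled_shape(shape, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Voxel grid size after resampling from `spacing` to `new_spacing` (mm)."""
    # Physical extent along each axis is n * s; divide by the new spacing.
    return tuple(int(round(n * s / ns))
                 for n, s, ns in zip(shape, spacing, new_spacing))

# A hypothetical scan with 120 slices of 2.5 mm and 512 x 512 pixels of 0.7 mm.
target = resampled_shape((120, 512, 512), (2.5, 0.7, 0.7))
```

For this hypothetical scan the resampled grid is 300 × 358 × 358 voxels, which would then be resized to 128 × 128 × 128 as described above.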

Results: Cross-validation on Longitudinal Scans.

Table 2 shows the five-fold cross-validation results on 1794 longitudinal subjects from the NLST dataset. All the training and validation data are longitudinal (with “2 steps”).

Results: Cross-validation on Combining Cross-sectional and Longitudinal Scans.

More than half of the patients in the clinical projects have only cross-sectional (single time point) CT scans (see Table 1). Therefore, we evaluate the proposed method as well as the baseline methods on the entire clinical cohort of 1420 subjects, which includes both longitudinal and cross-sectional subjects, using cross-validation; for subjects with only one scan, the scan is duplicated to form 2 steps. Table 3 shows the five-fold cross-validation results on the clinical data. For tLSTM [11] and the proposed DLSTM, we set the time interval and the time distance, respectively, to zero for cross-sectional scans.
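The handling of cross-sectional subjects can be sketched in a few lines. The function name and the list-based data layout are assumptions; the behavior (duplicate the single scan and use zero time distance) follows the description above.

```python
def pad_to_two_steps(features, distances):
    """Return a 2-step input; a single scan is duplicated with zero distance."""
    if len(features) == 1:
        # Cross-sectional subject: repeat the scan and set both distances to 0.
        return [features[0], features[0]], [0.0, 0.0]
    return features, distances

# Placeholder strings stand in for the scan-level features.
single = pad_to_two_steps(["scan_a"], [0.0])
paired = pad_to_two_steps(["scan_a", "scan_b"], [1.5, 0.0])
```

With zero distance, Eq. (2) gives \( D = a \) for both steps, so a duplicated scan is processed without any temporal weakening of the gates.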

Results: External-validation on Longitudinal Scans.

We directly apply the models trained on NLST to the in-house subjects as external validation, without any further parameter tuning (Table 4). Note that the longitudinal data are regularly sampled in NLST, while the clinical datasets are irregularly acquired. The final predicted cancer probability is the average over the five models trained on the five folds of NLST. The “Ori CNN (all scans)” in Table 4 represents the scan-level results of all scans from longitudinal subjects.
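The fold-averaging step can be sketched as below. The models are represented as plain callables that return a cancer probability, which is an assumption for illustration, as are the names and the example probabilities.

```python
def ensemble_probability(fold_models, features):
    """Average the cancer probabilities predicted by the per-fold models."""
    probs = [model(features) for model in fold_models]
    return sum(probs) / len(probs)

# Five stand-in "models" that ignore their input and return fixed probabilities.
fold_models = [lambda f, p=p: p for p in (0.2, 0.4, 0.6, 0.8, 0.5)]
averaged = ensemble_probability(fold_models, features=None)
```

Averaging the five fold models avoids picking a single fold for external testing and uses every NLST training split.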

Analyses:

In both the public NLST dataset and our private datasets, the proposed DLSTM achieves competitive results in accuracy, AUC, F1, recall, and precision. For example, the proposed DLSTM improves on the conventional LSTM in F1 score from 0.6785 to 0.7085 (Table 2, NLST dataset), and from 0.7417 to 0.7611 (Table 3, clinical datasets). The external validation experiments indicate the generalization ability of the proposed method.

In the external validation, (1) the latest scans achieve higher performance compared with longitudinal scans, which indicates that the emphasis on the latest longitudinal scan in our DLSTM is meaningful; (2) the algorithms with time information (tLSTM and the proposed DLSTM) outperform the methods without temporal emphasis when the test dataset is irregularly sampled.

4 Conclusion and Discussion

In this paper, we propose a novel DLSTM method to model the global temporal intervals between longitudinal CT scans for lung cancer detection. Our method has been validated using both simulations on Tumor-CIFAR and empirical validations on 1794 NLST subjects and 1420 clinical subjects. In both cross-validation and external validation, the proposed DLSTM method achieves generally superior performance compared with the baseline methods. Meanwhile, the Tumor-CIFAR dataset is publicly available.