Published in final edited form as: Med Image Anal. 2023 Oct 13;91:102988. doi: 10.1016/j.media.2023.102988

Seeking an optimal approach for Computer-aided Diagnosis of Pulmonary Embolism

Nahid Ul Islam a, Zongwei Zhou b, Shiv Gehlot c, Michael B Gotway d, Jianming Liang c,*
PMCID: PMC11039560  NIHMSID: NIHMS1986485  PMID: 37924750

Abstract

Pulmonary Embolism (PE) represents a thrombus (“blood clot”), usually originating from a lower extremity vein, that travels to the blood vessels in the lung, causing vascular obstruction and in some patients death. This disorder is commonly diagnosed using Computed Tomography Pulmonary Angiography (CTPA). Deep learning holds great promise for the Computer-aided Diagnosis (CAD) of PE. However, numerous deep learning methods, such as Convolutional Neural Networks (CNN) and Transformer-based models, exist for a given task, causing great confusion regarding the development of CAD systems for PE. To address this confusion, we present a comprehensive analysis of competing deep learning methods applicable to PE diagnosis based on four datasets. First, we use the RSNA PE dataset, which includes (weak) slice-level and exam-level labels, for PE classification and diagnosis, respectively. At the slice level, we compare CNNs with the Vision Transformer (ViT) and the Swin Transformer. We also investigate the impact of self-supervised versus (fully) supervised ImageNet pre-training, and transfer learning over training models from scratch. Additionally, at the exam level, we compare sequence model learning with our proposed transformer-based architecture, Embedding-based ViT (E-ViT). For the second and third datasets, we utilize the CAD-PE Challenge Dataset and Ferdowsi University of Mashhad’s PE Dataset, where we convert (strong) clot-level masks into slice-level annotations to evaluate the optimal CNN model for slice-level PE classification. Finally, we use our in-house PE-CAD dataset, which contains (strong) clot-level masks. Here, we investigate the impact of our vessel-oriented image representations and self-supervised pre-training on PE false positive reduction at the clot level across image dimensions (2D, 2.5D, and 3D). Our experiments show that (1) transfer learning boosts performance despite differences between photographic images and CTPA scans; (2) self-supervised pre-training can surpass (fully) supervised pre-training; (3) transformer-based models demonstrate comparable performance but slower convergence compared with CNNs for slice-level PE classification; (4) a model trained on the RSNA PE dataset demonstrates promising performance when tested on unseen datasets for slice-level PE classification; (5) our E-ViT framework excels in handling variable numbers of slices and outperforms sequence model learning for exam-level diagnosis; and (6) vessel-oriented image representation and self-supervised pre-training both enhance performance for PE false positive reduction across image dimensions. Our optimal approach surpasses state-of-the-art results on the RSNA PE dataset, enhancing AUC by 0.62% (slice-level) and 2.22% (exam-level). On our in-house PE-CAD dataset, 3D vessel-oriented images improve performance from 80.07% to 91.35%, a remarkable 11% gain. Code is available at GitHub.com/JLiangLab/CAD_PE.

Keywords: Pulmonary embolism, CNNs, Vision transformer, Swin transformer, Transfer learning, Supervised learning, Self-supervised learning

1. Introduction

Pulmonary Embolism (PE) represents a thrombus (occasionally colloquially, and incorrectly, referred to as a “blood clot”), usually originating from a lower extremity or pelvic vein, that travels to blood vessels in the lung and causes vascular obstruction. PE causes more deaths than lung cancer, breast cancer, and colon cancer combined (U.S. Department of Health and Human Services Food and Drug Administration, 2008).

The current test of choice for PE diagnosis is computed tomography pulmonary angiography (CTPA) (Stein et al., 2006), but studies have shown a rate of under-diagnosis of 14% and over-diagnosis of 10% with CTPA (Lucassen et al., 2013). Computer-aided diagnosis (CAD) has shown great potential for improving the imaging diagnosis of PE (Masutani et al., 2002; Liang and Bi, 2007; Zhou et al., 2009; Tajbakhsh et al., 2015; Rajan et al., 2020; Huang et al., 2020a; Zhou et al., 2019; Zhou, 2021; Zhou et al., 2021a, 2017a). However, recent research in deep learning across academia and industry has produced numerous architectures, model initializations, learning paradigms, and data pre-processing strategies, yielding many competing approaches to CAD implementation in medical imaging and creating great confusion in the CAD community.

To address this confusion and develop an optimal approach, we seek to answer the critical question: What deep learning architectures, model initializations, learning paradigms, and data pre-processing should be used for computer-aided diagnosis of pulmonary embolism? To answer this question, we have conducted extensive experiments with various deep learning methods applicable for PE diagnosis at both slice and exam-levels, using three publicly available PE datasets (Colak et al., 2021b; González et al., 2020; Masoudi et al., 2018) and an in-house dataset (Tajbakhsh et al., 2015).

Architectures.

Convolutional neural networks (CNNs) have been the default architectural choice for computer-aided diagnosis in medical imaging (Litjens et al., 2017; Deng et al., 2020). Alternatively, Transformers have proven powerful for Natural Language Processing (NLP) (Devlin et al., 2018; Brown et al., 2020) and have been quickly adopted for image analysis (Dosovitskiy et al., 2020; Han et al., 2021; Touvron et al., 2021), leading to the Vision Transformer (ViT) (Dosovitskiy et al., 2020; Vaswani et al., 2017). ViT has shown competitive performance for medical imaging applications such as classification (Xiao et al., 2023) and segmentation (Shamshad et al., 2022). This study assesses the performance of the original ViT, the Swin Transformer, and 12 CNN variants for the slice-level PE classification task. The ensemble of the Xception (Chollet, 2017), SeXception, and SeResNext50 architectures (all CNN-based) improves slice-level PE classification performance, achieving gains of 4.97% and 0.30% over the best ViT and Swin Transformer models, respectively.

Model initializations.

Training deep models generally requires massive carefully annotated training datasets (Tajbakhsh et al., 2021; Haghighi et al., 2021). However, it is often prohibitive to create such large annotated datasets in medical imaging. Due to the lack of a sufficiently large annotated dataset, training a model from scratch may lead to suboptimal performance. Transfer learning provides a data-efficient alternative to this problem, whereby a model pre-trained on a source task (e.g., ImageNet) is fine-tuned on the related, but different, target task. Pre-training and fine-tuning have proven more effective than training models from scratch (Tajbakhsh et al., 2016; Zhou et al., 2017a). There are two major strategies to pre-train models (fully-supervised and self-supervised learning). Fully-supervised learning pre-trains a model using annotated data, whereas self-supervised learning does not require annotations (Jing and Tian, 2020; Haghighi et al., 2020; Chen et al., 2021). In this paper, we benchmark 12 different CNN architectures with fully-supervised pre-trained weights and evaluate 19 self-supervised methods for slice-level PE classification. Both SeLa-v2 (Asano et al., 2020) and DeepCluster-v2 (Caron et al., 2018) self-supervised methods surpass their fully-supervised counterparts with ~0.95% gain. We also evaluate Models Genesis (Zhou et al., 2021b) for the task of reducing false positives in 3D volume-based PE detection, achieving a performance improvement of 1.13% compared to training from scratch.

Learning paradigms.

For exam-level PE diagnosis, predictions are made for a collection of slices of CT scans. Due to the acquisition mechanism, a spatial correlation exists among the slices of each exam. Sequence models such as the recurrent neural network (RNN), long short-term memory (LSTM), and the gated recurrent unit (GRU) can exploit this spatial correlation, leading to improved exam-level diagnosis performance. We evaluated the bidirectional Gated Recurrent Unit (BiGRU) for exam-level PE diagnosis. We have also investigated the application of a transformer-based model for exam-level diagnosis. Instead of following the traditional methods, we adopted the ViT architecture to handle slice-level embeddings, which offers a more efficient and flexible solution than raw images for exam-level diagnosis. This approach overcomes the limitations of high-dimensional images and enables the Transformer encoder to process an arbitrary number of slices for each patient. Similar to MIL-VT (Yu et al., 2021), we propose utilizing both class embedding and exam-level embedding for exam-level diagnosis. This approach achieves a 1.53% gain over the previous state-of-the-art performance for exam-level PE diagnosis.

Data pre-processing.

Representation of input data is critical for machine learning algorithms, and an optimal representation may significantly enhance performance. To this end, we explore vessel-oriented image representation (VOIR) (Tajbakhsh et al., 2015) compared with standard image representation for the task of reducing PE false positives. As an overview, VOIR follows three steps: (1) candidate generation, (2) vessel axis determination, and (3) slice selection. In effect, VOIR can be used as a pre-processing step to obtain a performance gain over the conventional representation. We extend VOIR (Tajbakhsh et al., 2019) to its 3D form as a data pre-processing step and conduct experiments to demonstrate its effectiveness for reducing PE false positives. VOIR outperforms standard image representation on 3D volume-based data with an AUC gain of 11.28%.

2. Related work

Earlier work in automated PE diagnosis focused on pulmonary ventilation scans, which were passed through a neural network for PE detection (Patil et al., 1993; Tourassi et al., 1995; Scott and Palmer, 1993; Serpen et al., 2008), achieving modest success but with inadequate generalizability. Subsequently, CTPA scans replaced ventilation-perfusion scans for PE detection (Liang and Bi, 2007; Özkan et al., 2014; Park et al., 2010). However, these approaches were based on manual feature engineering, making them suboptimal and computationally complex. Recent advancements in deep learning led to end-to-end deep learning models for PE diagnosis; e.g., CNNs were used by Tajbakhsh et al. (2015), with a vessel-aligned image representation rather than a standard representation as input. As opposed to 2D CNNs, Yang et al. (2019) divided the CTPA scan into smaller cubes and evaluated the cubes using 3D CNNs. Huang et al. (2020b) proposed PENet, a 3D CNN that uses multiple slices for PE prediction. Furthermore, Shi et al. (2020) utilized a two-stage PE detection framework that outperformed PENet (Huang et al., 2020b): in the first stage, a CNN was trained with attention supervision using small pixel-level annotated slices, and in the second stage, an RNN output patient-level PE predictions. Rajan et al. (2020) also explored a two-stage framework, in which the first stage generated the PE candidates and the second stage utilized multiple instance learning (MIL) for PE detection. Similarly, Suman et al. (2021) used a two-stage framework to provide slice- and exam-level diagnosis, utilizing MIL in the second stage.

The top performing methods in the Radiological Society of North America (RSNA) Pulmonary Embolism Detection Challenge (Colak et al., 2021b) also utilized a two-stage framework for slice-level PE classification and exam-level diagnosis. For example, the first-place solution used a CNN in the first stage for slice-level classification. Subsequently, the slice-level embeddings from the CNN model were passed through a bidirectional gated recurrent unit to output exam-level diagnosis. Similarly, the second-place solution used two CNNs for slice-level embedding extraction, which were fed to DistilBERT (Sanh et al., 2019) for exam-level diagnosis. However, this approach was computationally expensive due to the incorporation of multiple CNNs. The third-place solution was similar to the first-place solution, albeit with two significant differences: (1) a penultimate layer was used for predicting seven fine-grained PE labels in the first stage, and (2) bidirectional LSTM replaced bidirectional GRU in the second stage.

Islam et al. (2021) first presented a benchmark on slice-level classification in the first stage and exam-level diagnosis in the second stage using the RSNA PE dataset (Colak et al., 2021b). This paper substantially extends that preliminary version with the following enhancements:

  1. An extensive benchmark with 12 fully-supervised and 19 self-supervised pre-trained models for slice-level PE classification (see Section 4.1).

  2. A novel Embedding-based Vision Transformer (E-ViT) for exam-level PE diagnosis. Unlike the original ViT, the encoder in our E-ViT exploits both class embedding and exam-level embedding, resulting in a significant exam-level performance gain (see Section 4.2.2).

  3. A comprehensive comparison between sequence model learning and our proposed E-ViT for the exam-level PE diagnosis (see Section 5.2).

  4. A comparative analysis of the impact of vessel-oriented image representations and model initialization on PE false positive reduction in 2D, 2.5D, and 3D (see Section 3.4).

3. Materials

3.1. RSNA PE data

This dataset consists of 7279 CTPA exams, with a varying number of individual images or slices in each exam, using an image size of 512 × 512 pixels. The test set is created by randomly sampling 1000 exams, and the remaining 6279 exams form the training set. Correspondingly, there are 1,542,144 and 248,480 slices in the training and test sets, respectively. This dataset is annotated at both the slice and exam levels; that is, each slice has been annotated as either PE present or PE absent, and each exam has been further annotated with nine additional labels: negative exam for PE, indeterminate, left PE, right PE, central PE, right ventricular/left ventricular (RV/LV) ratio ≥ 1, RV/LV ratio < 1, chronic PE, and acute and chronic PE.

Pre-processing.

Similar to the first-place solution for the RSNA Pulmonary Embolism Detection Challenge (Colak et al., 2021b), lung localization and windowing are used as pre-processing steps. Lung localization removes irrelevant tissue and keeps the region of interest in the slices, whereas windowing clips the pixel intensities to the range of [100, 700] HU. The slices are then resized to 576 × 576 pixels. Fig. 1 illustrates these pre-processing steps in detail. We use three adjacent slices from an exam as the 3-channel input to the model.
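
To make these steps concrete, the following is a minimal sketch of the slice-level pre-processing, assuming the CT volume is already loaded as a NumPy array in Hounsfield units; the function and variable names are illustrative rather than the authors' implementation, and lung localization (cropping to a lung bounding box before resizing) is omitted.

```python
import numpy as np
import cv2  # opencv-python, assumed here for resizing

def window_slice(slice_hu, low=100, high=700):
    """Clip intensities to the [100, 700] HU window and scale to [0, 1]."""
    clipped = np.clip(slice_hu, low, high)
    return (clipped - low) / (high - low)

def make_3channel_input(volume_hu, index, size=576):
    """Stack three adjacent windowed slices as the 3-channel model input."""
    channels = []
    for i in (index - 1, index, index + 1):
        i = int(np.clip(i, 0, volume_hu.shape[0] - 1))   # repeat edge slices at the boundaries
        s = window_slice(volume_hu[i]).astype(np.float32)
        s = cv2.resize(s, (size, size), interpolation=cv2.INTER_LINEAR)
        channels.append(s)
    return np.stack(channels, axis=0)                    # shape: (3, 576, 576)
```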

Fig. 1. The pre-processing steps for slice-level classification. The figure shows the original CT scan followed by windowing and lung localization. For windowing, we clipped the pixel values below 100 HU and above 700 HU, preserving the values in between.

3.2. CAD-PE challenge dataset

The CAD-PE Challenge Dataset comprises 91 computed tomography pulmonary angiography (CTPA) scans, each positively diagnosed with pulmonary embolism (PE) by at least one experienced radiologist (González et al., 2020). There are 41,256 slices in total. Each scan has been segmented to delineate all clots. The dataset was created for an ISBI challenge.

Pre-processing.

We use the provided ground-truth masks to generate slice-level annotations and apply the same pre-processing steps to this dataset as for the RSNA PE dataset. We then leverage the best CNN model trained on the RSNA PE dataset to assess its performance on the CAD-PE challenge dataset (Section 5.1.1).
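
As a minimal sketch of how clot-level segmentation masks can be converted into slice-level labels, a slice is marked PE-positive if its mask contains any clot voxels; the function name is illustrative and this is not the authors' exact script.

```python
import numpy as np

def masks_to_slice_labels(mask_volume):
    """Convert a clot segmentation volume of shape (slices, H, W) into per-slice labels.

    A slice is labeled 1 (PE present) if any voxel of its mask is nonzero, otherwise 0.
    """
    flat = mask_volume.reshape(mask_volume.shape[0], -1)
    return (flat.sum(axis=1) > 0).astype(np.int64)

# Example: labels = masks_to_slice_labels(mask)  # mask loaded from the dataset
```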

3.3. Ferdowsi University of Mashhad’s PE (FUMPE) dataset

FUMPE is a publicly available dataset that includes three-dimensional CTPA images of 35 patients, with a total of 8792 slices. For each image, two expert radiologists provided segmentation ground truths to identify the PE regions (Masoudi et al., 2018).

Pre-processing.

The FUMPE dataset includes segmentation masks that highlight the regions of PE. We identify which slices contain nonzero mask values and consider those slices PE positive. We apply the same pre-processing procedure to the FUMPE dataset as for the RSNA PE dataset. We use the optimal CNN model trained on the RSNA PE dataset to evaluate its performance on the FUMPE dataset (Section 5.1.1).

3.4. PE-CAD data

Our in-house PE-CAD dataset was annotated at the clot level; that is, each clot was manually segmented. The dataset comprises 121 CTPA scans with 326 emboli and the spatial coordinates of these emboli. At the patient level, the dataset is separated into a training set (71 patients) and a test set (50 patients). This dataset is analyzed in two stages for PE detection: PE candidate generation and false positive reduction. These two stages have been extensively covered in the existing literature. For example, Models Genesis (Zhou et al., 2019), a self-supervised model, was fine-tuned on the PE-CAD dataset as the target dataset. Here, Models Genesis was pre-trained on a dataset of chest CT scans called LUNA16 (Setio et al., 2016), achieving an AUC of 87.20 ± 2.9 for the task of PE false positive reduction. Parts2Whole (Feng et al., 2020) and TransVW (Haghighi et al., 2021) also evaluated their self-supervised methods on the PE-CAD dataset, achieving AUCs of 86.14 ± 2.9 and 87.07 ± 2.8, respectively, via fine-tuning pre-trained models. In this paper, in addition to standard image representations, we also utilize the vessel-oriented image representation of this dataset as described in Tajbakhsh et al. (2019), and extend their image representation to full 3D (see pre-processing) for PE false positive reduction. To conduct a fair comparison with these prior studies, we compute candidate-level AUC by classifying true positives and false positives.

Pre-processing.

The approach developed in Liang and Bi (2007) and Bi and Liang (2007) has been utilized to generate PE candidates based on heuristic lung segmentation and the tobogganing algorithm (Fairfield, 1990). The lungs appear hypoattenuating, or darker than their surroundings, in chest CTPA scans. Therefore, to separate the lungs from the remainder of the scan, the voxel intensity values are thresholded at −400 HU, resulting in a binary volume in which the lungs and other dark regions appear hyperattenuating, or white. A closing operation is then performed to fill all dark gaps in the white region. A 3D connected component analysis eliminates non-lung areas by removing components with small volumes or a considerable length ratio between the major and minor axes. The lung segmentation is intended to reduce computational time and the frequency of false positives for the tobogganing algorithm: as PEs exclusively appear within pulmonary arteries, exploring PE candidates outside the lungs is not required. The tobogganing algorithm is then applied specifically to the lung area, generating the PE candidate coordinates used to crop sub-volumes from the CTPA scan. The candidate generation process is detailed in Liang and Bi (2007) and Bi and Liang (2007). After candidate generation, each candidate is labeled as “PE” or “non-PE” based on the clot-based ground truth masks.

Tajbakhsh et al. (2015) developed a 3D vessel-oriented image representation: a principal component analysis was performed to determine the vessel axis and the two orthogonal directions. By rotating the orthogonal directions, a number of cross-sectional and longitudinal image planes were obtained. Finally, a 3-channel image representation was created for each PE candidate by selecting the middle slice from each axis. This 3-channel representation of each PE candidate is used to form the 2.5D orthogonal PE data. The 2D slice-based PE data was formed by copying the middle slice of the z-axis three times. Last, each candidate sub-volume itself was used as the 3D volume-based PE data. Fig. 2 shows the appearance of both the standard PE representation and the vessel-oriented PE representation. We note that the 2.5D image representation is equivalent to that reported by Tajbakhsh et al. (2015), the 2D image representation is included in this paper as a baseline, and the 3D vessel-oriented image representation has not been used in earlier studies.
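
Below is a minimal sketch of the lung segmentation step described above, assuming a CTPA volume in Hounsfield units; the −400 HU threshold follows the text, while the helper names, the minimum component size, and the omission of the axis-ratio check and the tobogganing step are simplifications for illustration rather than the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def segment_lungs(volume_hu, threshold=-400, min_voxels=50000):
    """Rough lung mask: threshold dark (air-filled) voxels, close gaps,
    and keep only large 3D connected components."""
    dark = volume_hu < threshold                        # lungs and other dark regions
    dark = ndimage.binary_closing(dark, iterations=3)   # fill small dark gaps
    labels, n = ndimage.label(dark)                     # 3D connected components
    mask = np.zeros_like(dark)
    for comp in range(1, n + 1):
        component = labels == comp
        if component.sum() >= min_voxels:               # drop small non-lung components
            mask |= component
    return mask
```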

Fig. 2. Illustration of standard image representation and vessel-oriented image representation (VOIR) in their 2D, 2.5D, and 3D forms (detailed in Section 3.4). In VOIR, emboli consistently appear as an elongated structure in the longitudinal view and as a circular structure in the cross-sectional view.

4. Methods

4.1. Slice-level classification in RSNA PE dataset

Slice-level classification refers to determining the presence or absence of PE for each slice. In this section, we describe the configurations of fully-supervised and self-supervised transfer learning in our work.

4.1.1. Fine-tuning fully-supervised pre-trained models

In transfer learning, a model pre-trained on a different source task is fine-tuned on the target dataset (Zhou et al., 2017a). In this set of experiments, models pre-trained on ImageNet under a fully-supervised setting are fine-tuned for slice-level PE classification using the RSNA PE dataset. For this purpose, we have extensively examined 12 different CNN architectures (Fig. 4), the Vision Transformer (ViT), and the Swin Transformer. Also, inspired by SeResNext50 and SeResNet50 (Xie et al., 2017), we add squeeze-and-excitation (SE) blocks to the Xception architecture (SeXception).1 We trained each model for only one epoch. The learning rate, batch size, and optimizer were 0.0004, 20, and Adam, respectively. We used four V100 GPUs to train and test the models. We have also explored the usefulness of ViT, in which the slices are reshaped into a sequence of patches. Upscaling the slice for a given patch size effectively increases the number of patches, thereby enlarging the size of the training dataset; similarly, the number of patches increases as the patch size decreases. Hence, to explore these two characteristics, we experimented with 32 × 32 and 16 × 16 patches (Devlin et al., 2018) as well as with slices of different sizes. Here, we explored both the Vision Transformer (ViT) and the Swin Transformer pre-trained on ImageNet1k and ImageNet21k.
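
For illustration, here is a minimal fine-tuning sketch using the timm library to load an ImageNet pre-trained backbone and attach a single-logit head for binary slice-level PE classification; resnet50 stands in for any of the 12 benchmarked backbones, and the training-loop details are illustrative rather than the exact configuration above.

```python
import timm
import torch
from torch import nn

# Load an ImageNet pre-trained backbone with a fresh 1-logit head (PE present/absent).
model = timm.create_model("resnet50", pretrained=True, num_classes=1)

optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
criterion = nn.BCEWithLogitsLoss()

def train_step(images, labels):
    """One optimization step on a batch of (3, 576, 576) windowed slice stacks."""
    optimizer.zero_grad()
    logits = model(images).squeeze(1)          # shape: (batch,)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```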

Fig. 4. For all 12 architectures, transfer learning outperforms random initialization for slice-level PE classification, despite the pronounced difference between ImageNet and RSNA PE datasets. Mean AUC and standard deviation over 10 runs are reported for each architecture. Compared with the previous state of the art (SeResNext50), the SeXception architecture achieves a significant improvement (p = 1.68E−4). Details are provided in the appendix (see Table A.1).

4.1.2. Fine-tuning self-supervised pre-trained models

Self-supervised learning eliminates the need for explicit annotations by training the model on a pretext task. The pre-trained model can then be fine-tuned on downstream tasks such as classification and segmentation, as demonstrated in Hu et al. (2021). One example of a pretext task is reconstructing the original image from a distorted version using strong augmentations. When it comes to transfer learning, pre-trained SSL models can be used to initialize the model’s weights, instead of using randomly initialized weights or weights from supervised ImageNet models. In our study, we collected 19 publicly available pre-trained SSL models and fine-tuned them on the RSNA PE dataset for the task of PE slice-level classification (as shown in Fig. 8). All of the models used ResNet50 (He et al., 2016) as the backbone architecture. The aim of this experiment is to compare the performance of models initialized with self-supervised learning against those initialized with fully-supervised learning, in order to understand the benefits of using SSL models in transfer learning.
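
A minimal sketch of how publicly released self-supervised ResNet50 weights can be loaded for fine-tuning, assuming the checkpoint stores a backbone state_dict; the checkpoint path and key prefixes vary between SSL methods and are illustrative, not tied to any specific release.

```python
import torch
from torch import nn
from torchvision.models import resnet50

def load_ssl_backbone(checkpoint_path, num_classes=1):
    """Initialize ResNet50 from an SSL checkpoint and attach a new PE classification head."""
    model = resnet50(weights=None)                       # random init, no supervised weights
    state = torch.load(checkpoint_path, map_location="cpu")
    state = state.get("state_dict", state)               # some checkpoints nest the weights
    # Strip common prefixes (e.g., "module." or "encoder.") used by SSL frameworks.
    state = {(k.split(".", 1)[-1] if k.startswith(("module.", "encoder.")) else k): v
             for k, v in state.items()}
    model.load_state_dict(state, strict=False)           # ignore projection-head keys
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```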

Fig. 8. Self-supervised pre-training extracts more transferable features compared with supervised pre-training. The purple line represents the AUC of supervised pre-training, while the gray line represents the AUC of learning from scratch, with the shaded areas indicating the standard deviation. Seven of 19 SSL methods (dark blue) outperform the supervised pre-training. The figure compares results obtained using ResNet50 as the backbone, since the publicly available SSL pre-trained models all use this backbone.

4.2. Exam-level diagnosis in RSNA PE dataset

In addition to slice-level classification, the RSNA PE dataset also provides exam-level labels, in which one set of labels is assigned to each exam. As a data pre-processing step, we use the slice-level embeddings extracted from the models trained for slice-level PE classification. We stack all the extracted slice-level embeddings together, resulting in an N × 2048 embedding matrix for each exam, where N denotes the number of slices in the exam. Following the first-place solution, we compute the embedding difference between the current slice and each of its two direct neighbors and concatenate these differences with the current slice embedding. Therefore, the input dimension for the exam-level diagnosis task expands to 6144 (2048 × 3). However, the number of slices per exam varies from exam to exam. To address this variability, the first-place solution reshaped the slice-level embeddings to K × 6144 through padding or resizing, with K fixed at 192. In our approach, we instead take advantage of the varying number of slices per exam rather than standardizing it. For the exam-level diagnosis task, we explored the two learning paradigms described in the following subsections.
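
The following is a minimal sketch of how the N × 6144 exam-level input described above can be assembled from per-slice embeddings; it is an illustrative reconstruction of the described pre-processing, not the authors' code, and edge slices simply reuse themselves as their missing neighbor.

```python
import torch

def build_exam_input(slice_embeddings):
    """slice_embeddings: (N, 2048) tensor of per-slice features, ordered by slice position.

    Returns an (N, 6144) tensor: [embedding, diff to previous slice, diff to next slice].
    """
    prev = torch.roll(slice_embeddings, shifts=1, dims=0)    # previous neighbor
    nxt = torch.roll(slice_embeddings, shifts=-1, dims=0)    # next neighbor
    prev[0] = slice_embeddings[0]                            # edge slices: zero difference
    nxt[-1] = slice_embeddings[-1]
    return torch.cat([slice_embeddings,
                      slice_embeddings - prev,
                      slice_embeddings - nxt], dim=1)        # (N, 2048 * 3)
```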

4.2.1. Sequence model learning

For sequence model learning, we employed the bidirectional Gated Recurrent Unit (BiGRU), a sequence model consisting of two GRUs (Cho et al., 2014): one GRU processes the input sequence in the forward direction while the other proceeds in the backward direction. The first-place solution in the RSNA challenge used a single BiGRU to generate an output sequence. The output sequence goes through max pooling and attention-weighted average pooling, and the results are concatenated into an exam-level embedding. The exam-level embedding then passes through 10 separate classification layers to predict nine exam-level labels and one slice-level label. The hidden size of the BiGRU is 512, and the architecture is trained for 25 epochs with a batch size of 64. The classification layer in this BiGRU design that predicts the slice-level label is flawed: because the number of slices per exam is inconsistent, the design linearly interpolates the slice-level ground truth to a fixed size of 192, and in doing so the expert annotation is compromised. A more effective approach would be a framework that handles variable numbers of slices and incorporates the expert annotation to ensure more accurate predictions.
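
Below is a minimal sketch of the BiGRU exam-level head described above (max pooling plus attention-weighted average pooling over the output sequence, followed by per-label classifiers); the hidden size follows the text, but the attention layer and other details are illustrative rather than the first-place solution's exact code.

```python
import torch
from torch import nn

class BiGRUHead(nn.Module):
    def __init__(self, in_dim=6144, hidden=512, num_labels=10):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)                      # attention score per slice
        self.heads = nn.ModuleList(nn.Linear(4 * hidden, 1) for _ in range(num_labels))

    def forward(self, x):                                         # x: (1, N, 6144), one exam
        seq, _ = self.gru(x)                                      # (1, N, 1024)
        max_pool = seq.max(dim=1).values                          # (1, 1024)
        weights = torch.softmax(self.attn(seq), dim=1)            # (1, N, 1)
        attn_pool = (weights * seq).sum(dim=1)                    # attention-weighted average
        exam_embedding = torch.cat([max_pool, attn_pool], dim=1)  # (1, 2048)
        return torch.cat([head(exam_embedding) for head in self.heads], dim=1)  # (1, 10) logits
```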

4.2.2. Embedding-based Vision Transformer (E-ViT)

We modified Vision Transformer (ViT) (Dosovitskiy et al., 2020) for the exam-level diagnosis as illustrated in Fig. 3. First, E-ViT takes a sequence of slice-level embeddings rather than image patches as input. The slice-level embeddings are extracted from the task of slice-level classification (Section 4.1), so they are more discriminative and informative than raw CT slices. Our E-ViT framework works with a varying number of slices, unlike the first-place solution which uses a fixed number of slices. Second, we use 10 class tokens (nine for each exam label and one for slice label) to address the multi-label classification task. Having the slice label alongside exam labels enforces the consistency between exam-level and slice-level predictions. Third, E-ViT discards the position embedding for each CT slice because neighboring slices in a CT scan appear similar; therefore, retaining the position embedding can lead to exam-level performance degradation, as evidenced in Table 6. In summary, E-ViT integrates class embedding with exam-level embedding to fully exploit the feature representations extracted from individual sequences with varying numbers of slices.

Fig. 3. Embedding-based Vision Transformer (E-ViT) for exam-level diagnosis. Unlike the original ViT architecture (Dosovitskiy et al., 2020), E-ViT takes a sequence of slice-level embeddings and multiple class tokens as input and discards the position embedding. Here, the slice-level embeddings are extracted from the models trained for slice-level PE classification (see Section 4.1). Subsequently, the Transformer encoder of E-ViT outputs both class embeddings and contextualized slice-level embeddings. The contextualized slice-level embeddings then go through attention-weighted average pooling and max pooling to form an exam-level embedding via concatenation. In summary, E-ViT integrates class embedding with exam-level embedding by utilizing the varying number of slices per exam, resulting in an enhanced feature representation extracted from individual sequences.

Table 6.

In our evaluation of the E-ViT framework, we investigate the impact of position embedding on the exam-level diagnosis. Our approach combines class embedding and exam-level embedding, and utilizes the SeResNext50 model to obtain the slice-level embeddings. The results, presented in the table, demonstrate a decrease in mean AUC score with the addition of position embedding, regardless of whether it is utilized together with class embedding, exam-level embedding, or as a standalone component. The experiments were repeated 10 times to calculate the standard deviation.

Class embedding Position embedding Exam-level embedding Mean AUC
89.16 ± 0.08
88.78 ± 0.10
88.72 ± 0.07
88.94 ± 0.05
89.08 ± 0.05
88.81 ± 0.09

We apply the same pre-processing to the embeddings extracted from the slice-level classification stage (presented in Section 4.2.1). We then incorporate a linear layer to reduce the slice-level embedding dimension from V × 6144 to V × 768 as the input of the Transformer encoder. Here, V represents the varying number of slices per exam, as the number of slices is not consistent across exams. Through our investigation of the RSNA PE dataset, we found that the upper bound of V can be set to 512, which is sufficient to cover the lung region for PE detection. When working with varying numbers of slices per exam, it is important to use the available data effectively to ensure accurate predictions. In our approach, if the number of slices for a particular exam is fewer than 512, we pad the sequence and use the attention masking mechanism from the BERT implementation (Devlin et al., 2018) to ensure that only the actual slices contribute to the model, focusing it on the most relevant information and improving efficiency. If the number of slices exceeds 512, the extra slices most likely cover regions other than the lungs, so we use only the 512 slices belonging to the lung area to ensure that the model processes relevant information. This approach allows us to handle varying numbers of slices effectively and optimize the use of the available data.

The Transformer encoder uses a multi-head self-attention mechanism to output contextualized slice-level embeddings (sized V × 768) and class embeddings (sized 10 × 768). We concatenate the contextualized slice-level embeddings after attention-weighted average pooling and max pooling. The concatenated exam-level embedding is then fed to 10 separate classification layers to predict the nine exam-level labels and one slice-level label. Similarly, each class embedding is used to predict the corresponding label through 10 different classification layers. The class embedding and exam-level embedding are integrated using the formula Avg(CE, AMP(EE)), where CE represents the class embedding and AMP(EE) represents the exam-level embedding (EE), i.e., the concatenation of attention-weighted average pooling and max pooling of the contextualized slice-level embeddings from the Transformer encoder. We name this architecture the Embedding-based Vision Transformer (E-ViT). The architecture is trained with the mean binary cross-entropy loss from the exam-level embedding and class embedding predictions. During inference, the mean probability score from these two branches is used to make the final exam-level diagnosis. For the initial set of hyperparameters, we adopted those of the original Vision Transformer (Dosovitskiy et al., 2020), specifically ViT-B. To further optimize the performance of our proposed E-ViT model, we conducted several experiments to find the best combination of hyperparameters, such as the learning rate, optimizer, batch size, and total number of epochs. According to our experiments, the best combination of hyperparameters for the E-ViT model is reported in Table 1.

Table 1.

Hyperparameter settings used for the E-ViT model.

Hyperparameter Value
Learning rate 0.001
Optimizer Adam
Batch size 32
Learning rate scheduler ReduceLROnPlateau
Total number of epochs 100
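
To make the E-ViT head concrete, here is a minimal sketch assuming a standard PyTorch Transformer encoder: 10 learnable class tokens are prepended to the projected slice embeddings, no position embedding is added, and the final prediction averages the class-token branch with the pooled exam-level-embedding branch. Module names, depth, and other details are illustrative, not the authors' implementation.

```python
import torch
from torch import nn

class EViT(nn.Module):
    def __init__(self, in_dim=6144, dim=768, num_labels=10, depth=12, heads=12):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)                            # V x 6144 -> V x 768
        self.class_tokens = nn.Parameter(torch.zeros(1, num_labels, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)            # note: no position embedding
        self.attn = nn.Linear(dim, 1)                                 # attention score per slice
        self.ce_heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_labels))
        self.ee_heads = nn.ModuleList(nn.Linear(2 * dim, 1) for _ in range(num_labels))

    def forward(self, x):                                             # x: (1, V, 6144), one exam
        n = self.class_tokens.shape[1]
        tokens = torch.cat([self.class_tokens, self.proj(x)], dim=1)  # prepend class tokens
        out = self.encoder(tokens)
        ce, slices = out[:, :n], out[:, n:]                           # class / contextualized slice embeddings
        weights = torch.softmax(self.attn(slices), dim=1)
        exam = torch.cat([(weights * slices).sum(1),                  # attention-weighted average pooling
                          slices.max(1).values], dim=1)               # + max pooling -> AMP(EE), (1, 1536)
        ce_logits = torch.cat([h(ce[:, i]) for i, h in enumerate(self.ce_heads)], dim=1)
        ee_logits = torch.cat([h(exam) for h in self.ee_heads], dim=1)
        return (ce_logits + ee_logits) / 2                            # average the two branch predictions
```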

4.3. False positive reduction in PE-CAD dataset

As discussed in Section 3.4, the PE candidates are generated with location coordinates. The sub-volumes are cropped to ensure that each PE candidate is at the center of its sub-volume. We compute candidate-level AUC for classifying true positives and false positives for comparison with prior studies (Zhou et al., 2017a; Tajbakhsh et al., 2016, 2019). We highlight the significance of the vessel-oriented representation over the standard image representation. We incorporate ResNet18 as our backbone for PE false positive reduction and evaluate performance with training from scratch and with transfer learning from ImageNet weights. We use a learning rate of 0.001, a batch size of 32, the Adam optimizer, and a patience of 38, with early stopping based on the validation loss. The image size for 2D and 2.5D input data is 224 × 224, while we use an image size of 64 × 64 × 64 for 3D input data.
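
A minimal sketch of the early-stopping logic (stop when the validation loss has not improved for a fixed number of epochs); the patience value matches the text, while the class and method names are illustrative.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=38):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs > self.patience   # True -> stop training
```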

5. Results and discussions

In this section, we present an analysis of the slice-level classification and exam-level diagnosis on the RSNA PE dataset, followed by the PE false positive reduction on the PE-CAD dataset.

5.1. Slice-level classification

5.1.1. Fine-tuning pre-trained models outperforms training from scratch

Fig. 4 shows a significant performance gain for every pre-trained model compared with random initialization. There is also a moderate positive correlation of 0.5914 between ImageNet and PE classification performance across different architectures (Fig. 5). The correlation indicates that the useful weights learned from ImageNet can be successfully transferred to the PE classification task, despite the modality difference between photographic images (ImageNet) and CTPA scans (RSNA PE). To demonstrate the interpretability of the trained model, we use GradCam++ (Gildenblat and contributors, 2021) to visualize the attention map of the SeXception architecture, which performed the best as a standalone model. As shown in Fig. 6, the attention map effectively highlights the potential PE location in the slice.

Additionally, to assess the generalizability of the SeXception model trained on the RSNA PE dataset, we directly test the model on two additional datasets: CAD-PE (González et al., 2020) and FUMPE (Masoudi et al., 2018). The results are summarized in Table 2, and as shown in Fig. 7, GradCam++ highlights the PE region. It is noteworthy that the model was trained only on the RSNA PE dataset and had not seen the two additional datasets prior to evaluation; we evaluated these datasets for slice-level classification to determine whether a slice has PE or not, and the model had not seen their segmentation ground truth either. Despite this, the model not only accurately classified slices with PE but also efficiently localized them. The green highlighted area in the figure represents the ground truth of the given segmentation mask, whereas the yellow highlighted area represents the predicted PE region. These experiments demonstrate not only the feasibility of using deep learning for diagnosing challenging radiologic findings such as PE on CTPA, but also the ability to accommodate data from external institutions employing different CT scanners and imaging protocols. Moreover, the superior performance of fine-tuning pre-trained models highlights the importance of transfer learning in medical image analysis and could have implications for improving the diagnostic accuracy and efficiency of PE detection in clinical practice. According to Table 2, SeXception achieves a sensitivity of 87.57% and a specificity of 87.60% on the CAD-PE Challenge dataset, and a sensitivity of 83.85% and a specificity of 83.95% on the FUMPE dataset.

We also ensemble the slice-level classification results on the RSNA PE dataset by averaging the predictions of Xception & SeXception, SeResNext50 & SeXception, and SeResNext50 & Xception & SeXception (Table 4). The ensemble of SeResNext50 & Xception & SeXception achieves the best AUC of 96.76 ± 0.05, a gain of 0.62% compared with the first-place solution.
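
As a minimal sketch of the ensembling used here, slice-level probabilities from the individual backbones are simply averaged; the model list and function name are illustrative.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, images):
    """Average slice-level PE probabilities from several trained backbones."""
    probs = [torch.sigmoid(m(images).squeeze(1)) for m in models]  # each: (batch,)
    return torch.stack(probs).mean(dim=0)                          # ensemble probability

# Example: ensemble_predict([seresnext50, xception, sexception], batch_of_slices)
```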

Fig. 5. There is a positive correlation between the results on the ImageNet and RSNA PE datasets (R = 0.5914), suggesting that transfer learning performance can be inferred from ImageNet pre-training performance.

Fig. 6. The SeXception attention map highlights the potential PE location in the slice using GradCam++.

Table 2.

The pre-trained SeXception model, which was trained on the RSNA PE dataset, demonstrates promising results for PE slice-level classification on two additional, unseen datasets: the CAD-PE Challenge and FUMPE.

Dataset No. of slices Sensitivity (%) Specificity (%) AUC (%)
CAD-PE challenge 41,256 87.57 87.60 95.61
FUMPE 8,792 83.85 83.95 93.01
Fig. 7. Evaluation of the SeXception model trained on the RSNA PE dataset on the CAD-PE Challenge Dataset (González et al., 2020) and the Ferdowsi University of Mashhad’s PE (FUMPE) dataset (Masoudi et al., 2018) shows promising results for localizing the PE region on the two unseen datasets. The figure shows the ground truth (green) and predicted PE region (yellow).

Table 4.

The embeddings extracted by the slice-level classification model are passed through BiGRU for exam-level diagnosis. However, no model performs optimally across all exam-level labels. We report the mean AUC over 10 runs and bolded the optimal results for each label.

Labels SeResNext50 Xception SeXception Xception & SeXception SeResNext50 & SeXception SeResNext50 & Xception & SeXception
Slice-level 96.14 ± 0.11 96.07 ± 0.07 96.34 ± 0.09 96.69 ± 0.04 96.64 ± 0.06 96.76 ± 0.04
NegExam PE 91.37 ± 0.09 92.42 ± 0.05 92.61 ± 0.06 92.89 ± 0.04 92.34 ± 0.06 92.79 ± 0.04
Indeterminate 88.02 ± 0.70 91.68 ± 0.29 88.57 ± 0.46 91.28 ± 0.27 90.75 ± 0.40 88.88 ± 0.43
Left PE 90.30 ± 0.11 91.19 ± 0.07 91.00 ± 0.11 91.32 ± 0.08 90.91 ± 0.07 91.18 ± 0.10
Right PE 93.68 ± 0.08 94.19 ± 0.05 94.55 ± 0.05 94.68 ± 0.03 94.27 ± 0.05 94.73 ± 0.04
Central PE 95.43 ± 0.18 95.00 ± 0.17 94.87 ± 0.25 95.07 ± 0.15 95.34 ± 0.16 94.97 ± 0.23
RV/LV Ratio ≥ 1 89.02 ± 0.18 89.24 ± 0.09 89.01 ± 0.11 89.44 ± 0.09 89.41 ± 0.12 89.28 ± 0.11
RV/LV Ratio < 1 86.30 ± 0.12 87.22 ± 0.12 87.71 ± 0.11 88.07 ± 0.08 87.32 ± 0.07 87.91 ± 0.11
Chronic PE 72.54 ± 0.54 77.63 ± 0.36 73.61 ± 0.38 76.54 ± 0.34 75.81 ± 0.27 74.04 ± 0.38
Acute&Chronic PE 85.98 ± 0.31 83.52 ± 0.33 84.73 ± 0.25 84.23 ± 0.23 84.80 ± 0.22 84.80 ± 0.25
Mean AUC(%) 88.07 ± 0.12 89.12 ± 0.05 88.52 ± 0.07 89.28 ± 0.06 88.99 ± 0.08 88.73 ± 0.06

The ensemble of Xception, SeXception and SeResNext50 achieves a significant improvement (p = 5.94E−13) over the previous state of the art.

5.1.2. Self-supervised pre-training can surpass fully-supervised pre-training

As summarized in Fig. 8, SeLa-v2 (Asano et al., 2020) and DeepCluster-v2 (Caron et al., 2018) have achieved the best AUC of 95.68 ± 0.05 and 95.68 ± 0.06, followed by Barlow Twins (Zbontar et al., 2021). Seven of 19 self-supervised ImageNet models performed better than supervised pre-trained ResNet50 (Fig. 8). Six self-supervised methods underperform fully-supervised pre-trained ResNet50 but outperform learning from scratch. Finally, six self-supervised methods underperform ResNet50 learning from scratch. These results also highlight the importance of the pretext tasks for self-supervised learning, and the performance can vary significantly with these tasks for the same backbone. In the context of clinical practice, the self-supervised pre-training provides a good initialization for fine-tuning the downstream task of PE slice-level classification. However, an unfavorable pretext task leads to inferior performance on the downstream task compared to scratch training. We believe that a good pretext task based on large-scale PE detection will provide a stronger initialization for robust and efficient slice-level PE classification. On the other hand, SSL can be useful in the clinical practice of PE slice-level classification due to its ability to learn from unlabeled data. In a clinical setting, obtaining annotated data for PE classification can be challenging and time-consuming, and SSL can leverage the large amounts of available unlabeled data to learn a representation that can be used for classification tasks.

The details and comparison of all these self-supervised methods are provided in the appendix (see Table A.2).

5.1.3. Transformer-based models demonstrate comparable performance with slower convergence speed

Table 3 shows that the Swin Transformer (Swin-B) pre-trained with the SimMIM self-supervised approach outperforms the other transformer-based models, attaining an AUC of 96.46 when using a larger image size of 448 × 448. Notably, Swin-T pre-trained on ImageNet1k achieves an AUC of 96.30, coming remarkably close to the performance of the best standalone CNN architecture (SeXception). However, fine-tuning transformer-based models required substantially more time and a greater number of training epochs to reach convergence. Also, the performance of the transformer-based models varies significantly with the training set size. As the transformer-based models are trained on image patches, two approaches are followed to expand the training set:

  1. The patch size is decreased for a given image size, effectively increasing the total number of patches.

  2. The image size is increased for a given patch size, thus enhancing the total patch count.

The results demonstrate the crucial role of pre-training as initialization for transformer-based models (Ma et al., 2022). In contrast to CNNs, the transformer-based models have far more parameters and took more than 16 h to finish one epoch of training, whereas CNNs took only about 3 h per epoch despite the huge number of slices. Details are provided in the appendix (see Table A.3).

Table 3.

For the slice-level PE classification task, ViT models exhibit inferior performance compared to CNNs. Conversely, Swin Transformers demonstrate performance on par with SeXception. It is worth noting that both ViT and Swin Transformer models tend to converge at a slower rate. We evaluate the effectiveness of ViT and Swin Transformer backbones for this task under various settings. Initializing these models with ImageNet pre-training significantly enhances performance, similar to CNNs. Additionally, initializing the weights from a self-supervised pre-trained model further improves performance.

Backbone Image size Patch size Initialization AUC (%)
SeXception 576 N/A ImageNet1k 96.34
ViT-B 512 32 Random 82.12
ViT-B 224 32 ImageNet21k 84.56
ViT-B 512 32 ImageNet21k 88.47
ViT-B 512 16 Random 83.85
ViT-B 224 16 ImageNet21k 88.26
ViT-B 512 16 ImageNet21k 90.65
ViT-B 576 16 ImageNet21k 91.79
ViT-B 576 16 SimMIM 91.39
ViT-B 576 16 MoBY 90.71
Swin-B 224 16 Random 90.63
Swin-B 224 16 ImageNet1k 94.85
Swin-B 224 16 ImageNet21k 94.58
Swin-B 224 16 SimMIM 95.27
Swin-B 224 16 MoBY 94.56
Swin-T 224 16 ImageNet1k 88.36
Swin-B 448 16 SimMIM 96.46
Swin-T 448 16 ImageNet1k 96.30

5.1.4. Squeeze and excitation (SE) block further enhances CNN performance

Despite having fewer parameters than many other architectures, SeXception provides the optimal average AUC of 96.34 ± 0.09. Thus, SE blocks are parameter-efficient and have led to performance improvements for a variety of CNN architectures, such as ResNet50, ResNet101, ResNext50, ResNext101, and Xception (Fig. 9). This observation is consistent with Hu et al. (2018a), which infers that SE blocks capture feature dependencies among channels to improve performance. Here, the SE blocks in a CNN enhance PE detection by finding relationships between consecutive CT slices (the input channels), thereby improving the model’s ability to identify and classify PEs with higher accuracy and efficiency. Additionally, the SE blocks facilitate learning more discriminative features and improve the handling of large variability in medical images, resulting in enhanced accuracy and robustness in slice-level PE classification for CTPA data. Consequently, the utilization of SE blocks is a valuable tool in clinical practice.

Fig. 9. We observe a performance gain with the help of the SE block. Note that all architectures are pre-trained on ImageNet.
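
For reference, here is a minimal PyTorch sketch of a squeeze-and-excitation block as described by Hu et al. (2018a): global average pooling "squeezes" each channel to a scalar, a small bottleneck MLP produces per-channel weights, and the feature map is rescaled accordingly. This generic block stands in for the SE modules added to Xception here; the reduction ratio and names are illustrative.

```python
import torch
from torch import nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels using globally pooled statistics."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze to a bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite back to per-channel weights
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channels, H, W)
        scale = x.mean(dim=(2, 3))             # global average pooling -> (batch, channels)
        scale = self.fc(scale)[:, :, None, None]
        return x * scale                       # channel-wise rescaling
```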

5.2. Exam-level diagnosis

5.2.1. Sequence models benefit from more comprehensive slice-level embeddings

Table 4 summarizes the results of sequence model learning (explained in Section 4.2.1) for exam-level diagnosis. Although the combination of SeResNext50, Xception, and SeXception performs optimally for slice-level classification (see Table 4), this is not the case for exam-level diagnosis. As a standalone CNN backbone, Xception achieves the best mean AUC across the nine exam-level labels, gaining 1.05% compared with the previous state of the art (SeResNext50). We also combine the exam-level diagnosis results by taking the average from Xception & SeXception, SeResNext50 & SeXception, and SeResNext50 & Xception & SeXception. We observe that Xception & SeXception achieves a significant gain of 1.21% (p = 5.94E−13) over the state of the art. Subsequently, we harnessed the more comprehensive slice-level embeddings derived from Swin-B and Swin-T to enhance the exam-level diagnosis (Table 5). Swin-T as the backbone establishes itself as the leading performer for exam-level PE diagnosis with sequence model learning, achieving an AUC of 90.11 and a notable improvement of 0.83 (p = 6.81E−14) compared to the ensemble of Xception and SeXception.

Table 5.

We used the extracted embeddings from the slice-level classification as input to our E-ViT for exam-level diagnosis. FE and CE stand for exam-level embedding and class embedding from the ViT encoder, respectively. E-ViT represents the proposed framework, in which the loss is the average of the FE and CE losses during training. At the inference stage, the outputs of FE and CE are averaged to obtain the exam-level prediction. We conduct each experiment 10 times and report the mean AUC over all nine exam-level labels with standard deviation. The best result is highlighted in bold.

Slice-level embeddings Seq. models Transformer models
BiGRU FE only CE only E-ViT (ours)
SeResNext50 88.07 ± 0.13 89.16 ± 0.07 88.94 ± 0.05 89.08 ± 0.05
Xception 89.12 ± 0.05 89.20 ± 0.07 88.99 ± 0.06 89.12 ± 0.06
SeXception 88.52 ± 0.07 88.32 ± 0.08 88.36 ± 0.06 88.38 ± 0.07
Ensemble: Xception & SeXception 89.28 ± 0.06 89.07 ± 0.15 89.04 ± 0.20 89.28 ± 0.07
Ensemble: Xception & SeResNext50 88.99 ± 0.08 89.42 ± 0.11 89.46 ± 0.09 89.63 ± 0.04
Ensemble: Xception & SeXception & SeResNext50 88.73 ± 0.06 89.36 ± 0.12 89.39 ± 0.11 89.58 ± 0.06
Swin-B 88.84 ± 0.08 88.53 ± 0.09 88.56 ± 0.11 88.61 ± 0.07
Swin-T 90.11 ± 0.09 90.06 ± 0.12 90.07 ± 0.14 90.29 ± 0.12

5.2.2. Our E-ViT further exceeds sequence models

Table 5 reports the mean AUC for all exam-level predictions with the sequence model and our proposed Transformer-based model. In our evaluation, utilizing only the class embedding for exam-level diagnosis results in performance comparable to the sequence model. Conversely, utilizing only the exam-level embedding from the Transformer encoder leads to superior performance, surpassing both the sequence model and E-ViT with class embedding alone. The integration of class embedding and exam-level embedding in our E-ViT framework further improves exam-level diagnosis. These observations are consistent across slice-level embeddings extracted from three different backbones (i.e., SeResNext50, Xception, and SeXception). When the slice-level embeddings are extracted from SeResNext50 and used as inputs, our E-ViT outperforms the sequence model by an AUC gain of 1.01%. We also analyze the performance achieved by combining two or three architectures by simply taking the mean prediction for each exam-level label across slice-level embeddings extracted from different backbones. The best AUC performance (89.63) is achieved by the ensemble of Xception and SeResNext50, gaining 1.56% over the first-place solution with a statistical significance of p = 2.19E−11 when using CNN architectures as the backbone. We further utilized slice-level embeddings from Swin-B and Swin-T in our E-ViT for exam-level PE diagnosis. Notably, the pairing of Swin-T as the backbone with our E-ViT yielded the best performance, achieving an AUC of 90.29 for exam-level diagnosis. While the combination of Swin-T and BiGRU exhibited similar results, our E-ViT outperformed it significantly (p = 8.61E−4). In clinical practice, the number of slices in a CT scan can vary greatly between patients, making it challenging to use traditional models that require a fixed number of inputs. Our proposed Transformer-based model, E-ViT, is designed to tackle this issue by adapting to varying numbers of slices without losing any expert annotations. This is a crucial aspect of its use in clinical practice, as it allows for a more efficient and accurate diagnosis of PE. In the following, we present ablation studies on our E-ViT.

Position embedding decreases performance.

We conduct experiments to analyze the significance of position embedding in E-ViT for exam-level PE diagnosis. Position embedding is a learnable parameter added to patches in the original ViT architecture to retain positional information. We have slice-level embeddings rather than image patches as inputs to our E-ViT. In Table 6, we show that adding position embedding to the slice-level embeddings decreases the performance of the model. Unlike photographic images, where each patch constitutes distinctive visual features, consecutive slices in a CTPA scan are similar to one another. We hypothesize that similar consecutive slices lead to similar embeddings, making it difficult for E-ViT to learn a meaningful position embedding.

Benefits of combining class and exam-level embeddings.

The original ViT only uses the class embedding for classification tasks and ignores all other features extracted from the Transformer encoder. We conduct experiments to analyze the effectiveness of the class embedding, the exam-level embedding, and both embeddings used simultaneously for exam-level PE diagnosis. Table 5 shows that the class embedding alone performs worse than both the exam-level embedding alone and the combination of the two embeddings, for all models. Based on this observation, we hypothesize that the class embedding cannot provide rich semantic features from similar-appearing medical slices. The exam-level embedding provides a better feature representation; moreover, fusing the exam-level embedding with the class embedding provides the best results.

Comparison with MIL-VT.

Multiple Instance Learning Enhanced Vision Transformer (MIL-VT) (Yu et al., 2021) introduced a multiple instance learning (MIL) head to the original ViT to enhance the learned features. The embeddings from the MIL head were combined with class embedding for final classification. Similarly, our E-ViT also uses both class embedding and exam-level embedding from the Transformer encoder, but it is different from MIL-VT regarding the architecture design of the MIL head. MIL-VT uses lower dimension embedding with an attention aggregator on the features extracted from the Transformer encoder, whereas we use attention weighted average pooling with max pooling. During training, we take the average loss of each branch (class embedding and exam-level embedding), whereas during testing we compute the mean predictions from each branch. Table 7 shows our E-ViT significantly outperforms MIL-VT by 0.28% AUC for exam-level diagnosis. We also observe that position embedding continues to decrease the performance of both E-ViT and MIL-VT.

Table 7.

Our proposed E-ViT framework outperforms MIL-VT significantly. Without the use of position embedding, E-ViT demonstrates a clear advantage. The incorporation of position embedding, however, leads to a decline in mean AUC score for both frameworks, narrowing the performance gap. Both frameworks were fed slice-level embeddings from the SeResNext50 model and the experiments were conducted 10 times to determine the p-value.

Position embedding MIL-VT E-ViT (Ours) P-value
With 88.83 ± 0.06 88.81 ± 0.09 4.23E-01
Without 88.80 ± 0.13 89.08 ± 0.05 3.17E-05

5.3. False positive reduction

5.3.1. 3D data offer higher performance than 2D and 2.5D data

While it may appear that using 3D models to handle 3D volumetric data is a reasonable choice, it comes with a high computing cost and a risk for overfitting (Zhou et al., 2021b). As a result, various alternative methodologies for reformatting 3D applications into 2D solutions have been presented. Regular 2D inputs were constructed by extracting neighboring axial slices (referred to as 2D slice-based input) by Ben-Cohen et al. (2016) and Sun et al. (2017). These 2D reformatted solutions create large amounts of data and take advantage of 2D models pre-trained on ImageNet. However, 2D solutions unavoidably compromise the rich spatial information in 3D volumetric data and the large capacity of 3D models. Alternatively, a more advanced technique, as detailed by Prasoon et al. (2013) and Roth et al. (2014, 2015), is to extract axial, coronal, and sagittal slices from volumetric data (referred to as 2.5D orthogonal input).

Table 8 compares the 2D slice-based, 2.5D orthogonal, and 3D volume-based approaches as different image dimensions for PE false positive reduction. According to our analyses, 3D volume-based data consistently outperforms 2D and 2.5D data because it contains significantly more information. When the model is trained from scratch, 3D volume-based data outperforms 2.5D orthogonal data in both the non-VOIR and VOIR scenarios (80.07% vs. 78.81% and 91.35% vs. 86.51%, respectively). The 3D volume-based data also significantly outperforms the 2D slice-based and 2.5D orthogonal approaches, with p-values of 7.94E−13 for the non-VOIR and 2.19E−10 for the VOIR scenarios, respectively. These results indicate that the use of 3D data in CTPA can lead to a reduction in false positive results compared to 2D or 2.5D data, because 3D data provides a more comprehensive and detailed view of the blood vessels in the lungs, which can aid in the accurate identification and differentiation of true positives from false positives. Overall, the use of 3D data in CTPA can improve the accuracy and reliability of the results, leading to better patient care in the clinical context.

Table 8.

We evaluate vessel-oriented image representation (VOIR) (Tajbakhsh et al., 2019) against the standard image representation across 2D, 2.5D, and 3D solutions for the task of reducing PE false positives. All results in the table are candidate-level AUC, reported as the mean and standard deviation across 10 trials.

False positive reduction (w/o VOIR) Random ImageNet Models Genesis
2D slice-based input 63.03 ± 1.54 63.29 ± 1.12 72.11 ± 1.45
2.5D orthogonal input 78.81 ± 0.62 81.36 ± 0.67 85.34 ± 0.24
3D volume-based input 80.07 ± 4.30 N/A 87.66 ± 1.85
False positive reduction (w/ VOIR) Random ImageNet Models Genesis
2D slice-based input 86.02 ± 2.10 85.81 ± 1.56 86.74 ± 1.47
2.5D orthogonal input 86.51 ± 3.52 87.29 ± 0.63 89.08 ± 0.69
3D volume-based input 91.35 ± 1.34 N/A 92.48 ± 0.95

5.3.2. VOIR is more informative than the standard image representation, boosting performance across image dimensions

Developing a computer-aided diagnosis (CAD) system that can accurately detect pulmonary embolism (PE) while maintaining a clinically acceptable level of false positives is highly desirable (Tajbakhsh et al., 2016). CTPA is the primary imaging modality used to diagnose PE, and appropriate pre-processing of these images is necessary to ensure accurate detection and reduce false positives. One approach to improve the performance of the model is through the use of vessel-oriented image representation (VOIR) (Tajbakhsh et al., 2019), which promotes consistency in the data and has been shown to be effective for PE detection. However, previous implementations of VOIR were limited to a 2D representation and did not fully exploit the 3D information available in CTPA images. In this study, we extend VOIR to a 3D form (Section 4.3) and evaluate its performance under various settings to improve the CAD system’s ability to accurately detect PE and reduce false positives for better clinical outcomes.

Table 8 summarizes the results with and without VOIR using 2D, 2.5D, and 3D input data as different image dimensions. The AUC is higher with VOIR data in all image dimensions (2D, 2.5D, and 3D). As discussed in Section 5.3.1, although 2.5D data is expected to be more valuable than 2D data (Zhou, 2021), our results show that 2D data with VOIR significantly outperforms 2.5D data without VOIR, with an AUC gain of 7.21%. Similarly, 2D slice-based VOIR data outperforms 3D volume-based non-VOIR data (a 5.95% gain), even though 3D data contains significantly more information. These results show that the vessel-oriented image representation is more informative than the standard image representation for reducing PE false positives.

5.3.3. Same domain transfer learning with self-supervised pre-training enhances performance across image representations and dimensions

One of the most commonly utilized paradigms in deep learning for medical image analysis is transfer learning from photographic images to medical images. Medical images, such as those used in radiology and diagnostic imaging, are distinct from photographic images in that they are often monochromatic and feature consistent anatomical structures, characteristics that can make them more challenging to interpret and analyze (Hosseinzadeh Taher et al., 2021). Therefore, same-domain transfer learning should be applied, since a smaller domain gap makes the learned image representation more effective for target tasks (Zhou et al., 2021b). Models Genesis (Zhou et al., 2019), a self-supervised method, were developed to mitigate this domain gap. Specifically, Models Genesis were pre-trained on LUNA16 (Setio et al., 2016), a chest CT dataset that shares the same body part and imaging modality with our PE-CAD dataset. We use Models Genesis with ResNet18 as the backbone for the task of PE false positive reduction. We also evaluate recently published 3D self-supervised approaches (Guo et al., 2022), likewise pre-trained on the LUNA16 dataset, on the same task. Owing to their self-supervised, same-domain nature, we anticipate improved performance across image representations (non-VOIR vs. VOIR) and dimensions (2D vs. 2.5D vs. 3D) compared with supervised pre-trained models.
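As a concrete picture of this same-domain transfer setup, the sketch below loads self-supervised pre-trained 3D encoder weights into a target classifier before fine-tuning. The checkpoint filename, the key layout, and the resnet18_3d constructor are hypothetical placeholders rather than the actual Models Genesis release.

```python
import torch
import torch.nn as nn

def load_pretrained_encoder(model: nn.Module, ckpt_path: str) -> nn.Module:
    """Generic sketch: initialize a 3D classifier's encoder with self-supervised
    pre-trained weights, then fine-tune end-to-end on the target task.
    The checkpoint path and key layout here are hypothetical."""
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("state_dict", state)
    # Keep only weights whose names and shapes match the target model.
    own = model.state_dict()
    matched = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    own.update(matched)
    model.load_state_dict(own)
    print(f"initialized {len(matched)}/{len(own)} tensors from {ckpt_path}")
    return model

# Fine-tuning then proceeds as usual, typically with a small learning rate, e.g.:
# model = load_pretrained_encoder(resnet18_3d(num_classes=1), "pretrained_chest_ct.pt")
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```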

According to Table 8, weights initialized from Models Genesis consistently outperform their ImageNet counterparts. For 2D slice-based non-VOIR data, Models Genesis initialization achieves an AUC of 72.11% ± 1.45 compared with 63.29% ± 1.12 for ImageNet initialization, a significant gain of 8.82% (p-value = 2.63E−11). Similarly, for 2.5D orthogonal non-VOIR inputs, the gain is 3.98% with a p-value of 1.34E−9. A similar trend is observed for VOIR data, with gains of 0.93% and 1.79% for 2D and 2.5D input data, respectively. Hence, self-supervised pre-training with Models Genesis enhances performance across image representations and dimensions. These results demonstrate the advantage of same-domain transfer learning in medical image analysis: a model pre-trained on similar medical images can reuse its previously learned features to adapt efficiently to a new task (Zhou et al., 2019), which can in turn enhance the accuracy and efficiency of medical image analysis in the clinical context, ultimately leading to better patient care and outcomes. Furthermore, we evaluate the latest 3D self-supervised pre-trained models on the 3D VOIR dataset. According to Table 9, the TransVW model with the pre-training strategy of ((D) + R) + A performs best in reducing false positives on 3D VOIR data, achieving an AUC of 93.41%. Here, D denotes the discriminative encoder, R the restorative decoder, and A the adversarial encoder.

Table 9.

We compare the performance of recent 3D self-supervised methods as outlined in Guo et al. (2022) on the 3D VOIR dataset, focusing on the task of reducing false positives. Our results are based on ten runs and are presented as mean and standard deviation values. Our study demonstrates that using a self-supervised pre-trained model with the same domain transfer learning can significantly improve false positive reduction performance on the 3D VOIR dataset.

Method Approach AUC (%)
TransVW (Haghighi et al., 2021) ((D)+R) 92.57 ± 0.76
(((D)+R)+A) 93.41 ± 0.77
Rubik’s Cube (Zhuang et al., 2019) ((D)+R) 92.39 ± 0.69
(((D)+R)+A) 92.66 ± 0.57
Rotation (Gidaris et al., 2018) ((D)+R) 91.25 ± 1.82
(((D)+R)+A) 92.18 ± 0.68
Jigsaw (Noroozi and Favaro, 2016) ((D)+R) 92.30 ± 0.57
(((D)+R)+A) 92.51 ± 0.51
Deep Clustering (Caron et al., 2018) ((D)+R) 92.30 ± 0.69
(((D)+R)+A) 92.95 ± 1.13

6. Conclusion

For the RSNA PE dataset, the existing first-place solution utilizes SeResNext50 for slice-level classification and a bidirectional GRU for exam-level diagnosis. Our optimized approach achieves a significant increase in AUC for both slice-level classification and exam-level diagnosis. Through our rigorous analysis, we have determined that the optimal architecture for slice-level classification is an ensemble of Xception, SeXception, and SeResNext50, resulting in an AUC gain of 0.62%. To further improve exam-level diagnosis performance, we propose a novel E-ViT model that offers a significant performance gain of 2.22%. Our experiments on our in-house PE-CAD dataset show that our vessel-oriented image representation, as a pre-processing step, is critical for reducing false positives and has a considerable positive impact across image dimensions, boosting performance from 63.03% to 86.02% in 2D, from 78.81% to 86.51% in 2.5D, and from 80.07% to 91.35% in 3D. Moreover, self-supervised pre-training with Models Genesis yields a further improvement across image representations and dimensions, raising the VOIR results to 86.74% in 2D, 89.08% in 2.5D, and 92.48% in 3D. Finally, our study shows that using the self-supervised TransVW model with (((D)+R)+A) on the 3D VOIR data significantly improves false positive reduction performance, achieving an outstanding AUC of 93.41%.

Acknowledgments

This research has been supported in part by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant, and in part by the NIH, United States under Award Number R01HL128785. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. This work has utilized the GPUs, made available in part by ASU Research Computing, Bridges-2 at the Pittsburgh Supercomputing Center (allocated under BCS190015), and Anvil at Purdue University (allocated under MED220025), all of which are supported by the Advanced Cyber-infrastructure Coordination Ecosystem: Services & Support (ACCESS) program. This program is supported by National Science Foundation, United States grants #2138259, #2138286, #2138307, #2137603, and #2138296. We thank Ruibin Feng for aggregating 19 self-supervised pre-trained models and Jae Shin and Douglas Amoo-Sargon for creating the 3D VOIR dataset. We also acknowledge the exploration and preliminary experiments by Utkarsh Nath, which have been redesigned and replaced in this version. We extend our gratitude to Zuwei Guo for his efforts in preparing the pre-trained models for the latest 3D self-supervised approaches and for his contributions in experimenting with the 3D VOIR target task. The content of this paper is covered by patents pending.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Jianming Liang reports financial support was provided by National Institutes of Health. Jianming Liang has patent pending to Arizona State University. Jianming Liang is an associate editor of Medical Image Analysis.

Appendix A. Tabular results

Table A.1.

The tabular results of Fig. 4.

Backbone Parameters Random ImageNet
ResNet101 44.5M 87.62±1.19 94.51±0.17
ResNet50 23.5M 90.37±1.32 94.73±0.12
DRN-A-50 23.5M 91.22±0.96 94.96±0.11
ResNet18 11M 88.74±1.66 95.20±0.07
ResNext50 25M 91.91±3.16 95.32±0.10
DenseNet121 6.9M 93.17±0.60 95.43±0.11
SeResNet50 28M 86.54±2.93 95.73±0.13
SeNet154 27.5M 87.84±2.92 96.07±0.12
SeResNet101 49M 89.90±1.91 95.76±0.10
Xception 20M 94.84±0.13 96.07±0.07
SeResNext50 27.5M 87.46±4.86 96.14±0.11
SeXception 21.5M 94.98±0.25 96.34±0.09

Table A.2.

The tabular results of Fig. 8.

Pre-training Mean AUC
Random 90.37±1.32
ImageNet 94.73±0.12
SimSiam 85.96±1.06
PIRL 89.17±2.61
MoCo-v2 89.20±2.92
MoCo-v1 90.29±1.92
MoCo-v3 90.35±1.24
InfoMin 90.36±1.84
OBoW 90.96±2.21
InsDis 91.16±1.11
SimCLR-v2 93.66±0.29
PCL-v2 93.78±0.31
CLSA 94.21±0.25
PCL-v1 94.34±0.20
SimCLR-v1 95.45±0.11
DINO 95.60±0.10
BYOL 95.63±0.05
SwAV 95.63±0.10
BarlowTwins 95.66±0.07
SeLa-v2 95.68±0.29
DeepCluster-v2 95.68±0.06

Table A.3.

An overview of the parameter count and training time for all models used in our research.

Stage Model Name Parameters Training time
1 DenseNet121 6.9M 2h43m
ResNet18 11M 3h10m
Xception 20M 3h15m
SeXception 21.5M 2h40m
DRN-A-50 23.5M 3h50m
ResNet50 23.5M 5h7m
SeResNet50 28M 2h25m
SeResNext50 27.5M 4h25m
ResNext50 25M 5h33m
SeResNet101 49M 6h25m
SeNet154 27.5M 13h15m
ResNet101 44.5M 6h10m
ViT-B 88M 16h45m
Swin-B 88M 18h40m
Swin-T 28M 9h30m
2 BiGRU 20.5M 1h
MIL-VT 111M 3h50m
E-ViT 110.5M 4h

Appendix B. Backbone architectures

ResNet18, ResNet50 and ResNet101 (He et al., 2016):

One way to improve an architecture is to add more layers and make it deeper. Unfortunately, increasing the depth of a network cannot be accomplished by simply stacking layers together, because doing so introduces the vanishing gradient problem; moreover, performance may saturate or even degrade. The main idea behind ResNet is an identity shortcut connection that skips one or more layers. According to the authors, stacking layers should not decrease the performance of the network. The residual block gives the network identity mapping connections, which mitigate the vanishing gradient problem. The authors presented several versions of ResNet, including ResNet-18, ResNet-34, ResNet-50, and ResNet-101, where the number indicates how many layers exist within the architecture; more layers produce a deeper network and thus more trainable parameters.
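A minimal sketch of a residual (basic) block is shown below, assuming the stride-1, equal-channel case; the projection shortcut used when dimensions change is omitted.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Illustrative ResNet basic block: two 3x3 convolutions plus an identity
    shortcut, so the block learns a residual F(x) and outputs F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut skips the two conv layers
```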

ResNext50 (Xie et al., 2017):

In ResNext50, the authors introduced a new dimension, C, called cardinality, which controls the size of the set of transformations in addition to the dimensions of depth and width. The authors argue that increasing cardinality is more effective than going deeper or wider. This architecture took 2nd place in the ILSVRC 2016 classification competition. With a number of parameters similar to ResNet50, ResNext50 boosts performance and achieves accuracy nearly equivalent to that of ResNet101, even though ResNet101 is much deeper.
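In practice, cardinality is commonly realized with grouped convolutions; the minimal sketch below (with illustrative channel counts) splits a 3×3 convolution into 32 parallel transformation groups.

```python
import torch.nn as nn

# Cardinality C = 32: the 3x3 convolution is split into 32 parallel groups of
# transformations, keeping the parameter count well below a dense 3x3 convolution.
grouped_conv = nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3,
                         padding=1, groups=32, bias=False)
```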

DenseNet121 (Huang et al., 2017):

Increasing the depth of a network generally improves performance. However, when the network becomes too deep, the path between input and output becomes very long, which again introduces the well-known vanishing gradient problem. DenseNets redesign the connectivity pattern of the network so that maximum information flows between layers. The main idea is to connect every layer directly with every other layer in a feed-forward fashion: for each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs to all subsequent layers. According to the paper, the advantages of DenseNet are that it alleviates the vanishing gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.
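The sketch below illustrates the dense connectivity pattern with a single layer whose output is concatenated onto its input; the growth rate and layer composition are simplified assumptions.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One DenseNet-style layer: its output feature-maps are concatenated onto
    its input, so every subsequent layer sees all preceding feature-maps."""
    def __init__(self, in_channels: int, growth_rate: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)  # dense connectivity via concatenation
```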

Xception (Chollet, 2017):

The Xception network architecture was built on top of Inception-v3 and is also known as an extreme version of the Inception module; with a modified depthwise separable convolution, it is superior to Inception-v3. The original depthwise separable convolution performs a depthwise convolution first, followed by a pointwise convolution, where the depthwise convolution is a channel-wise spatial convolution and the pointwise convolution is a 1×1 convolution that changes the dimension. Xception modifies this strategy: its depthwise separable convolution performs the 1×1 pointwise convolution first and then the channel-wise spatial convolution. Moreover, Xception and Inception-v3 have the same number of parameters. The Xception architecture slightly outperforms Inception-v3 on the ImageNet dataset and significantly outperforms it on a larger image classification dataset comprising 350 million images and 17,000 classes.
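Below is a minimal sketch of the modified depthwise separable convolution in the order described above (pointwise first, then channel-wise spatial convolution); the module name and channel handling are illustrative.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Xception-style modified depthwise separable convolution:
    a 1x1 pointwise convolution first, followed by a channel-wise (depthwise)
    3x3 spatial convolution."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.depthwise = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1,
                                   groups=out_channels, bias=False)  # one filter per channel

    def forward(self, x):
        return self.depthwise(self.pointwise(x))
```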

DRN-A-50 (Yu et al., 2017):

Typically, in an image classification task, a convolutional neural network progressively reduces resolution until the image is represented by tiny feature-maps in which the spatial structure of the scene is no longer discernible. This loss of spatial structure can hamper image classification accuracy and complicate the transfer of the model to downstream tasks. This architecture introduces dilation, which increases the resolution of the feature-maps without reducing the receptive field of individual neurons. Dilated residual networks (DRNs) can outperform their non-dilated counterparts in image classification, and this strategy does not increase the model's depth or complexity; the number of parameters remains constant relative to the non-dilated counterparts.
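As a minimal illustration, a dilated 3×3 convolution covers the receptive field of a 5×5 convolution while keeping the parameter count and feature-map resolution unchanged (the channel counts below are arbitrary).

```python
import torch.nn as nn

# Dilation rate 2: the 3x3 kernel samples inputs two pixels apart, enlarging the
# receptive field without downsampling the feature map or adding parameters.
dilated_conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3,
                         padding=2, dilation=2, bias=False)
```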

SeNet154 (Hu et al., 2018a):

The convolution operator enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. This work focused on channel-wise relationships and proposed a novel architectural unit called the Squeeze-and-Excitation (SE) block, which adaptively re-calibrates channel-wise feature responses by explicitly modelling inter-dependencies between channels. These blocks can be stacked together to form a network architecture (SeNet154) that generalizes effectively across different datasets. SeNet154 was among the models behind the 1st-place entry in the ILSVRC 2017 image classification challenge.
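A minimal sketch of the SE block follows: squeeze by global average pooling, excitation by a small bottleneck MLP, then channel-wise rescaling. The reduction ratio of 16 follows the paper's default; the class name is ours.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool each channel ("squeeze"),
    pass through a bottleneck MLP ("excitation"), and rescale the channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                   # squeeze: (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                             # re-calibrate channel responses
```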

SeResNet50, SeResNet101, SeResNext50 and SeXception (Hu et al., 2018a):

The structure of the Squeeze-and-Excitation (SE) block is very simple and can be added to any state-of-the-art architecture by replacing components with their SE counterparts. SE blocks are computationally lightweight and impose only a slight increase in model complexity and computational burden. Therefore, SE blocks were added to the ResNet50 and ResNext50 models when designing the new versions. Pre-trained weights already exist for SeResNet50 and SeResNext50, whereas SeXception weights are not publicly available. By adding SE blocks, we created the SeXception architecture and trained it on the ImageNet dataset to obtain pre-trained weights, which we subsequently used in our transfer learning schemes.

Appendix C. Self-supervised methods

InsDis (Wu et al., 2018):

InsDis trains a non-parametric classifier to distinguish between individual instance classes based on noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010), where each image instance functions as its own distinct class. InsDis also introduces a feature memory bank to maintain a large number of noise samples (i.e., negative samples), thus avoiding exhaustive feature computation.

MoCo-v1 (He et al., 2020), MoCo-v2 (Chen et al., 2020a) and MoCo-v3 (Chen et al., 2021):

MoCo-v1 uses data augmentation to create two views of the same image X, referred to as positive samples. Similar to InsDis, images other than X are defined as negative samples and are stored in a memory bank. To keep negative samples consistent as they evolve during training, a momentum encoder is introduced. The proposed method aims to increase the similarity between positive samples while decreasing the similarity between negative samples. MoCo-v2 works similarly but adds a non-linear projection head, a few more augmentations, a cosine learning rate decay schedule, and longer training. MoCo-v3 further adapts the framework to Vision Transformer backbones, replacing the memory bank with large batches.

SimCLR-v1 (Chen et al., 2020b) and SimCLR-v2 (Chen et al., 2020c):

The key idea of SimCLR-v1 is similar to that of MoCo, yet the two were proposed independently. SimCLR-v1 is trained in an end-to-end fashion with large batch sizes instead of using a special network component (a momentum encoder) or a memory bank; within each batch, the negative samples are generated on the fly. SimCLR-v2 improves on the previous version by increasing the capacity of the projection head and incorporating the memory mechanism from MoCo to provide more meaningful negative samples.
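The sketch below shows an in-batch contrastive (NT-Xent-style) loss of the kind SimCLR optimizes, where the other augmented view is the positive and the remaining samples in the batch serve as negatives; the temperature value and normalization details are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """In-batch contrastive loss: for each embedding, its paired view is the
    positive and every other embedding in the batch is a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                      # (2B, D)
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)                # positives are the paired views
```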

BYOL (Grill et al., 2020):

MoCo and SimCLR rely mainly on a large number of negative samples and thus require either a large memory bank or a large batch size. In contrast, BYOL dispenses with negative pairs by using an online encoder, a target encoder, and a predictor appended to the projector of the online encoder. Both the target encoder and the online encoder compute features, and the key idea is to maximize the agreement between the target encoder's features and the prediction from the online encoder. To prevent representation collapse, the target encoder is updated by a momentum mechanism rather than by gradient descent.
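A minimal sketch of the momentum (exponential moving average) update that keeps the target encoder a slowly moving copy of the online encoder is given below; the momentum value follows the paper's base default, and the function name is ours.

```python
import torch

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, m: float = 0.996):
    """Target parameters track the online parameters as an exponential moving average."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(m).add_(p_online, alpha=1 - m)
```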

PIRL (Misra and Maaten, 2020):

Both InsDis and MoCo take advantage of instance discrimination. PIRL instead adopts Jigsaw and Rotation as pretext tasks: positive samples are generated by applying Jigsaw shuffling or by rotating the image by {0°, 90°, 180°, 270°}. Following InsDis, PIRL uses noise-contrastive estimation (NCE) as the loss function together with a memory bank.

DeepCluster-v2 (Caron et al., 2021a):

DeepCluster (Caron et al., 2018) learns features in two phases. First, in a self-labeling phase, pseudo labels are generated by clustering data points using the current representation, yielding a cluster index for each sample. Second, in a feature-learning phase, each sample's cluster index is used as a classification target to train the model. These two phases are repeated until the model converges. DeepCluster-v2 instead minimizes the distance between each sample and the corresponding cluster centroid, and additionally uses stronger data augmentation, an MLP projection head, a cosine decay schedule, and multi-cropping to improve representation learning.

SeLa-v2 (Caron et al., 2021a):

SeLa also requires two-phase training (i.e., self-labeling and feature-learning). SeLa casts self-labeling as an optimal transport problem and solves it using the Sinkhorn-Knopp algorithm. Like DeepCluster-v2, SeLa-v2 uses stronger data augmentation, an MLP projection head, a cosine decay schedule, and multi-cropping to improve representation learning.

PCL-v1 and PCL-v2 (Asano et al., 2020):

PCL-v1 aims to bridge contrastive learning with clustering. It adopts the same architecture as MoCo, including an online encoder and a momentum encoder, and, following clustering-based feature learning, also uses two phases (self-labeling and feature-learning). The features obtained from the momentum encoder are clustered in the self-labeling phase. Instead of classifying the cluster index with a regular cross-entropy loss, PCL-v1 generalizes the NCE loss to a ProtoNCE loss; PCL-v2 builds on this as an improvement step.

SwAV (Caron et al., 2021a):

SwAV combines contrastive learning with clustering techniques. For each data sample, SwAV computes cluster assignments (codes) with the help of the Sinkhorn-Knopp algorithm. Moreover, SwAV works online, performing assignments at the batch level instead of the epoch level.

InfoMin (Tian et al., 2020):

InfoMin suggests that, for contrastive learning, the optimal views depend on the downstream task: the mutual information between the views should be minimized while preserving the task-specific information.

Barlow Twins (Zbontar et al., 2021b):

Barlow Twins consists of two identical networks fed with two distorted versions of the input sample. The networks are trained such that the cross-correlation matrix between the two resulting embedding vectors is as close as possible to the identity matrix. A regularization term is also included in the objective function to minimize redundancy between the components of the embedding vectors.
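A compact sketch of this objective under the above definitions (batch-normalized embeddings, identity-targeted cross-correlation, off-diagonal redundancy penalty) is given below; the weighting term follows the paper's reported value, while the function name is ours.

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Push the cross-correlation matrix of the two embeddings toward the identity:
    diagonal terms encourage invariance, off-diagonal terms reduce redundancy."""
    batch_size = z1.size(0)
    z1 = (z1 - z1.mean(0)) / z1.std(0)                  # normalize along the batch dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / batch_size                      # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()      # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction term
    return on_diag + lam * off_diag
```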

SimSiam (Chen and He, 2021):

SimSiam uses a simple Siamese network to learn meaningful representations. Unlike standard self-supervised methods, SimSiam does not use (i) negative sample pairs, (ii) large batches, or (iii) momentum encoders.

DINO (Caron et al., 2021b):

DINO can be interpreted as knowledge distillation with no labels. The transformed inputs are passed through both a teacher and a student model, and the network is trained to minimize the difference between the teacher and student outputs using a cross-entropy loss.

OBoW (Gidaris et al., 2021):

OBoW is a teacher-student learning paradigm that uses self-supervised training to learn bag-of-visual-words (BoW) representations from images. The teacher model generates BoW targets, and the student model learns the BoW representation for a perturbed version of the input image.

CLSA (Wang and Qi, 2021):

The authors of CLSA propose a strong augmentation strategy that combines 14 types of augmentations: ShearX/Y, TranslateX/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Contrast, Color, Brightness, and Sharpness. Instead of directly applying strong augmentation during training, CLSA minimizes the distribution divergence between strongly and weakly augmented images.

Footnotes

CRediT authorship contribution statement

Nahid Ul Islam: Methodology, Formal analysis, Investigation, Writing – original draft, Writing – review & editing, Visualization. Zongwei Zhou: Investigation, Writing – review & editing. Shiv Gehlot: Investigation, Writing – original draft, Writing – review & editing. Michael B. Gotway: Resources, Data curation, Writing – review & editing, Funding acquisition. Jianming Liang: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Writing – original draft, Writing – review & editing, Supervision, Project administration, Funding acquisition.

1

SeXception was trained on ImageNet from scratch; the pre-trained weights for the other models are from PyTorch.

Data availability

The GitHub link is included in the paper.

References

  1. Asano YM, Rupprecht C, Vedaldi A, 2020. Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371 [Google Scholar]
  2. Ben-Cohen A, Diamant I, Klang E, Amitai M, Greenspan H, 2016. Fully convolutional network for liver segmentation and lesions detection. In: Deep Learning and Data Labeling for Medical Applications. Springer, pp. 77–85. [Google Scholar]
  3. Bi J, Liang J, 2007. Multiple instance learning of pulmonary embolism detection with geodesic distance along vascular structure. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 1–8. [Google Scholar]
  4. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. , 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 [Google Scholar]
  5. Caron M, Bojanowski P, Joulin A, Douze M, 2018. Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 132–149. [Google Scholar]
  6. Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A, 2021a. Unsupervised learning of visual features by contrasting cluster assignments. arXiv: 2006.09882 [Google Scholar]
  7. Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, Joulin A, 2021b. Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660. [Google Scholar]
  8. Chen X, Fan H, Girshick R, He K, 2020a. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 [Google Scholar]
  9. Chen X, He K, 2021. Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15750–15758. [Google Scholar]
  10. Chen T, Kornblith S, Norouzi M, Hinton G, 2020b. A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. PMLR, pp. 1597–1607. [Google Scholar]
  11. Chen T, Kornblith S, Swersky K, Norouzi M, Hinton G, 2020c. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029 [Google Scholar]
  12. Chen X, Xie S, He K, 2021. An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9640–9649. [Google Scholar]
  13. Cho K, Van Merriënboer B, Bahdanau D, Bengio Y, 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv: 1409.1259 [Google Scholar]
  14. Chollet F, 2017. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1251–1258. [Google Scholar]
  15. Colak E, Kitamura FC, Hobbs SB, Wu CC, Lungren MP, Prevedello LM, Kalpathy-Cramer J, Ball RL, Shih G, Stein A, et al. , 2021b. The RSNA pulmonary embolism CT dataset. Radiol. Artif. Intell 3 (2), e200254. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Deng S, Zhang X, Yan W, Eric I, Chang C, Fan Y, Lai M, Xu Y, 2020. Deep learning in digital pathology image analysis: A survey. Front. Med 1–18. [DOI] [PubMed] [Google Scholar]
  17. Devlin J, Chang MW, Lee K, Toutanova K, 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 [Google Scholar]
  18. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al. , 2020. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 [Google Scholar]
  19. Fairfield J, 1990. Toboggan contrast enhancement for contrast segmentation. In: [1990] Proceedings. 10th International Conference on Pattern Recognition, Vol. 1. IEEE, pp. 712–716. [Google Scholar]
  20. Feng R, Zhou Z, Gotway MB, Liang J, 2020. Parts2Whole: Self-supervised contrastive learning via reconstruction. In: Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning. Springer, pp. 85–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Gidaris S, Bursuc A, Puy G, Komodakis N, Cord M, Perez P, 2021. Obow: Online bag-of-visual-words generation for self-supervised learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6830–6840. [Google Scholar]
  22. Gidaris S, Singh P, Komodakis N, 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 [Google Scholar]
  23. Gildenblat J, contributors, 2021. PyTorch library for CAM methods. GitHub, https://github.com/jacobgil/pytorch-grad-cam. [Google Scholar]
  24. González G, Jimenez-Carretero D, Rodríguez-López S, Cano-Espinosa C, Cazorla M, Agarwal T, Agarwal V, Tajbakhsh N, Gotway MB, Liang J, et al. , 2020. Computer aided detection for pulmonary embolism challenge (cad-pe). arXiv preprint arXiv:2003.13440 [Google Scholar]
  25. Grill JB, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, Doersch C, Pires BA, Guo ZD, Azar MG, et al. , 2020. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 [Google Scholar]
  26. Guo Z, Islam NU, Gotway MB, Liang J, 2022. Discriminative, restorative, and adversarial learning: Stepwise incremental pretraining. In: Domain Adaptation and Representation Transfer: 4th MICCAI Workshop, DART 2022, Held in Conjunction with MICCAI 2022, Singapore, September 22, 2022, Proceedings. Springer, pp. 66–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Gutmann M, Hyvärinen A, 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, pp. 297–304. [Google Scholar]
  28. Haghighi F, Hosseinzadeh Taher MR, Zhou Z, Gotway MB, Liang J, 2020. Learning semantics-enriched representation via self-discovery, self-classification, and self-restoration. In: MICCAI 2020. Springer International Publishing, Cham, pp. 137–147, URL: https://link.springer.com/chapter/10.1007%2F978-3-030-59710-8_14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Haghighi F, Taher MRH, Zhou Z, Gotway MB, Liang J, 2021. Transferable visual words: Exploiting the semantics of anatomical patterns for self-supervised learning. arXiv:2102.10680 [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Han K, Xiao A, Wu E, Guo J, Xu C, Wang Y, 2021. Transformer in transformer. arXiv preprint arXiv:2103.00112 [Google Scholar]
  31. He K, Fan H, Wu Y, Xie S, Girshick R, 2020. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR. [Google Scholar]
  32. He K, Zhang X, Ren S, Sun J, 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778. [Google Scholar]
  33. Hosseinzadeh Taher MR, Haghighi F, Feng R, Gotway MB, Liang J, 2021. A systematic benchmarking analysis of transfer learning for medical image analysis. In: Domain Adaptation and Representation Transfer, and Affordable Healthcare and AI for Resource Diverse Global Health. Springer, pp. 3–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Hu D, Lu Q, Hong L, Hu H, Zhang Y, Li Z, Shen A, Feng J, 2021. How well self-supervised pre-training performs with streaming data? arXiv preprint arXiv:2104.12081 [Google Scholar]
  35. Hu J, Shen L, Sun G, 2018a. Squeeze-and-excitation networks. In: CVPR. pp. 7132–7141. [Google Scholar]
  36. Huang SC, Kothari T, Banerjee I, Chute C, Ball RL, Borus N, Huang A, Patel BN, Rajpurkar P, Irvin J, et al. , 2020a. PENet—a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. npj Digit. Med. 3 (1), 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Huang SC, Kothari T, Banerjee I, Chute C, Ball RL, Borus N, Huang A, Patel BN, Rajpurkar P, Irvin J, et al. , 2020b. PENet—a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. npj Digit. Med. 3 (1), 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ, 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708. [Google Scholar]
  39. Islam NU, Gehlot S, Zhou Z, Gotway MB, Liang J, 2021. Seeking an optimal approach for computer-aided pulmonary embolism detection. In: International Workshop on Machine Learning in Medical Imaging. Springer, pp. 692–702. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Jing L, Tian Y, 2020. Self-supervised visual feature learning with deep neural networks: A survey. TPAMI 1. 10.1109/TPAMI.2020.2992393. [DOI] [PubMed] [Google Scholar]
  41. Liang J, Bi J, 2007. Computer aided detection of pulmonary embolism with tobogganing and multiple instance classification in CT pulmonary angiography. In: Biennial International Conference on Information Processing in Medical Imaging. Springer, pp. 630–641. [DOI] [PubMed] [Google Scholar]
  42. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van Der Laak JA, Van Ginneken B, Sánchez CI, 2017. A survey on deep learning in medical image analysis. Med. Image Anal 42, 60–88. [DOI] [PubMed] [Google Scholar]
  43. Lucassen WA, Beenen LF, Büller HR, Erkens PM, Schaefer-Prokop CM, van den Berk IA, van Weert HC, 2013. Concerns in using multi-detector computed tomography for diagnosing pulmonary embolism in daily practice. A cross-sectional analysis using expert opinion as reference standard. Thromb. Res 131 (2), 145–149. [DOI] [PubMed] [Google Scholar]
  44. Ma D, Hosseinzadeh Taher MR, Pang J, Islam NU, Haghighi F, Gotway MB, Liang J, 2022. Benchmarking and boosting transformers for medical image classification. In: Domain Adaptation and Representation Transfer: 4th MICCAI Workshop, DART 2022, Held in Conjunction with MICCAI 2022, Singapore, September 22, 2022, Proceedings. Springer, pp. 12–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Masoudi M, Pourreza H-R, Saadatmand-Tarzjan M, Eftekhari N, Zargar FS, Rad MP, 2018. A new dataset of computed-tomography angiography images for computer-aided detection of pulmonary embolism. Sci. Data 5 (1), 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Masutani Y, MacMahon H, Doi K, 2002. Computerized detection of pulmonary embolism in spiral CT angiography based on volumetric image analysis. IEEE TMI 21 (12), 1517–1523. [DOI] [PubMed] [Google Scholar]
  47. Misra I, Maaten L.v.d., 2020. Self-supervised learning of pretext-invariant representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6707–6717. [Google Scholar]
  48. Noroozi M, Favaro P, 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision. Springer, pp. 69–84. [Google Scholar]
  49. Özkan H, Osman O, Şahin S, Boz AF, 2014. A novel method for pulmonary embolism detection in CTA images. Comput. Methods Programs Biomed 113 (3), 757–766. [DOI] [PubMed] [Google Scholar]
  50. Park SC, Chapman BE, Zheng B, 2010. A multistage approach to improve performance of computer-aided detection of pulmonary embolisms depicted on CT images: Preliminary investigation. IEEE Trans. Biomed. Eng 58 (6), 1519–1527. [DOI] [PubMed] [Google Scholar]
  51. Patil S, Henry JW, Rubenfire M, Stein PD, 1993. Neural network in the clinical diagnosis of acute pulmonary embolism. Chest 104 (6), 1685–1689. [DOI] [PubMed] [Google Scholar]
  52. Prasoon A, Petersen K, Igel C, Lauze F, Dam E, Nielsen M, 2013. Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 246–253. [DOI] [PubMed] [Google Scholar]
  53. Rajan D, Beymer D, Abedin S, Dehghan E, 2020. Pi-PE: A pipeline for pulmonary embolism detection using sparsely annotated 3D CT images. In: Machine Learning for Health Workshop. PMLR, pp. 220–232. [Google Scholar]
  54. Roth HR, Lu L, Liu J, Yao J, Seff A, Cherry K, Kim L, Summers RM, 2015. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE Trans. Med. Imaging 35 (5), 1170–1181. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Roth HR, Lu L, Seff A, Cherry KM, Hoffman J, Wang S, Liu J, Turkbey E, Summers RM, 2014. A new 2.5 D representation for lymph node detection using random sets of deep convolutional neural network observations. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 520–527. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Sanh V, Debut L, Chaumond J, Wolf T, 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 [Google Scholar]
  57. Scott J, Palmer E, 1993. Neural network analysis of ventilation-perfusion lung scans. Radiology 186 (3), 661–664. [DOI] [PubMed] [Google Scholar]
  58. Serpen G, Tekkedil D, Orra M, 2008. A knowledge-based artificial neural network classifier for pulmonary embolism diagnosis. Comput. Biol. Med 38 (2), 204–220. [DOI] [PubMed] [Google Scholar]
  59. Setio AAA, Ciompi F, Litjens G, Gerke P, Jacobs C, Van Riel SJ, Wille MMW, Naqibullah M, Sánchez CI, Van Ginneken B, 2016. Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging 35 (5), 1160–1169. [DOI] [PubMed] [Google Scholar]
  60. Shamshad F, Khan SH, Zamir SW, Khan MH, Hayat M, Khan FS, Fu H, 2022. Transformers in medical imaging: A survey. arXiv abs/2201.09873 [DOI] [PubMed] [Google Scholar]
  61. Shi L, Rajan D, Abedin S, Yellapragada MS, Beymer D, Dehghan E, 2020. Automatic diagnosis of pulmonary embolism using an attention-guided framework: A large-scale study. In: Medical Imaging with Deep Learning. PMLR, pp. 743–754. [Google Scholar]
  62. Stein PD, Fowler SE, Goodman LR, Gottschalk A, Hales CA, Hull RD, Leeper KV Jr., Popovich J Jr., Quinn DA, Sos TA, et al. , 2006. Multidetector computed tomography for acute pulmonary embolism. N. Engl. J. Med 354 (22), 2317–2327. [DOI] [PubMed] [Google Scholar]
  63. Suman S, Singh G, Sakla N, Gattu R, Green J, Phatak T, Samaras D, Prasanna P, 2021. Attention based CNN-LSTM network for pulmonary embolism prediction on chest computed tomography pulmonary angiograms. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 356–366. [Google Scholar]
  64. Sun C, Guo S, Zhang H, Li J, Chen M, Ma S, Jin L, Liu X, Li X, Qian X, 2017. Automatic segmentation of liver tumors from multiphase contrast-enhanced CT images based on FCNs. Artif. Intell. Med 83, 58–66. [DOI] [PubMed] [Google Scholar]
  65. Tajbakhsh N, Gotway MB, Liang J, 2015. Computer-aided pulmonary embolism detection using a novel vessel-aligned multi-planar image representation and convolutional neural networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 62–69. [Google Scholar]
  66. Tajbakhsh N, Roth H, Terzopoulos D, Liang J, 2021. Guest editorial annotation-efficient deep learning: The holy grail of medical imaging. IEEE Trans. Med. Imaging 40 (10), 2526–2533. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Tajbakhsh N, Shin JY, Gotway MB, Liang J, 2019. Computer-aided detection and visualization of pulmonary embolism using a novel, compact, and discriminative image representation. Med. Image Anal 58, 101541. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Tajbakhsh N, Shin JY, Gurudu SR, Hurst RT, Kendall CB, Gotway MB, Liang J, 2016. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans. Med. Imaging 35 (5), 1299–1312. [DOI] [PubMed] [Google Scholar]
  69. Tian Y, Sun C, Poole B, Krishnan D, Schmid C, Isola P, 2020. What makes for good views for contrastive learning? arXiv preprint arXiv:2005.10243 [Google Scholar]
  70. Tourassi GD, Floyd CE, Sostman HD, Coleman RE, 1995. Artificial neural network for diagnosis of acute pulmonary embolism: effect of case and observer selection. Radiology 194 (3), 889–893. [DOI] [PubMed] [Google Scholar]
  71. Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H, 2021. Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning. PMLR, pp. 10347–10357. [Google Scholar]
  72. U.S. Department of Health and Human Services Food and Drug Administration, 2008. The Surgeon General’s Call to Action to Prevent Deep Vein Thrombosis and Pulmonary Embolism. 10.1161/CIRCULATIONAHA.108.841403. [DOI] [PubMed] [Google Scholar]
  73. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I, 2017. Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008. [Google Scholar]
  74. Wang X, Qi GJ, 2021. Contrastive learning with stronger augmentations. arXiv preprint arXiv:2104.07713 [DOI] [PubMed] [Google Scholar]
  75. Wu Z, Xiong Y, Yu SX, Lin D, 2018. Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3733–3742. [Google Scholar]
  76. Xiao J, Bai Y, Yuille A, Zhou Z, 2023. Delving into masked autoencoders for multi-label thorax disease classification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3588–3600. [Google Scholar]
  77. Xie S, Girshick R, Dollár P, Tu Z, He K, 2017. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1492–1500. [Google Scholar]
  78. Yang X, Lin Y, Su J, Wang X, Li X, Lin J, Cheng K-T, 2019. A two-stage convolutional neural network for pulmonary embolism detection from CTPA images. IEEE Access 7, 84849–84857. [Google Scholar]
  79. Yu F, Koltun V, Funkhouser T, 2017. Dilated residual networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 472–480. [Google Scholar]
  80. Yu S, Ma K, Bi Q, Bian C, Ning M, He N, Li Y, Liu H, Zheng Y, 2021. Mil-vt: Multiple instance learning enhanced vision transformer for fundus image classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 45–54. [Google Scholar]
  81. Zbontar J, Jing L, Misra I, LeCun Y, Deny S, 2021. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230 [Google Scholar]
  82. Zbontar J, Jing L, Misra I, LeCun Y, Deny S, 2021b. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230 [Google Scholar]
  83. Zhou Z, 2021. Towards Annotation-Efficient Deep Learning for Computer-Aided Diagnosis (Ph.D. thesis) Arizona State University. [Google Scholar]
  84. Zhou C, Chan HP, Sahiner B, Hadjiiski LM, Chughtai A, Patel S, Wei J, Cascade PN, Kazerooni EA, 2009. Computer-aided detection of pulmonary embolism in computed tomographic pulmonary angiography (CTPA): Performance evaluation with independent data sets. Med. Phys 36 (8), 3385–3396. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Zhou Z, Shin JY, Gurudu SR, Gotway MB, Liang J, 2021a. Active, continual fine tuning of convolutional neural networks for reducing annotation efforts. Med. Image Anal 101997. [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. Zhou Z, Shin J, Zhang L, Gurudu S, Gotway M, Liang J, 2017a. Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. CVPR. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Zhou Z, Sodha V, Pang J, Gotway MB, Liang J, 2021b. Models genesis. Med. Image Anal 67, 101840. 10.1016/j.media.2020.101840, URL: http://www.sciencedirect.com/science/article/pii/S1361841520302048. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Zhou Z, Sodha V, Rahman Siddiquee MM, Feng R, Tajbakhsh N, Gotway MB, Liang J, 2019. Models genesis: Generic autodidactic models for 3D medical image analysis. In: MICCAI 2019. Springer International Publishing, Cham, pp. 384–393, URL: https://link.springer.com/chapter/10.1007/978-3-030-32251-9_42. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. Zhuang X, Li Y, Hu Y, Ma K, Yang Y, Zheng Y, 2019. Self-supervised feature learning for 3d medical images by playing a rubik’s cube. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part IV 22. Springer, pp. 420–428. [Google Scholar]
