Introduction

The National Lung Screening Trial (NLST) demonstrated that annual screening with low-dose helical computed tomography (LDCT) compared to chest radiography is associated with a 20% relative reduction in lung cancer mortality among high-risk individuals1. However, LDCT screening can lead to overdiagnosis and overtreatment of slow growing, indolent cancers that may pose no threat if left untreated2,3. Prior post-hoc analyses of the NLST have estimated that 18–22.5% of screen-detected cancers would not become symptomatic in a patient’s lifetime and would remain as indolent lung cancer4. At present there is limited data regarding the potential harmful impact of overdiagnosis on lung cancer outcomes; however, studies have suggested overdiagnosis is associated with increased operative mortality, severe disability among survivors, and reduction in longer term disease-free survival attributed to loss of pulmonary reserve5. Though clinical guidelines provide recommendations for the management of screen-detected nodules, there are limited tools to discriminate between indolent and aggressive lung cancers diagnosed in the lung cancer screening setting6,7,8,9. As such, biomarkers that can classify behavior of screen-detected lung cancers is an unmet clinical need since prior studies have suggested that 10 to 27% of lung cancers are over-diagnosed in lung cancer screening10,11,12,13.

Quantitative image features, also known as radiomics14, are non-invasive biomarkers that are generated from medical imaging and reflect the underlying tumor pathophysiology and heterogeneity. Radiomics have many advantages over circulating and tissue-based biomarkers as they are rapidly calculated from standard-of-care imaging and they reflect the entire tumor burden and not just a sample of the tumor in the case of tissue-based biomarkers. Our group8,15,16,17 and others18,19,20 have utilized radiomics in the lung cancer screening setting to improve risk prediction and diagnostic discrimination. To date, there have been limited efforts to use radiomics to predict tumor behavior and patient outcomes in the lung cancer screening setting.

Using publicly available data and LDCT images from the NLST, we generated radiomic features from screen-detected, incidentally-diagnosed lung cancers. Radiomic features describing size, shape, volume, and textural characteristics were calculated from the intratumoral region (area within the tumor) and from the peritumoral region (area surrounding the tumor parenchyma). The goal of this study was to utilize these peritumoral and intratumoral radiomics to identify a reproducible parsimonious model that predicts survival outcomes among lung cancer patients diagnosed in the lung cancer screening setting.

Results

Patient characteristics

There were no statistically significant differences between training and test cohorts for age, sex, smoking status, number of pack-years smoked, family history (FH) of lung cancer, histology, treatment, stage, and baseline screening result (Table 1). Self-reported chronic obstructive pulmonary disease (COPD) was significantly higher among patients in the test cohort versus the training cohort (16% vs. 7%, P = 0.02). Using Student t test, we found no statistically significant difference in mean age between training and test cohort when stratifying then among male (P = 0.99) and among females (P = 0.73).

Table 1 Patient characteristics in the training and test cohorts.

Radiomic analyses

A total of 109 peritumoral features were extracted, of which 56 were found to be both stable and reproducible, and a total of 155 intratumoral features were extracted, of which 35 were stable and reproducible. Therefore, a total of 91 stable and reproducible radiomics features (peritumoral and intratumoral) were subjected to univariable analysis. In univariable analyses, 40 of the 91 radiomic features (26 peritumoral and 14 intratumoral) were significantly associated with OS in the training cohort (Supplemental Table 1) and 30 of the 40 features were eliminated because they were correlated. The 10 remaining features were reduced to four highly informative features using backward elimination (Supplemental Table 2). Among the four features, three were peritumoral (average co-occurrence joint entropy, NGTDM busyness, and average co-occurrence angular second moment) and one was intratumoral (statistical root mean square).

Classification and regression tree (CART) analysis

The four remaining radiomic features were subjected to Classification And Regression Tree (CART) analysis in the training cohort and based on 2 radiomic features (NGTDM Busyness and Statistical Root Mean Square). CART analysis classified patients into four risk groups: low-risk, intermediate-risk-1, intermediate-risk-2, and high-risk (Supplemental Fig. 1). The four risk groups were reduced to three risk groups by combining the two intermediate-risk groups (Fig. 1B) and 3 risk groups from the CART model was replicated in the test cohort.

Figure 1
figure 1

Identification of risk groups based on peritumoral and intratumoral features (A) Statistical analysis pipeline for radiomics feature selection. (B) The tree structure of the classification and regression tree analysis (CART) which identified four risk groups based on two radiomics features, in which we combined the two intermediate risk groups.

In training cohort, the high-risk group (Fig. 2A) was associated with extremely poor Overall Survival (OS) (Hazard Ratio (HR) = 14.67; 10% 2.5-year OS and 0% 5-year OS, log-rank P < 0.0001) versus the intermediate (HR = 3.25; 63% 2.5-year OS and 41% 5-year OS) and low-risk group (HR = 1.00; 89% 2.5-year OS and 78% 5-year OS). In the test cohort, the high-risk group was associated with extremely poor OS (HR = 3.35; 50% 2.5-year OS and 0% 5-year OS, log-rank P = 0.043) versus the low-risk group (HR = 1.00, 68% 2.5 year OS and 51% 5-year OS) (Fig. 2A). Similar findings were observed for Progression Free Survival (PFS) (Fig. 2B).

Figure 2
figure 2

Risk-groups associated with overall survival for training and test cohorts and among early stage patients. Across the training and test cohort as well as in early-stage the high-risk group had a significantly worse outcome in OS (A) and PFS (B).

Patient characteristics of the three risk groups in the total data set (N = 234)

There were no statistically significant differences between the three risk groups by age, smoking status, number of pack-years smoked, self-reported COPD, and family history of lung cancer, histological subtypes, and treatment (Table 2). However, there were statistically significant differences across the risk groups for sex (P = 0.04) and stage of disease (P = 0.001). Specifically, 92% of the patients in the high-risk group in were male vs. 54% in the low-risk group (P = 0.04) (Table 2). In term of lung cancer stage, 33% of the patients in the high-risk group had lung cancer early-stage vs. 80% in the low-risk group (P = 0.001).

Table 2 Patient characteristics of the three risk groups in the total data set (N = 234).

Survival analyses among early stage patients

Among all early-stage patients, the high-risk group was associated with a significantly decreased OS (HR = 9.91; 25% 2.5-year and 0% 5-year OS, log-rank P < 0.0001) versus the low-risk group (HR = 1.00; 93% 2.5-year and 78% 5-year OS) (Fig. 2A). Similar results were found for PFS (Fig. 2B).

Validation dataset

Using non-screen detected we attempted to replicate the risk groups obtained from the CART model. Among early-stage adenocarcinoma lung cancers (Fig. 3A), the high-risk group was associated with worse OS (HR = 2.63; 56% 2.5-year and 42% 5-year OS, log-rank P = 0.112) compared to the low-risk group (HR = 1.00; 75% 2.5-year and 75% 5-year OS). Among late-stage patients (Fig. 3B), the risk groups were not associated with survival (log-rank P = 0.432).

Figure 3
figure 3

Overall survival for the risk patient risk groups among non-screen detected adenocarcinoma lung cancers (A) and for early-stage (B) for late-stage patients.

Multivariable analyses

Multivariable Cox regression models were used to adjust for potential confounding factors including sex, treatment, and stage. In the training cohort, the high-risk group was associated with an elevated hazard ratio (OS: HR = 9.71; 95% Confidence Interval: [3.85, 24.48] and PFS: HR = 5.68; 95% Confidence Interval: [2.32, 13.93] ) when compared to intermediate and low-risk groups (Tables 3 and 4). In the test cohort, the high-risk group yielded an increased hazard ratio in PFS when compared to intermediate and low-risk groups (PFS: HR = 2.02; 95% Confidence Interval: [0.34, 11.99]). Among all patients, the high-risk group was associated with a significantly elevated hazard ratio (OS: HR = 5.16; 95% Confidence Interval: [2.34, 11.37] and PFS: HR = 4.02; 95% Confidence Interval: [1.89, 8.56]) when compared to the intermediate and low-risk group. For OS and PFS analysis, smoking status, sex, nor baseline screening were associated with our model.

Table 3 Multivariable Cox proportional hazards models for overall survival in the training and test cohorts.
Table 4 Multivariable Cox proportional hazards models for progression free survival in the training and test cohorts.

Performance metrics: Harrell’s c index and areas under the curve (AUROC)

The discrimination performance of the multivariable model was estimated using the Harrell’s C index. The multivariable model showed a better discrimination capability with a higher C indices in the training and test cohort when analyzing OS (C-index: 0.83 and 0.81, training and test cohort respectively) compared to PFS (C-index: 0.81 and 0.80, training and test cohort respectively). Using the multivariable models, time-dependent areas under the curve (AUROC) for OS were generated and produced an AUROC of 0.878 at 2-years and 0.800 for 4-years. Among early-stage patients, the AUROC was 0.702 for 2-years and 0.669 and 4-years (Fig. 4). Similar results were found for PFS (Supplemental Fig. 3).

Figure 4
figure 4

Time-dependent AUC plot of the multivariable model for overall survival for (A) all patients and (B) early-stage patients.

Radiogenomics analysis

From the radiogenomic analyses, FOXF2 and LOC285043 were found to be significantly associated with RMS feature in both the pair wise analysis and correlation analysis (Supplementary Table 3). Stage was not significantly associated with the two most informative radiomics features (P = 0.6828 for RMS and P = 0.7905 for NGTDM Busyness). From the correlation analyses, four genes were positively significantly correlated with the RMS radiomic feature FOXF2, TBX4, LOC285043 and TM4SF18. Among these four genes, FOXF2 had the highest correlation (r = 0.45) with RMS while the other genes were correlated at r = 0.44 (Fig. 5A). In the pair-wises analyses where RMS was dichotomized at the median value, six genes were significantly different the RMS high and RMS low. The mean value of FOXF2 was significantly higher for RMS high vs. RMS low (mean = 7.1050 [SD = 1.0664] vs. 7.9798 [0.7054], p < 0.001) (Fig. 5B). When we generated the risk groups in the radiogenomics dataset, FOXF2 expression was significantly lower for the intermediate-risk group vs. low-risk group (mean = 7.2408 [SD = 1.0755] vs. 8.0287 [0.7524], p < 0.001). Among the three risk groups, the intermediate-risk had the lower FOXF2 expression when compared to the high-risk and low-risk group. Although FOXF2 expression was lower for the high-risk group vs. low-risk group (mean = 7.6819 [SD = 0.5879] vs. 8.0287[0.7521]), the difference was not statistically significant (p = 0.539).

Figure 5
figure 5

The association between radiomics gene expression. (A) Correlation between intratumoral RMS radiomic feature and FOXF2. (B) FOXF2 expression by dichotomizing RMS at the median and FOXF2 expression by the three patient risk groups. (C) RABGAP1L expression by dichotomizing NGTDM busyness at the median and NGTDM busyness expression by the three patient risk groups.

No genes were significantly correlated with NGTDM busyness. In the pairwise analyses where NGTDM busyness was dichotomized at the median, three genes were significantly different between NGTDM busyness high vs. NGTDM busyness low: RABGAP1L, LOC101928674, and LDLRAD4-AS1. RABGAP1L expression was significantly higher NGTDM busyness low vs NGTDM busyness high (mean = 6.7612 [SD = 0.8313] vs. 6.1066 [0.8200], P < 0.001). For the risk groups, RABGAP1L expression was higher for the intermediate-risk group vs. low-risk group (mean = 6.5938 [SD = 0.8866] vs. 6.2120 [0.8981], P = 0.100). RABGAP1 expression was higher for the high-risk group vs. low-risk group (mean = 6.2534 [SD = 0.6919] vs. 6.2120 [0.8981]), the difference was not statistically significant (P = 0.990). Stage was not significantly associated with the two most informative radiomics features (P = 0.6828 for RMS and P = 0.7905 for NGTDM Busyness) nor with the two most informative genes (P = 0.7767 for FOXF2 and P = 0.7928 for RABGAP1L).

Discussion

Predictive biomarkers that identify aggressive cancers from those that are either indolent, or at lower-risk of poor survival outcomes, are a critical unmet need in the lung cancer screening setting. In this study, we utilized peritumoral and intratumoral radiomic features to generate a model that is able to detect vulnerable group of screen-detected early stage lung cancer patients that have high-risk of experiencing poor survival outcomes. Specifically, we identified a model that contained two radiomics features, one peritumoral and one intratumoral, which stratified patients into three risk-groups: low-risk, intermediate-risk, and high-risk. The model identified a vulnerable group early-stage patients with worse OS (HR = 9.91; 25% 2.5-year and 0% 5-year OS) versus the low-risk group (HR = 1.00; 93% 2.5-year and 78% 5-year OS). The final model was validated in the test cohort and further replicated in a cohort of non-screen detected adenocarcinoma patients. Because disease stage was significantly different across the risk groups, we stratified the model by stage and found compelling results among early stage patients, which typically have very good survival outcomes. Among early stage patients, the high-risk group was associated with a worse OS (HR = 2.63; 56% 2.5-year and 42% 5-year OS) compared to the low-risk group (HR = 1.00; 75% 2.5-year and 75% 5-year OS).

Radiomics is a non-invasive approach that utilizes standard-of-care imaging to generate quantitative image features that can be used for risk prediction, diagnostic discrimination, prognostication, and to predict treatment response14,15,21,22,23. Prior studies have shown that peritumoral features, extracted from the area surrounding the tumor parenchyma, and intratumoral features, extracted from the area within the tumor, have prognostic and predictive utility in cancers such as lung, breast, brain, gastric, and head and neck1,24,25,26,27,28,29. For example, Dong et al. (29) developed an individualized nomogram using radiomic features from primary tumor and from the peritoneum to identify occult peritoneal metastasis among patients with advanced gastric cancer29. Using peritumoral features from contrast-enhanced magnetic resonance imaging, Braman et al. found that CoL1AGe entropy was associated with pathological complete response among breast cancer patients who are Her2 negative26. In, Xu et al. identified a radiomic score using features from the peritumoral region of hepatocellular carcinoma tumors that predicts microvascular involvement30. Cumulative, the evidence of our study and others have demonstrate the utility of using peritumoral features alone or in combination with intratumoral features.

In this study, we identified a highly informative peritumoral feature (NGTDM busyness) and a highly informative intratumoral feature (statistical RMS). NGTDM, a texture feature, captures intensity values of a neighborhood of pixels to characterize the difference between a center voxel within the neighborhood31,32. NGTDM parameters are coarseness, contrast, and busyness. Coarseness describes the granularity of an image, contrast relates to the dynamic range of intensity, and busyness relate to the rate of intensity change within an image32. NGTDM busyness has a high predictive power in differentiating between glioblastoma and primary central nervous system lymphoma33. Studies found that NGTDM busyness extracted from positron emission tomography (PET) is useful for discriminating benign from malignant solid pulmonary nodules34. Intensity-based features are derived from image histograms which represent the intensity distribution of the image. Parmar et al. 2015 showed a correlation between intensity feature and patient survival in lung and head and neck cancer. Statistical RMS, an intensity feature, is a first-order statistic that calculates the root mean square of the voxel’s intensity value31,35. A previous study showed that statistical RMS was able to predict pathological response after chemoradiation in non-small cell lung cancer (NSCLC), by identifying gross residual response35. Statistical RMS combined with other intensity statistical features was able to distinguish bladder tumor tissue from other normal tissues in fluorodeoxyglucose-positron emission tomography (FDG-PET) scan36.

The radiogenomics analyses revealed that the most informative intratumoral radiomic feature, RMS, was significantly associated with expression of FOXF2 and LOC285043 which is an uncharacterized gene37. We observed a trend for lower expression of FOXF2 in intermediate and high-risk groups versus the low-risk group. However, this trend in high and low-risk group was not statistically significant. FOXF2 is expressed in the lung and functions as an activator or inhibitor of gene transcription38 and up-regulation of FOXF2 expression induces EMT, migration, invasion and metastasis in breast cancer39. To support our findings that low expression of FOXF2 is a negative prognostic factor, a prior study demonstrated that patients with stage I NSCLC who had low FOXF2 expression had significantly shorter overall survival compared to patients with high FOXF2 expression40. RABGAP1L was found to be significantly associated with NGTDM Busyness in the pairwise analyses and we revealed that the intermediate and high-risk groups had higher expression of RABGAP1L versus the low-risk group. At present, there is no known role for RABGAP1L in lung cancer. However, RABGAP1L has been shown to deregulate the tyrosine-kinase signaling pathway in acute myeloid leukemia41 and regulates the activity of GTPases which is essential to transport cell adhesion proteins and migrating cells42.

We acknowledge some limitations of this study. First, the sample size is somewhat modest because we utilized lung cancer cases with specific inclusion/exclusion criteria from the NLST. Then when we split the available number of cases into training and test cohorts based on a 70:30 ratio; the resulting sample sizes likely attributed to the poorly calibrated model based on its ability to predict 5-years survival outcomes in the training and test cohorts. However, we applied a rigorous feature reduction approach to eliminate correlated and non-reproducible features and utilized a backward reduction approach to identify a parsimonious model containing the most important features to reduce false positive findings. Although the overall population of lung cancer patients were treated heterogeneously; however, among the early stage patients, 92.74% of the patients had surgery as their only treatment. We also recognize that there were limited numbers of patients in the high-risk groups for these cohorts. Finally, our validation cohort was limited to patients with lung adenocarcinoma. Additional research is needed to validate the biological underpinnings of these features.

The results from our analyses produced a parsimonious radiomic model that identified a vulnerable subset of screen-detected lung cancers that are associated with poor outcome. These findings could support more aggressive treatment and follow-up for such high-risk patients. Nonetheless, additional research will be needed to inform the potential translational implications of these findings, to fully elucidate the biology these high-risk screen-detected tumors, to assess whether these findings are consistent across screening trials and cohorts, and how best to personalize cancer management in these vulnerable patients.

Methods

NLST study population

Deidentified data and LDCT images were accessed from the National Cancer Institute (NCI) Cancer Data Access System (CDAS)43. The NLST study design and main findings have been described previously1,28. NLST eligibility criteria included current and former smokers aged 55‐74 years with a minimum 30 pack‐years smoking history and former smokers had to have quit within the past 15 years.

Based on the schema described in Schabath et al.44, this analysis considered 321 NLST participants who had a negative or positive baseline screening (T0) result and were diagnosed with a screen-detected, incidental lung cancer on follow-up screening intervals 12 (T1) or 24 (T2) months after T0. Positive screens were defined as abnormalities on baseline screens or at follow-up screens that were new, stable or evolved. Negative screens were defined as a computerized tomography (CT) scan with no abnormalities, minor abnormalities, or significant abnormalities not suspicious for lung cancer. Among the 321 lung cancer patients, 196 had a baseline positive screen that was not diagnosed as lung cancer but evolved and diagnosed as lung cancer at T1 or T2 screening intervals. The remaining 125 lung cancer patients had a baseline negative screen result but developed a nodule that was diagnosed as lung cancer at either T1 or T2. Lung cancer patients who had multiple nodules at time of their diagnosis were excluded (N = 58) since we are unable to verify which nodule(s) were cancer. Complications with segmentation, such as a nodule attached to lung wall, led to 29 patients being excluded. The final dataset of 234 screen-detected lung cancers were randomly split into a training cohort (N = 161) and a test cohort (N = 73).

Non-screen detected lung cancer validation dataset

The radiomics data were further validated for OS in a prior published dataset was comprised of 62 adenocarcinoma patients who underwent surgical resection as first course therapy at the Moffitt Cancer Center and had pre-surgery CTs within 2 months prior surgery45.

Radiogenomics dataset

A previously described dataset46 of surgically resected adenocarcinoma lung cancers (N = 103) who had pre-surgery CTs and gene expression data was used to identify potential biological underpinnings of the final two informative radiomic features (RMS and NGTDM Busyness). The gene expression data were IRON-normalized and batch-corrected for RNA quality Pathway and Gene Ontology Enrichment using Clarivate Analytics MetaCore46.

Radiomics

Nodule identification and tumor segmentation is described elsewhere15. The tumor mask images (i.e., tumor delineations) were imported into in-house radiomic feature extraction toolboxes created in MATLAB® 2015b (The Mathworks Inc., Natick, Massachusetts) and C+ + (https://isocpp.org). Using cubic interpolation, the images were resampled to a single voxel spacing of 1 mm × 1 mm × 1 mm to standardize spacing across all images. Hounsfield units (HU) in all images were resampled into fixed bin sizes of 25 HUs discretized from –1,000 to 1,000 HU.

Using standardized radiomic algorithms from the Image Biomarker Standardization Initiative (IBSI) v547, a total of 264 radiomic features were extracted from semi-automatic segmented intratumoral region (n = 155) and from the peritumoral region (n = 109) 3 mm outside of tumor boundary. The peritumoral regions were generated as an extension of the tumor segmentations using morphological image processing operations as previously mentioned48. Peritumoral regions were bounded by a lung parenchyma mask to exclude the region of interest (ROI) outside of the lung parenchyma. Shape and size based peritumoral features were excluded as they explicitly describe (i.e., correlate) the intratumoral ROI. Only the most stable and reproducible intratumoral and peritumoral radiomic features that were previously found on another study of our group 48 were utilized for analyses to create repeatable models. Further details of the feature selection process are presented in the statistical analysis section below (Fig. 1A).

Statistical analysis

Statistical analyses were performed using Stata/MP 14.2 (StataCorp LP, College Station TX), R Project for Statistical Computing (version 3.5.2), and R Studio (version 1.1.463). Fisher’s exact test was used to test the difference between the training and test cohorts for categorical variables and the Student’s t-test was used to test the difference between the training and test cohorts for continuous variables.

Overall survival (OS) and progression-free survival (PFS) were the main end-points for the analysis and were assessed from date of lung cancer diagnosis to the date of an event or last follow up. For OS, an event was defined as death and for PFS an event was established as death or progression of cancer. All survival data were right censored at 5-years. To generate a parsimonious radiomics model, we first performed univariable analyses using Cox Proportional Hazard regression and retained features with a P value < 0.10. Among the remaining features after univariable analyses, we removed features that were correlated based on Pearson’s correlation coefficient > 0.80. If two or more features were correlated based on an absolute Pearson’s correlation coefficient > 0.80, the feature with the smaller p-value from the univariable analyses was retained. The remaining radiomic features were subjected to backward elimination approach using a pre-specified more stringent P value < 0.01 for inclusion. Among the remaining covariates, Classification And Regression Tree analysis (CART) was used to stratify patients into risk groups. CART is a nonparametric data-mining tool that can identify hierarchical interactions and segment covariates into novel and meaningful terminal subgroups (i.e., nodes). The hazard ratios from the risk groups were generated using Cox Proportional Hazard regression. The risk groups were also analyzed by Kaplan Meier curves and log-rank tests. The risk groups based on the most informative radiomic features in the training cohort were validated in the test cohort and further replicated in the adenocarcinoma cohort. The Harrell’s concordance index (C-index) was used to evaluate the multivariable model. Time-dependent area under the receding operating curve (AUROC) analyses was used to assess accuracy of the Cox regression models at different time-points using R packages survival49, survminer50, and survivalROC51.

Using the radiogenomics dataset, analyses were conducted to determine if the two final radiomic features (RMS and NGTDM Busyness) were associated with the gene probesets using two different approaches: correlation and two-group analysis. For the correlation analysis, gene probesets were filtered and determined as statistically significant using the following criteria: Pearson’s correlation with a threshold |R|> 0.4, an expression filter with max expression of gene > 5, and an inter-quartile filter (IQR > log2 (1.2 FC)). For the two-group analyses, gene probesets were filtered and determined as significant using the following criteria based on a Student’s t test p < 0.001 and mean log fold-change between high and low prognostic radiomic feature oflfc > log2 (1.4 FC). The significant probesets from the two-group analyses were intersected yielding a final list of probesets significantly associated with the most informative radiomic feature. ANOVA and Tukey pairwise mean comparison was performed to analyzed gene expression across the risk groups.