research-article

A Pipeline for Integrated Theory and Data-Driven Modeling of Biomedical Data

Published: 25 August 2020

Abstract

Genome sequencing technologies have the potential to transform clinical decision making and biomedical research by enabling high-throughput measurements of the genome at a granular level. However, to truly understand mechanisms of disease and predict the effects of medical interventions, high-throughput data must be integrated with demographic, phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods must infer relationships between these data types. We recently proposed a pipeline (CausalMGM) to achieve this. CausalMGM uses probabilistic graphical models to infer the relationships between variables in the data; however, CausalMGM’s graphical structure learning algorithm can only handle small datasets efficiently. We propose a new methodology (piPref-Div) that selects the most informative variables for CausalMGM, enabling it to scale. We validate the efficacy of piPref-Div against other feature selection methods and demonstrate how the use of the full pipeline improves breast cancer outcome prediction and provides biologically interpretable views of gene expression data.

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 18, Issue 3
May-June 2021
420 pages

Publisher

IEEE Computer Society Press
Washington, DC, United States
