research-article

A Pipeline for Integrated Theory and Data-Driven Modeling of Biomedical Data

Authors: Vineet K. Raghu, Daniel J. Shirer, Panayiotis V. Benos, Panos K. Chrysanthis

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 18, Issue 3

Pages 811 - 822

https://doi.org/10.1109/TCBB.2020.3019237

Published: 25 August 2020

Abstract

Genome sequencing technologies have the potential to transform clinical decision making and biomedical research by enabling high-throughput measurements of the genome at a granular level. However, to truly understand mechanisms of disease and predict the effects of medical interventions, high-throughput data must be integrated with demographic, phenotypic, environmental, and behavioral data from individuals. Further, effective knowledge discovery methods must infer relationships between these data types. We recently proposed a pipeline (CausalMGM) to achieve this. CausalMGM uses probabilistic graphical models to infer the relationships between variables in the data; however, CausalMGM’s graphical structure learning algorithm can only handle small datasets efficiently. We propose a new methodology (piPref-Div) that selects the most informative variables for CausalMGM, enabling it to scale. We validate the efficacy of piPref-Div against other feature selection methods and demonstrate how the use of the full pipeline improves breast cancer outcome prediction and provides biologically interpretable views of gene expression data.

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 18, Issue 3, May-June 2021 (issue length: 420 pages)

ISSN: 1545-5963

1545-5963 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Publisher

IEEE Computer Society Press, Washington, DC, United States
