DOI: 10.1145/2783258.2788609
research-article
Open access

Predicting Future Scientific Discoveries Based on a Networked Analysis of the Past Literature

Published: 10 August 2015

Abstract

We present KnIT, the Knowledge Integration Toolkit, a system for accelerating scientific discovery and predicting previously unknown protein-protein interactions. Such predictions enrich biological research and are pertinent to drug discovery and the understanding of disease. Unlike a prior study, KnIT is now fully automated and demonstrably scalable. It extracts information from the scientific literature, automatically identifying direct and indirect references to protein interactions, which is knowledge that can be represented in network form. It then reasons over this network with techniques such as matrix factorization and graph diffusion to predict new, previously unknown interactions. The accuracy and scope of KnIT's knowledge extractions are validated using comparisons to structured, manually curated data sources as well as by performing retrospective studies that predict subsequent literature discoveries using literature available prior to a given date. The KnIT methodology is a step towards automated hypothesis generation from text, with potential application to other scientific domains.
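To make the idea of "reasoning over the network" concrete, the following is a minimal sketch of one generic form of graph diffusion, a random walk with restart, used to score pairs of nodes that are not yet linked. It is not KnIT's implementation: the abstract does not specify the diffusion variant, and the function names, the `restart` and `iterations` parameters, and the placeholder protein labels P1 to P5 are all assumptions made for illustration.

```python
from itertools import combinations


def diffuse_scores(edges, restart=0.3, iterations=50):
    """Run a random walk with restart from every node of an undirected graph.

    Hypothetical helper: returns (nodes, neighbors, scores), where
    scores[seed][node] approximates the probability of finding a walker
    at `node` when it keeps restarting from `seed`.
    """
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    scores = {}
    for seed in nodes:
        # Start with all probability mass on the seed node.
        p = {n: 1.0 if n == seed else 0.0 for n in nodes}
        for _ in range(iterations):
            # A fraction `restart` of the mass jumps back to the seed ...
            nxt = {n: (restart if n == seed else 0.0) for n in nodes}
            # ... and the rest spreads evenly to each node's neighbors.
            for n in nodes:
                share = (1.0 - restart) * p[n] / len(neighbors[n])
                for m in neighbors[n]:
                    nxt[m] += share
            p = nxt
        scores[seed] = p
    return nodes, neighbors, scores


def rank_candidate_links(edges, **kwargs):
    """Rank unlinked node pairs by their averaged diffusion score."""
    nodes, neighbors, scores = diffuse_scores(edges, **kwargs)
    candidates = []
    for a, b in combinations(nodes, 2):
        if b not in neighbors[a]:
            # Average both directions so the pair score is symmetric.
            candidates.append(((scores[a][b] + scores[b][a]) / 2.0, a, b))
    return sorted(candidates, reverse=True)


if __name__ == "__main__":
    # Toy network with placeholder labels; not data from the paper.
    toy_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
                 ("P3", "P4"), ("P4", "P5")]
    for score, a, b in rank_candidate_links(toy_edges):
        print(f"{a} - {b}: {score:.4f}")
```

Pairs that share many short connecting paths receive higher scores, which is the intuition behind treating high-scoring unlinked pairs as candidate interactions. A retrospective study of the kind the abstract describes would build `edges` only from literature published before a cutoff date and check whether the top-ranked pairs appear in later publications.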

Supplementary Material

MP4 File (p2019.mp4)

      Published In

      KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
      August 2015
      2378 pages
      ISBN: 9781450336642
      DOI: 10.1145/2783258
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 10 August 2015

      Author Tags

      1. hypothesis generation
      2. scientific discovery
      3. text mining

      Qualifiers

      • Research-article

      Conference

      KDD '15

      Acceptance Rates

      KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

      Article Metrics

      • Downloads (last 12 months): 157
      • Downloads (last 6 weeks): 15
      Reflects downloads up to 11 Dec 2024

      Cited By

      • (2023) IBM Watson AI-enhanced search tool identifies novel candidate genes and provides insight into potential pathomechanisms of traumatic heterotopic ossification. Burns Open 7:4 (126-138). DOI: 10.1016/j.burnso.2023.07.001. Online publication date: Oct-2023
      • (2023) Explainable Drug Repurposing in Context via Deep Reinforcement Learning. The Semantic Web (3-20). DOI: 10.1007/978-3-031-33455-9_1. Online publication date: 28-May-2023
      • (2021) A Systematic Analysis of Link Prediction in Complex Network. IEEE Access 9 (20531-20541). DOI: 10.1109/ACCESS.2021.3053995. Online publication date: 2021
      • (2021) Predicting unknown directed links of conserved networks from flow data. Journal of Complex Networks 9:6. DOI: 10.1093/comnet/cnab037. Online publication date: 18-Nov-2021
      • (2021) Discovering Research Hypotheses in Social Science Using Knowledge Graph Embeddings. The Semantic Web (477-494). DOI: 10.1007/978-3-030-77385-4_28. Online publication date: 6-Jun-2021
      • (2019) A systematic review on literature-based discovery workflow. PeerJ Computer Science 5 (e235). DOI: 10.7717/peerj-cs.235. Online publication date: 18-Nov-2019
      • (2019) Identification of pharmacodynamic biomarker hypotheses through literature analysis with IBM Watson. PLOS ONE 14:4 (e0214619). DOI: 10.1371/journal.pone.0214619. Online publication date: 8-Apr-2019
      • (2019) What drives research efforts? Proceedings of the 18th Joint Conference on Digital Libraries (217-226). DOI: 10.1109/JCDL.2019.00038. Online publication date: 2-Jun-2019
      • (2019) The Reciprocal Roles of Artificial Intelligence and Industrial-Organizational Psychology. The Cambridge Handbook of Technology and Employee Behavior (38-56). DOI: 10.1017/9781108649636.004. Online publication date: 18-Feb-2019
      • (2019) A Persistent Homology Perspective to the Link Prediction Problem. Complex Networks and Their Applications VIII (27-39). DOI: 10.1007/978-3-030-36687-2_3. Online publication date: 26-Nov-2019
