DOI: 10.1145/3107411.3107483
short-paper

A Novel Approach for Classifying Gene Expression Data using Topic Modeling

Published: 20 August 2017

Abstract

Understanding the role of differential gene expression in cancer etiology and cellular processes is a complex problem that continues to pose a challenge due to the sheer number of genes and interrelated biological processes involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to mitigate overfitting on high-dimensional gene expression data and to facilitate understanding of the associated pathways. LDA has recently been applied to clustering and exploring genomic data, but not to classification and prediction. Here, we propose to use LDA for clustering as well as for classifying cancer and healthy tissues, using lung cancer and breast cancer messenger RNA (mRNA) sequencing data. We describe our study in three phases: clustering, classification, and gene interpretation. First, LDA is used as a clustering algorithm to group the data in an unsupervised manner. Next, we develop a novel LDA-based classification approach that classifies unknown samples based on the similarity of their co-expression patterns. An evaluation of this approach shows that LDA can achieve high accuracy compared to alternative approaches. Lastly, we present a functional analysis of the genes identified using a novel topic profile matrix formulation. This analysis identified several genes and pathways that could potentially be involved in differentiating tumor samples from normal samples. Overall, our results suggest that LDA is a promising approach for classifying tissue types based on gene expression data in cancer studies.
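
The general idea of treating samples as documents and genes as words can be illustrated with a minimal sketch. The abstract does not specify the paper's classification rule or its topic profile matrix, so everything below is an assumption made for illustration: the toy counts are made up, the genes are hypothetical, LDA is fitted with a small collapsed Gibbs sampler, and the unknown sample is labeled by the nearest class centroid in topic space (a stand-in rule, not the authors' method). The helper names `counts_to_tokens`, `fit_lda`, and `nearest_centroid` are likewise hypothetical.

```python
import random


def counts_to_tokens(counts):
    """Turn integer expression counts into a 'document':
    gene index g appears counts[g] times."""
    return [g for g, c in enumerate(counts) for _ in range(c)]


def fit_lda(docs, n_topics, n_genes, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-sample topic proportions."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_gene = [[0] * n_genes for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for g, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_gene[z][g] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, g in enumerate(doc):
                # Remove the token, then resample its topic from the conditional.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_gene[z][g] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_gene[k][g] + beta)
                    / (topic_total[k] + n_genes * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_gene[z][g] += 1
                topic_total[z] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]


def nearest_centroid(theta, labels, query):
    """Label of the class whose mean topic vector is closest to theta[query]."""
    best_label, best_dist = None, float("inf")
    for label in sorted({lab for lab in labels if lab is not None}):
        rows = [theta[i] for i, lab in enumerate(labels) if lab == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, theta[query]))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Made-up toy data: rows are samples, columns are 6 hypothetical genes.
counts = [
    [9, 8, 7, 0, 1, 0],  # labeled tumor
    [8, 9, 6, 1, 0, 0],  # labeled tumor
    [0, 1, 0, 9, 8, 7],  # labeled normal
    [1, 0, 0, 8, 7, 9],  # labeled normal
    [7, 9, 8, 0, 0, 1],  # unknown sample to classify
]
labels = ["tumor", "tumor", "normal", "normal", None]

docs = [counts_to_tokens(row) for row in counts]
theta = fit_lda(docs, n_topics=2, n_genes=6)
predicted = nearest_centroid(theta, labels, query=4)
print(predicted)
```

In this sketch the unknown sample is included when fitting the topic model (no labels are used at that stage), and only the labeled samples contribute to the class centroids. A practical pipeline would more likely rely on an established LDA implementation and infer topic proportions for held-out samples separately.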

      Published In

      ACM-BCB '17: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
      August 2017
      800 pages
      ISBN:9781450347228
      DOI:10.1145/3107411

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cancer
      2. classification
      3. clustering
      4. gene expression
      5. latent dirichlet allocation
      6. machine learning
      7. topic modeling

      Qualifiers

      • Short-paper

      Conference

      BCB '17
      Acceptance Rates

      ACM-BCB '17 Paper Acceptance Rate 42 of 132 submissions, 32%;
      Overall Acceptance Rate 254 of 885 submissions, 29%

      Article Metrics

      • Downloads (Last 12 months): 15
      • Downloads (Last 6 weeks): 3
      Reflects downloads up to 14 Jan 2025

      Cited By

      • (2024) SVAD: Stacked Variational Autoencoder Deep Neural Network-Based Dimensionality Reduction and Classification of Small Sample Size and High Dimensional Data. SN Computer Science 5:7. DOI: 10.1007/s42979-024-03294-2. Online publication date: 12-Oct-2024
      • (2023) Topic modeling algorithms and applications. Information Systems 112:C. DOI: 10.1016/j.is.2022.102131. Online publication date: 1-Feb-2023
      • (2023) A Novel Approach for Visualizing Medical Big Data Using Variational Autoencoders. ICDSMLA 2021 (337-346). DOI: 10.1007/978-981-19-5936-3_31. Online publication date: 7-Feb-2023
      • (2020) Prediction of lncRNA-Cancer Association Using Topic Model on Graphs. Advances in Machine Learning and Computational Intelligence (311-319). DOI: 10.1007/978-981-15-5243-4_28. Online publication date: 26-Jul-2020
      • (2019) High-Dimensional Limited-Sample Biomedical Data Classification Using Variational Autoencoder. Alpha-Synuclein (30-42). DOI: 10.1007/978-981-13-6661-1_3. Online publication date: 16-Feb-2019
      • (2018) Discovering Student Behavior Patterns from Event Logs: Preliminary Results on a Novel Probabilistic Latent Variable Model. 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT) (207-211). DOI: 10.1109/ICALT.2018.00056. Online publication date: Jul-2018
      • (2017) Latent Dirichlet Allocation for Classification using Gene Expression Data. 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE) (39-44). DOI: 10.1109/BIBE.2017.00-81. Online publication date: Oct-2017
