TExCNN: Leveraging Pre-Trained Models to Predict Gene Expression from Genomic Sequences
Figure 1. The datasets used in this study. The features include DNA sequences, half-life data, and transcription factor target data. The labels correspond to gene expression values.
Figure 2. Performance comparison between DeepLncLoc and TExCNN. Different feature configurations are evaluated, including (a) DNA sequence features only, (b) DNA sequence and mRNA half-life features, and (c) DNA sequence, mRNA half-life, and TF features. The horizontal axis represents the minimum (Min), average (Avg), and maximum (Max) values of each model’s 10 independent runs. The vertical axis shows the R² values.
Figure 3. Evaluation of different feature combinations on the extended dataset. Different feature configurations have been evaluated, including (1) DNA sequence features only (DNA), (2) DNA sequence and mRNA half-life features (DNA + half-life), and (3) DNA sequence, mRNA half-life, and TF features (DNA + half-life + TF). The comparison is between the TExCNN models with (a) 10,500-bp and (b) 50,000-bp input lengths. The horizontal axis represents the minimum (Min), average (Avg), and maximum (Max) values of each model’s 10 independent runs. The experiment is conducted on the extended dataset with 50,000-bp sequence length. The vertical axis gives the R² values.
Figure 4. The R² values of predicting gene expression levels on 57 cells and tissues.
Figure 5. Evaluation of TExCNN trained on four different datasets. The horizontal axis represents the minimum (Min), average (Avg), and maximum (Max) values of each model’s 10 independent runs. The vertical axis gives the R² values.
Figure A1. The comparison experiment between different pre-processing methods. The horizontal axis represents the minimum (Min), average (Avg), and maximum (Max) values of each model’s 10 independent runs. The vertical axis gives the R² values.
Figure A2. The ablation experiment of TExCNN. The horizontal axis represents the minimum (Min), average (Avg), and maximum (Max) values of each model’s 10 independent runs. The vertical axis gives the R² values.
Figure A3. The results of the 10-fold validation test. Different feature configurations have been evaluated, including (1) DNA sequence features only (DNA), (2) DNA sequence and mRNA half-life features (DNA + half-life), and (3) DNA sequence, mRNA half-life, and TF features (DNA + half-life + TF). The comparison is between the TExCNN models with (a) 10,500-bp and (b) 50,000-bp input lengths. The horizontal axis represents the minimum (Min), average (Avg), and maximum (Max) values of each model’s 10 independent runs. The vertical axis gives the R² values.
Figure A4. The results for the independent dataset. Different feature configurations have been evaluated, including (1) DNA sequence features only (DNA), (2) DNA sequence and mRNA half-life features (DNA + half-life), and (3) DNA sequence, mRNA half-life, and TF features (DNA + half-life + TF). The comparison is between the TExCNN models with (a) 10,500-bp and (b) 50,000-bp input lengths. The horizontal axis represents the minimum (Min), average (Avg), and maximum (Max) values of each model’s 10 independent runs. The vertical axis gives the R² values.
Abstract
1. Introduction
2. Datasets and Methods
2.1. Xpresso Dataset
2.2. DeepLncLoc Dataset
2.3. Extended Dataset
2.4. Pre-Trained Models
2.5. Sequence Representations
2.6. TExCNN Model with a 10,500-bp Input Length
2.7. TExCNN Model with a 50,000-bp Input Length
- Word vectors from the 10,000 bp upstream and 10,000 bp downstream sequences generated by DNABERT;
- Word vectors from the 10,000 bp upstream and 10,000 bp downstream sequences generated by DNABERT-2;
- Word vectors from the 25,000 bp upstream and 25,000 bp downstream sequences generated by DNABERT;
- Word vectors from the 25,000 bp upstream and 25,000 bp downstream sequences generated by DNABERT-2.
3. Results
3.1. Experimental Settings
3.2. Comparison Between TExCNN, DeepLncLoc, and DNAPerceiver
3.3. Evaluation of Different Feature Combinations
3.4. Evaluation of Different Pre-Trained Model Combinations
3.5. Evaluation of Extended Input Length
3.6. Application of TExCNN Models on Different Cells and Tissues
3.7. Which Part of the Sequence Affects Gene Regulation the Most
- The 3000-bp upstream and 3000-bp downstream sequences;
- The 6000-bp upstream and 6000-bp downstream sequences;
- The 10,000-bp upstream and 10,000-bp downstream sequences;
- The 25,000-bp upstream and 25,000-bp downstream sequences.
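The four windows listed above can be obtained with a simple symmetric crop around a reference position. The following is a minimal sketch rather than the authors' code: it assumes the reference position (for example, a transcription start site) lies at the midpoint of the stored sequence, and the names `crop_flanks`, `FLANK_SIZES`, and `toy_sequence` are hypothetical.

```python
# Hypothetical helper: cut symmetric upstream/downstream flanks around a center.
FLANK_SIZES = (3_000, 6_000, 10_000, 25_000)  # flank lengths in bp, from the list above


def crop_flanks(sequence, flank, center=None):
    """Return (upstream, downstream) flanks of `flank` bp around `center`.

    `center` defaults to the midpoint of `sequence`, which is an assumption
    of this sketch and not something stated in the text.
    """
    if center is None:
        center = len(sequence) // 2
    if flank > center or center + flank > len(sequence):
        raise ValueError("flank does not fit inside the sequence")
    upstream = sequence[center - flank:center]
    downstream = sequence[center:center + flank]
    return upstream, downstream


# Toy 50,000-bp sequence standing in for a real genomic region.
toy_sequence = "ACGT" * 12_500

# One (upstream, downstream) pair per window size.
windows = {flank: crop_flanks(toy_sequence, flank) for flank in FLANK_SIZES}
```

Because every window is centered on the same position, each smaller window is contained in the larger ones, so differences between windows reflect only the added distal sequence.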
4. Discussion
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
- The methods in Section 2.5;
- When leveraging DNABERT-2, splitting the sequence into fragments of 2000 bp with a 250 bp step size;
- When leveraging DNABERT-2, splitting the sequence into fragments of 2000 bp with a 1000 bp step size;
- Using the same splitting methods in Section 2.5 and replacing the average vectors of embeddings with the maximum vectors of embeddings.
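The pre-processing variants above differ in two knobs: the sliding-window step used to cut 2000-bp fragments, and whether the embeddings of a fragment are averaged or max-pooled. The sketch below illustrates both under stated assumptions: it does not call DNABERT-2, the small numeric lists stand in for real token embeddings, only full-length fragments are kept, and `split_into_fragments` and `pool_embeddings` are hypothetical names.

```python
def split_into_fragments(sequence, fragment_len=2_000, step=250):
    """Split `sequence` into overlapping fragments with a sliding window.

    Only full-length fragments are returned; how a shorter tail should be
    handled is not specified here and is left out of this sketch.
    """
    if fragment_len <= 0 or step <= 0:
        raise ValueError("fragment_len and step must be positive")
    last_start = len(sequence) - fragment_len
    return [sequence[s:s + fragment_len] for s in range(0, last_start + 1, step)]


def pool_embeddings(token_vectors, mode="mean"):
    """Collapse per-token vectors (equal-length lists) into one vector."""
    if not token_vectors:
        raise ValueError("token_vectors must not be empty")
    columns = list(zip(*token_vectors))  # one tuple per embedding dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError("mode must be 'mean' or 'max'")


# Toy 10,000-bp sequence; real inputs would be genomic DNA.
toy_sequence = "ACGT" * 2_500
fragments_step_250 = split_into_fragments(toy_sequence, 2_000, 250)
fragments_step_1000 = split_into_fragments(toy_sequence, 2_000, 1_000)

# Toy token embeddings for one fragment (2 tokens, 2 dimensions).
toy_tokens = [[1.0, 4.0], [3.0, 2.0]]
mean_vector = pool_embeddings(toy_tokens, "mean")
max_vector = pool_embeddings(toy_tokens, "max")
```

On the 10,000-bp toy input, the 250-bp step yields 33 fragments and the 1000-bp step yields 9, which shows how the step size trades the number of vectors per sequence against overlap between neighboring fragments.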
Appendix B
Appendix C
Appendix D
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Dong, G.; Wu, Y.; Huang, L.; Li, F.; Zhou, F. TExCNN: Leveraging Pre-Trained Models to Predict Gene Expression from Genomic Sequences. Genes 2024, 15, 1593. https://doi.org/10.3390/genes15121593