Abstract
Alignment is the first step in most RNA-seq analysis pipelines, and the accuracy of downstream analyses depends heavily on it. Unlike most steps in the pipeline, alignment is particularly amenable to benchmarking with simulated data. We comprehensively benchmarked 14 common splice-aware aligners for base-, read-, and exon junction-level accuracy and compared default with optimized parameters. We found that performance varied by genome complexity and that accuracy and popularity were poorly correlated. The most widely cited tool underperformed on most metrics, particularly with default settings.
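To illustrate what base-, read-, and junction-level accuracy can mean when the true origin of every simulated read is known, here is a minimal Python sketch. It is not the authors' benchmarking code (their scripts are in the Supplementary Software). The data layout, the function names, and the rule that a read counts as correct when at least one of its bases lands on its true coordinate are all assumptions made for this example, and chromosomes are ignored at the base level for brevity.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical layout: for each read ID, one entry per base giving the genomic
# coordinate that base maps to, or None if the base is unaligned.
BasePositions = Dict[str, List[Optional[int]]]
# Hypothetical junction key: (chromosome, intron start, intron end).
Junction = Tuple[str, int, int]


def base_level_accuracy(truth: BasePositions, aligned: BasePositions) -> float:
    """Fraction of simulated bases placed at their true coordinate."""
    total = 0
    correct = 0
    for read_id, true_pos in truth.items():
        # A read missing from the aligner output counts as fully unaligned.
        observed = aligned.get(read_id, [None] * len(true_pos))
        total += len(true_pos)
        correct += sum(
            1 for t, o in zip(true_pos, observed) if o is not None and o == t
        )
    return correct / total if total else 0.0


def read_level_accuracy(truth: BasePositions, aligned: BasePositions) -> float:
    """Fraction of reads with at least one correctly placed base (assumed rule)."""
    if not truth:
        return 0.0
    correct_reads = 0
    for read_id, true_pos in truth.items():
        observed = aligned.get(read_id, [])
        if any(o is not None and o == t for t, o in zip(true_pos, observed)):
            correct_reads += 1
    return correct_reads / len(truth)


def junction_precision_recall(
    true_junctions: Set[Junction], called_junctions: Set[Junction]
) -> Tuple[float, float]:
    """Precision and recall of called splice junctions against the simulation."""
    true_positives = len(true_junctions & called_junctions)
    precision = true_positives / len(called_junctions) if called_junctions else 0.0
    recall = true_positives / len(true_junctions) if true_junctions else 0.0
    return precision, recall


# Toy data: read r2 spans a junction, and the aligner leaves its second half unaligned.
truth = {"r1": [100, 101, 102, 103], "r2": [500, 501, 900, 901]}
aligned = {"r1": [100, 101, 102, 103], "r2": [500, 501, None, None]}

base_acc = base_level_accuracy(truth, aligned)
read_acc = read_level_accuracy(truth, aligned)
precision, recall = junction_precision_recall(
    {("chr1", 501, 900)}, {("chr1", 501, 900), ("chr1", 10, 20)}
)
print(base_acc, read_acc, precision, recall)
```

In the toy data the read-level score stays perfect while the base-level score drops, which is why a benchmark of this kind reports the levels separately.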
Change history
22 December 2016
In the version of this analysis initially published online, the first sentence of Supplementary Note 7 was incorrect; it has been corrected to read "Computational performance refers to how long it takes the alignment to run and how much memory it requires." Supplementary Note 7 has also been removed and its text included in the new Supplementary Note 8. The format for the supplementary information titles was incorrect; these have been updated to the standard format. The supplementary figures and notes have been renumbered to reflect callouts in the main text. The supplementary figures have been renumbered: Supplementary Figures 6–14 are now Supplementary Figures 2–10, Supplementary Figure 5 is now Supplementary Figure 11, and Supplementary Figures 2–4 are now Supplementary Figures 12–14. The supplementary notes have also been renumbered: Supplementary Note 5 is now Supplementary Note 1, Supplementary Note 1 is now Supplementary Note 2, Supplementary Note 10 is now Supplementary Note 3, Supplementary Notes 2–6 are now Supplementary Notes 4–7, and Supplementary Note 11 is now Supplementary Note 10. These errors have been corrected in this file as of 22 December 2016.
Acknowledgements
We thank A. Srinivasan for his help administering the PMACS cluster. We thank N. Lahens, T. Grosser, D. Sarantopoulou, F. Coldren, E. Scarci, and E. Ricciotti for support and helpful discussions. This work was funded in part by the National Heart, Lung, and Blood Institute (U54HL117798, G.A.F.) and the National Center for Advancing Translational Sciences (UL1-TR-001878, G.A.F.).
Author information
Contributions
G.B. contributed research, analysis, and writing. K.E.H. contributed analysis, figures, and benchmarking scripts. E.J.K. contributed analysis. B.D.C. contributed analysis and formulation of ideas. G.A.F. contributed formulation of ideas and direction. G.R.G. contributed the simulated data, direction, ideas, and writing.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–15, Supplementary Notes 1–10 and Supplementary Tables 1–43. (PDF 6431 kb)
Supplementary Data 1
Information about the tools involved in the comparison. (XLSX 54 kb)
Supplementary Data 2
Statistics and accuracy metrics of tweaked alignment on Human. (XLSX 59 kb)
Supplementary Data 3
Statistics and accuracy metrics of tweaked alignment on Malaria. (XLSX 1629 kb)
Supplementary Data 4
Statistics and accuracy metrics of default alignment on Human and Malaria (latest tool versions). (XLSX 65 kb)
Supplementary Data 5
Statistics and accuracy metrics of default alignment on Human. (XLSX 17 kb)
Supplementary Data 6
Statistics and accuracy metrics of default alignment on Malaria. (XLSX 17 kb)
Supplementary Data 7
Statistics and accuracy metrics achieved by the best tweaked alignment on Human. (XLSX 26 kb)
Supplementary Data 8
Statistics and accuracy metrics achieved by the best tweaked alignment on Malaria. (XLSX 72 kb)
Supplementary Data 9
Statistics and accuracy metrics of default alignment on Human including/omitting annotation. (XLSX 50 kb)
Supplementary Data 10
Statistics and accuracy metrics of default alignment on Malaria including/omitting annotation. (XLSX 31 kb)
Supplementary Data 11
Computational performance metrics of default alignment on Human. (XLS 78 kb)
Supplementary Data 12
Computational performance metrics of default alignment on Malaria. (XLS 78 kb)
Supplementary Data 13
Statistics and accuracy metrics of short anchored reads alignment on Human. (XLSX 310 kb)
Supplementary Data 14
Statistics and accuracy metrics of simulated adapters alignment on Human. (XLSX 313 kb)
Supplementary Data 15
Statistics and accuracy metrics of canonical and noncanonical junctions on Human. (XLSX 148 kb)
Supplementary Software
All scripts used in this analysis. (ZIP 3592 kb)
About this article
Cite this article
Baruzzo, G., Hayer, K., Kim, E. et al. Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat Methods 14, 135–139 (2017). https://doi.org/10.1038/nmeth.4106