PRINCE: Accurate Approximation of the Copy Number of Tandem Repeats

Abstract

Variable-Number Tandem Repeats (VNTRs) are genomic regions where a short sequence of DNA is repeated consecutively, with no space between the copies. While a fixed set of VNTRs is typically identified for a given species, the copy number at each VNTR varies between individuals within a species. Although VNTRs are found in both prokaryotic and eukaryotic genomes, the methodology called multi-locus VNTR analysis (MLVA) is most widely used to distinguish different strains of bacteria, as well as to cluster strains that might be epidemiologically related and to investigate evolutionary rates.
We propose PRINCE (Processing Reads to Infer the Number of Copies via Estimation), an algorithm that accurately estimates the copy number of a VNTR given the sequence of a single repeat unit and a set of short reads from a whole-genome sequencing (WGS) experiment. This is a challenging problem, especially when the repeat region is longer than the expected read length. Our proposed method computes a statistical approximation of the local coverage inside the repeat region. This approximation is then mapped to the copy number using a linear function whose parameters are fitted to simulated data. We test PRINCE on three datasets of Mycobacterium tuberculosis genomes and show that it is more than twice as accurate as a previous method.
An implementation of PRINCE in the Python language is freely available at https://github.com/WGS-TB/PythonPRINCE.
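
The final step described in the abstract, mapping a coverage-based approximation to a copy number through a fitted linear function, can be illustrated with a minimal sketch. This is not the PRINCE implementation: the function names, the use of NumPy least squares, and the toy calibration values are all assumptions made for illustration, and the computation of the coverage signal itself is left out.

```python
import numpy as np


def fit_linear_map(signals, copy_numbers):
    """Least-squares fit of copy_number ~ slope * signal + intercept.

    Hypothetical helper: `signals` stands in for the coverage-based
    approximation obtained from simulated genomes with known copy numbers.
    """
    slope, intercept = np.polyfit(signals, copy_numbers, deg=1)
    return slope, intercept


def estimate_copy_number(signal, slope, intercept):
    """Apply the fitted linear function to a new coverage signal."""
    return slope * signal + intercept


# Toy calibration set (made-up values, NOT data from the paper):
# the signal is generated as exactly 0.5 * copies + 0.1.
true_copies = np.arange(1, 7, dtype=float)
toy_signals = 0.5 * true_copies + 0.1

slope, intercept = fit_linear_map(toy_signals, true_copies)

# A new sample with signal 1.6 corresponds to 3 copies under this toy model.
estimate = estimate_copy_number(1.6, slope, intercept)
print(f"estimated copy number: {estimate:.2f}")
```

In practice the estimate would be rounded or otherwise post-processed according to the application; the sketch returns the raw value of the linear function.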

Authors Mehrdad Mansouri, Julian Booth, Margaryta Vityaz, Cedric Chauve, Leonid Chindelevitch
