More Web Proxy on the site http://driver.im/

research-article

WHAM: a high-throughput sequence alignment method

Authors:

Allison Terrell,

Jignesh M. PatelAuthors Info & Claims

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Pages 445 - 456

https://doi.org/10.1145/1989323.1989370

Published: 12 June 2011 Publication History

Abstract

Over the last decade the cost of producing genomic sequences has dropped dramatically due to the current so called "next-gen" sequencing methods. However, these next-gen sequencing methods are critically dependent on fast and sophisticated data processing methods for aligning a set of query sequences to a reference genome using rich string matching models. The focus of this work is on the design, development and evaluation of a data processing system for this crucial "short read alignment" problem. Our system, called WHAM, employs novel hash-based indexing methods and bitwise operations for sequence alignments. It allows richer match models than existing methods and it is significantly faster than the existing state-of-the-art method. In addition, its relative speedup over the existing method is poised to increase in the future in which read sequence lengths will increase. The WHAM code is available at http://www.cs.wisc.edu/wham/.

References

[1]

White House Press Release. Retrieved 2006-07--22. http://www.ornl.gov/sci/techresources/Human-Genome/project/clinton1.shtml.

[2]

M.-S. K. 0002, K.-Y. Whang, J.-G. Lee, and M.-J. Lee. n-gram/2l: A space and time efficient two-level n-gram inverted index structure. In VLDB, pages 325--336, 2005.

Digital Library

[3]

S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403--410, 1990.

[4]

A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006.

Digital Library

[5]

R. A. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74--82, 1992.

Digital Library

[6]

L. Ben, T. Cole, P. Mihai, and S. Steven. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology, 10(3):R25, 2009.

[7]

M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Digital SRC Research Report, 1994.

[8]

T. Han, S. Ko, and J. Kang. Efficient Subsequence Matching Using the Longest Common Subsequence with a Dual Match Index. Machine Learning and Data Mining in Pattern Recognition, pages 585--600, 2007.

Digital Library

[9]

E. Karakoc, Z. Ozsoyoglu, S. Sahinalp, M. Tasan, and X. Zhang. Novel approaches to biomolecular sequence indexing. Data Engineering, 1001:40, 2004.

[10]

D. E. Knuth. The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1. Addison-Wesley Professional, Jan. 2011.

Digital Library

[11]

C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303--314, 2007.

Digital Library

[12]

H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research, 18(11):1851, 2008.

[13]

R. Li, Y. Li, K. Kristiansen, and J. Wang. SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5):713, 2008.

Digital Library

[14]

W. Litwin, R. Mokadem, P. Rigaux, and T. J. E. Schwarz. Fast ngram-based string search over data encoded using algebraic signatures. In VLDB, pages 207--218, 2007.

Digital Library

[15]

J. D. McPherson. Next-generation gap. Nature Methods, 6(11s):S2--S5, October 2009.

[16]

S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443--453, 1970.

[17]

P. Papapetrou, V. Athitsos, G. Kollios, and D. Gunopulos. Reference-based alignment in large sequence databases. PVLDB, 2(1):205--216, 2009.

Digital Library

[18]

J. Rao and K. A. Ross. Cache conscious indexing for decision-support in main memory. In VLDB, pages 78--89, 1999.

Digital Library

[19]

R. L. Rivest. Partial-match retrieval algorithms. SIAM J. Comput., 5(1):19--50, 1976.

Digital Library

[20]

T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.

[21]

Venter, et al. The Sequence of the Human Genome. Science, 291(5507):1304--1351, 2001.

[22]

S. Wu and U. Manber. Fast text searching allowing errors. Commun. ACM, 35(10):83--91, 1992.

Digital Library

[23]

X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD Conference, pages 353--364, 2008.

Digital Library

Cited By

Alser MRotman JDeshpande DTaraszka KShi HBaykal PYang HXue VKnyazev SSinger BBalliu BKoslicki DSkums PZelikovsky AAlkan CMutlu OMangul S(2021)Technology dictates algorithms: recent developments in read alignmentGenome Biology10.1186/s13059-021-02443-722:1Online publication date: 26-Aug-2021
https://doi.org/10.1186/s13059-021-02443-7
Lu MLuo Q(2019)Accelerating Large-Scale Genome-Wide Association Studies With Graphics ProcessorsBiotechnology10.4018/978-1-5225-8903-7.ch017(428-461)Online publication date: 2019
https://doi.org/10.4018/978-1-5225-8903-7.ch017
Wei HYu JLu C(2018)String Similarity Search: A Hash-Based ApproachIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2017.275693230:1(170-184)Online publication date: 1-Jan-2018
https://doi.org/10.1109/TKDE.2017.2756932
Show More Cited By

Index Terms

WHAM: a high-throughput sequence alignment method
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Information retrieval query processing

Recommendations

WHAM: A High-Throughput Sequence Alignment Method

Over the last decade, the cost of producing genomic sequences has dropped dramatically due to the current so-called next-generation sequencing methods. However, these next-generation sequencing methods are critically dependent on fast and sophisticated ...
EXONSCAN: EXON prediction with Signal detection and Coding region AligNment in homologous sequences
SAC '05: Proceedings of the 2005 ACM symposium on Applied computing

Identifying the protein coding genes in the genomic sequences is a very important application and challenging work. A great number of computational gene prediction programs have been proposed with satisfied sensitivity and specificity at nucleotide ...
A software system for gene sequence database construction based on fast approximate string matching

We propose a web-based software system for sequence acquisition and database construction. An example application of this system is to construct a ribosomal RNA gene (rDNA) sequence database to facilitate the study of microbial communities. A fast and ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

June 2011

1364 pages

ISBN:9781450306614

DOI:10.1145/1989323

General Chair:
Timos Sellis
IMIS/RC Athena
,
Program Chair:
Renée J. Miller
University of Toronto
,
Publications Chairs:
Anastasios Kementsietsidis
IBM T.J. Watson Research Center
,
Yannis Velegrakis
University of Trento

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2011

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS '11

Sponsor:

SIGMOD

SIGMOD/PODS '11: International Conference on Management of Data

June 12 - 16, 2011

Athens, Greece

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

20
Total Citations
View Citations
371
Total Downloads

Downloads (Last 12 months)3
Downloads (Last 6 weeks)0

Reflects downloads up to 22 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Alser MRotman JDeshpande DTaraszka KShi HBaykal PYang HXue VKnyazev SSinger BBalliu BKoslicki DSkums PZelikovsky AAlkan CMutlu OMangul S(2021)Technology dictates algorithms: recent developments in read alignmentGenome Biology10.1186/s13059-021-02443-722:1Online publication date: 26-Aug-2021
https://doi.org/10.1186/s13059-021-02443-7
Lu MLuo Q(2019)Accelerating Large-Scale Genome-Wide Association Studies With Graphics ProcessorsBiotechnology10.4018/978-1-5225-8903-7.ch017(428-461)Online publication date: 2019
https://doi.org/10.4018/978-1-5225-8903-7.ch017
Wei HYu JLu C(2018)String Similarity Search: A Hash-Based ApproachIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2017.275693230:1(170-184)Online publication date: 1-Jan-2018
https://doi.org/10.1109/TKDE.2017.2756932
Roy ADiao YEvani UAbhyankar AHowarth CLe Priol RBloom TChirkova RYang JSuciu D(2017)Massively Parallel Processing of Whole Genome Sequence DataProceedings of the 2017 ACM International Conference on Management of Data10.1145/3035918.3064048(187-202)Online publication date: 9-May-2017
https://dl.acm.org/doi/10.1145/3035918.3064048
Yang XWang YWang BWang WSellis TDavidson SIves Z(2015)Local FilteringProceedings of the 2015 ACM SIGMOD International Conference on Management of Data10.1145/2723372.2749445(377-392)Online publication date: 27-May-2015
https://dl.acm.org/doi/10.1145/2723372.2749445
Nothaft FMassie MDanford TZhang ZLaserson UYeksigian CKottalam JAhuja AHammerbacher JLinderman MFranklin MJoseph APatterson DSellis TDavidson SIves Z(2015)Rethinking Data-Intensive Science Using Scalable Analytics SystemsProceedings of the 2015 ACM SIGMOD International Conference on Management of Data10.1145/2723372.2742787(631-646)Online publication date: 27-May-2015
https://dl.acm.org/doi/10.1145/2723372.2742787
Lu MLuo Q(2014)Accelerating Large-Scale Genome-Wide Association Studies with Graphics ProcessorsBig Data Management, Technologies, and Applications10.4018/978-1-4666-4699-5.ch014(349-380)Online publication date: 2014
https://doi.org/10.4018/978-1-4666-4699-5.ch014
Wandelt SDeng DGerdjikov SMishra SMitankin PPatil MSiragusa ETiskin AWang WWang JLeser U(2014)State-of-the-art in string similarity search and joinACM SIGMOD Record10.1145/2627692.262770643:1(64-76)Online publication date: 13-May-2014
https://dl.acm.org/doi/10.1145/2627692.2627706
Chen YSchmidt BMaskell D(2013)A hybrid short read mapping acceleratorBMC Bioinformatics10.1186/1471-2105-14-6714:1Online publication date: 26-Feb-2013
https://doi.org/10.1186/1471-2105-14-67
Qin JWang WXiao CLu YLin XWang H(2013)Asymmetric signature schemes for efficient exact edit similarity query processingACM Transactions on Database Systems10.1145/2508020.250802338:3(1-44)Online publication date: 5-Sep-2013
https://dl.acm.org/doi/10.1145/2508020.2508023
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents