[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1989323.1989370acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

WHAM: a high-throughput sequence alignment method

Published: 12 June 2011 Publication History

Abstract

Over the last decade the cost of producing genomic sequences has dropped dramatically due to the current so called "next-gen" sequencing methods. However, these next-gen sequencing methods are critically dependent on fast and sophisticated data processing methods for aligning a set of query sequences to a reference genome using rich string matching models. The focus of this work is on the design, development and evaluation of a data processing system for this crucial "short read alignment" problem. Our system, called WHAM, employs novel hash-based indexing methods and bitwise operations for sequence alignments. It allows richer match models than existing methods and it is significantly faster than the existing state-of-the-art method. In addition, its relative speedup over the existing method is poised to increase in the future in which read sequence lengths will increase. The WHAM code is available at http://www.cs.wisc.edu/wham/.

References

[1]
White House Press Release. Retrieved 2006-07--22. http://www.ornl.gov/sci/techresources/Human-Genome/project/clinton1.shtml.
[2]
M.-S. K. 0002, K.-Y. Whang, J.-G. Lee, and M.-J. Lee. n-gram/2l: A space and time efficient two-level n-gram inverted index structure. In VLDB, pages 325--336, 2005.
[3]
S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403--410, 1990.
[4]
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006.
[5]
R. A. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74--82, 1992.
[6]
L. Ben, T. Cole, P. Mihai, and S. Steven. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology, 10(3):R25, 2009.
[7]
M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Digital SRC Research Report, 1994.
[8]
T. Han, S. Ko, and J. Kang. Efficient Subsequence Matching Using the Longest Common Subsequence with a Dual Match Index. Machine Learning and Data Mining in Pattern Recognition, pages 585--600, 2007.
[9]
E. Karakoc, Z. Ozsoyoglu, S. Sahinalp, M. Tasan, and X. Zhang. Novel approaches to biomolecular sequence indexing. Data Engineering, 1001:40, 2004.
[10]
D. E. Knuth. The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1. Addison-Wesley Professional, Jan. 2011.
[11]
C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303--314, 2007.
[12]
H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research, 18(11):1851, 2008.
[13]
R. Li, Y. Li, K. Kristiansen, and J. Wang. SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5):713, 2008.
[14]
W. Litwin, R. Mokadem, P. Rigaux, and T. J. E. Schwarz. Fast ngram-based string search over data encoded using algebraic signatures. In VLDB, pages 207--218, 2007.
[15]
J. D. McPherson. Next-generation gap. Nature Methods, 6(11s):S2--S5, October 2009.
[16]
S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443--453, 1970.
[17]
P. Papapetrou, V. Athitsos, G. Kollios, and D. Gunopulos. Reference-based alignment in large sequence databases. PVLDB, 2(1):205--216, 2009.
[18]
J. Rao and K. A. Ross. Cache conscious indexing for decision-support in main memory. In VLDB, pages 78--89, 1999.
[19]
R. L. Rivest. Partial-match retrieval algorithms. SIAM J. Comput., 5(1):19--50, 1976.
[20]
T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
[21]
Venter, et al. The Sequence of the Human Genome. Science, 291(5507):1304--1351, 2001.
[22]
S. Wu and U. Manber. Fast text searching allowing errors. Commun. ACM, 35(10):83--91, 1992.
[23]
X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD Conference, pages 353--364, 2008.

Cited By

View all
  • (2021)Technology dictates algorithms: recent developments in read alignmentGenome Biology10.1186/s13059-021-02443-722:1Online publication date: 26-Aug-2021
  • (2019)Accelerating Large-Scale Genome-Wide Association Studies With Graphics ProcessorsBiotechnology10.4018/978-1-5225-8903-7.ch017(428-461)Online publication date: 2019
  • (2018)String Similarity Search: A Hash-Based ApproachIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2017.275693230:1(170-184)Online publication date: 1-Jan-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
June 2011
1364 pages
ISBN:9781450306614
DOI:10.1145/1989323
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2011

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. approximate string matching
  2. bit-parallelism
  3. sequence alignment

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '11
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)0
Reflects downloads up to 22 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Technology dictates algorithms: recent developments in read alignmentGenome Biology10.1186/s13059-021-02443-722:1Online publication date: 26-Aug-2021
  • (2019)Accelerating Large-Scale Genome-Wide Association Studies With Graphics ProcessorsBiotechnology10.4018/978-1-5225-8903-7.ch017(428-461)Online publication date: 2019
  • (2018)String Similarity Search: A Hash-Based ApproachIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2017.275693230:1(170-184)Online publication date: 1-Jan-2018
  • (2017)Massively Parallel Processing of Whole Genome Sequence DataProceedings of the 2017 ACM International Conference on Management of Data10.1145/3035918.3064048(187-202)Online publication date: 9-May-2017
  • (2015)Local FilteringProceedings of the 2015 ACM SIGMOD International Conference on Management of Data10.1145/2723372.2749445(377-392)Online publication date: 27-May-2015
  • (2015)Rethinking Data-Intensive Science Using Scalable Analytics SystemsProceedings of the 2015 ACM SIGMOD International Conference on Management of Data10.1145/2723372.2742787(631-646)Online publication date: 27-May-2015
  • (2014)Accelerating Large-Scale Genome-Wide Association Studies with Graphics ProcessorsBig Data Management, Technologies, and Applications10.4018/978-1-4666-4699-5.ch014(349-380)Online publication date: 2014
  • (2014)State-of-the-art in string similarity search and joinACM SIGMOD Record10.1145/2627692.262770643:1(64-76)Online publication date: 13-May-2014
  • (2013)A hybrid short read mapping acceleratorBMC Bioinformatics10.1186/1471-2105-14-6714:1Online publication date: 26-Feb-2013
  • (2013)Asymmetric signature schemes for efficient exact edit similarity query processingACM Transactions on Database Systems10.1145/2508020.250802338:3(1-44)Online publication date: 5-Sep-2013
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media