[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1109/ICDE.2011.5767856guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Answering approximate string queries on large data sets using external memory

Published: 11 April 2011 Publication History

Abstract

An approximate string query is to find from a collection of strings those that are similar to a given query string. Answering such queries is important in many applications such as data cleaning and record linkage, where errors could occur in queries as well as the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting, by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, BED-tree.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
ICDE '11: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
April 2011
1457 pages
ISBN:9781424489596

Publisher

IEEE Computer Society

United States

Publication History

Published: 11 April 2011

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2018)Open data integrationProceedings of the VLDB Endowment10.14778/3229863.324049111:12(2130-2139)Online publication date: 1-Aug-2018
  • (2018)ZigZagProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3196936(873-888)Online publication date: 27-May-2018
  • (2017)A unified framework for string similarity search with edit-distance constraintThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-016-0449-y26:2(249-274)Online publication date: 1-Apr-2017
  • (2015)An effective candidate generation method for improving performance of edit similarity query processingInformation Systems10.1016/j.is.2014.07.00547:C(116-128)Online publication date: 1-Jan-2015
  • (2014)String similarity joinsProceedings of the VLDB Endowment10.14778/2732296.27322997:8(625-636)Online publication date: 1-Apr-2014
  • (2013)Asymmetric signature schemes for efficient exact edit similarity query processingACM Transactions on Database Systems10.1145/2508020.250802338:3(1-44)Online publication date: 5-Sep-2013
  • (2013)A partition-based method for string similarity joins with edit-distance constraintsACM Transactions on Database Systems10.1145/2487259.248726138:2(1-33)Online publication date: 4-Jul-2013
  • (2013)Efficient parallel partition-based algorithms for similarity search and join with edit distance constraintsProceedings of the Joint EDBT/ICDT 2013 Workshops10.1145/2457317.2457382(341-348)Online publication date: 18-Mar-2013
  • (2012)ASTERIXProceedings of the Ninth International Workshop on Information Integration on the Web10.1145/2331801.2331803(1-4)Online publication date: 20-May-2012
  • (2012)Efficient range queries over uncertain stringsProceedings of the 24th international conference on Scientific and Statistical Database Management10.1007/978-3-642-31235-9_5(75-95)Online publication date: 25-Jun-2012

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media