research-article

Sampling dirty data for matching attributes

Authors:

Henning Köhler,

Kerry Taylor

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 63 - 74

https://doi.org/10.1145/1807167.1807177

Published: 06 June 2010

Abstract

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
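The abstract does not define the proposed measures, so the following is only a minimal sketch of the general idea, not the authors' method. It contrasts an exact value-overlap measure with one that also counts approximately equal strings. Everything in it is an assumption made for illustration: the function names (`qgrams`, `string_sim`, `exact_containment`, `soft_containment`), the choice of q-gram Jaccard as the string similarity, and the threshold of 0.6.

```python
def qgrams(s, q=2):
    """Return the set of character q-grams of a lower-cased, padded string."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def string_sim(a, b, q=2):
    """Jaccard similarity of the q-gram sets of two strings (illustrative choice)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def exact_containment(xs, ys):
    """Fraction of distinct values in xs that occur verbatim in ys."""
    xs, ys = set(xs), set(ys)
    return len(xs & ys) / len(xs) if xs else 0.0


def soft_containment(xs, ys, threshold=0.6, q=2):
    """Fraction of distinct values in xs with a sufficiently similar value in ys.

    The threshold is a hypothetical tuning parameter, not taken from the paper.
    """
    xs, ys = set(xs), set(ys)
    if not xs:
        return 0.0
    matched = sum(
        1 for x in xs if any(string_sim(x, y, q) >= threshold for y in ys)
    )
    return matched / len(xs)
```

Applied to two small value samples such as `["New York", "Los Angeles", "Chicago"]` and `["new york", "Los Angles", "Chicago", "Boston"]`, the exact measure finds only one shared value, while the soft measure also matches the case variant and the misspelling. The soft measure compares every value pair, so it costs far more than a set intersection, which is one way to picture the accuracy versus speed tradeoff the abstract mentions.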

Index Terms

Sampling dirty data for matching attributes
1. Information systems
  1. Data management systems
    1. Database design and models

Information & Contributors

Information

Published In

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

June 2010

1286 pages

ISBN: 9781450300322

DOI: 10.1145/1807167

General Chair: Ahmed Elmagarmid, Purdue University, USA

Program Chair: Divyakant Agrawal, University of California at Santa Barbara, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Results Reproduced / v1.1

Qualifiers

Research-article

Conference

SIGMOD/PODS '10

Sponsor:

SIGMOD

SIGMOD/PODS '10: International Conference on Management of Data

June 6 - 10, 2010

Indianapolis, Indiana, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 22
Total Downloads: 741
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 3

Reflects downloads up to 13 Dec 2024

Citations

Cited By

Koehler H, Link S (2024). Entity/Relationship Profiling. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 5393-5396. Online publication date: 13-May-2024. https://doi.org/10.1109/ICDE60146.2024.00411
Wu F, Song H, Yin J, Gao L, Baldi M, Anand N (2021). NEMA: Automatic Integration of Large Network Management Databases. IEEE Transactions on Network and Service Management 18:3, pp. 3783-3797. Online publication date: Sep-2021. https://doi.org/10.1109/TNSM.2020.3036414
Ding G, Sun S, Wang G (2020). Schema matching based on SQL statements. Distributed and Parallel Databases 38:1, pp. 193-226. Online publication date: 1-Mar-2020. https://dl.acm.org/doi/10.1007/s10619-019-07268-9
He P, He J, Yao H, Li P, Ji Y (2019). Application of Data Distribution Technology in Smart Cities. Procedia Computer Science 162:C, pp. 324-330. Online publication date: 1-Jan-2019. https://dl.acm.org/doi/10.1016/j.procs.2019.11.291
Zhang J, Tay Y (2016). Dscaler. Proceedings of the VLDB Endowment 9:14, pp. 1671-1682. Online publication date: 1-Oct-2016. https://dl.acm.org/doi/10.14778/3007328.3007333
Ding G, Sun T (2015). Schema matching based on position of attribute in query statement. Knowledge-Based Systems 75:C, pp. 41-51. Online publication date: 1-Feb-2015. https://dl.acm.org/doi/10.1016/j.knosys.2014.11.005
Wang H (2014). Duplicate Record Detection for Data Integration. Innovative Techniques and Applications of Entity Resolution, pp. 339-358. Online publication date: 2014. https://doi.org/10.4018/978-1-4666-5198-2.ch014
Chen D, Konrad C, Yi K, Yu W, Zhang Q, Dyreson C, Li F, Özsu M (2014). Robust set reconciliation. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 135-146. Online publication date: 18-Jun-2014. https://dl.acm.org/doi/10.1145/2588555.2610528
Munir S, Khan F, Riaz M (2014). An instance based schema matching between opaque database schemas. 2014 4th International Conference on Engineering Technology and Technopreneuship (ICE2T), pp. 177-182. Online publication date: Aug-2014. https://doi.org/10.1109/ICE2T.2014.7006242
Zhu J, Yang Y, Xie Q, Wang L, Hassan S (2014). Robust hybrid name disambiguation framework for large databases. Scientometrics 98:3, pp. 2255-2274. Online publication date: 1-Mar-2014. https://dl.acm.org/doi/10.1007/s11192-013-1151-0
