DOI: 10.1007/978-3-319-91458-9_22
Article

Source Selection for Inconsistency Detection

Published: 21 May 2018

Abstract

Inconsistencies in a database can be detected based on violations of integrity constraints, such as functional dependencies (FDs). In the big data era, the availability of many related data sources makes it possible to detect inconsistencies more extensively. That is, even when no violations exist within a single data set D, we can leverage other data sources to discover potential violations. A significant challenge for violation detection based on data sources is that accessing too many sources incurs a huge cost, while involving too few may miss serious violations. Motivated by this, we investigate how to select a proper subset of sources for inconsistency detection. To address this problem, we formulate a gain model for sources and introduce the optimization problem of source selection, called SSID, in which the gain is maximized while the cost stays under a threshold. We show that the SSID problem is NP-hard and propose a greedy approximation approach for it. To avoid accessing data sources, we also present a randomized technique for gain estimation with theoretical guarantees. Experimental results on both real and synthetic data show that our algorithm is both effective and efficient.
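
The selection problem described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's gain model, so everything here is an assumption made for illustration: the gain of a set of sources is taken to be the number of tuples in D that conflict with some selected source on a single FD (left-hand side determines right-hand side), each source has a fixed positive access cost, and the greedy rule picks the affordable source with the best marginal gain per unit cost. The names `violating_rows` and `greedy_select` are hypothetical, and this is a generic budgeted greedy heuristic, not the authors' algorithm.

```python
from typing import Dict, List, Set, Tuple

# A row is (lhs value, rhs value) for one FD lhs -> rhs (assumed single FD).
Row = Tuple[str, str]


def violating_rows(d: List[Row], source: List[Row]) -> Set[int]:
    """Indices of rows in d whose lhs value occurs in source with a different rhs."""
    seen: Dict[str, Set[str]] = {}
    for lhs, rhs in source:
        seen.setdefault(lhs, set()).add(rhs)
    return {i for i, (lhs, rhs) in enumerate(d) if seen.get(lhs, set()) - {rhs}}


def greedy_select(
    d: List[Row],
    sources: Dict[str, List[Row]],
    costs: Dict[str, float],
    budget: float,
) -> Tuple[List[str], Set[int]]:
    """Pick sources by marginal gain per unit cost until the budget is exhausted.

    Assumed gain: number of rows of d flagged as violating by the chosen sources.
    Costs are assumed to be strictly positive.
    """
    # For illustration the per-source violations are computed exactly; the
    # abstract mentions estimating gain without accessing the sources.
    covered = {name: violating_rows(d, rows) for name, rows in sources.items()}
    chosen: List[str] = []
    detected: Set[int] = set()
    spent = 0.0
    remaining = set(sources)
    while remaining:
        best, best_ratio = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[name] > budget:
                continue  # source does not fit in the remaining budget
            ratio = len(covered[name] - detected) / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:  # nothing affordable adds any new violation
            break
        chosen.append(best)
        detected |= covered[best]
        spent += costs[best]
        remaining.remove(best)
    return chosen, detected


# Made-up example: D is consistent on its own for the FD zip -> city.
d = [("10001", "NYC"), ("94105", "SF"), ("60601", "Chicago"), ("02139", "Cambridge")]
sources = {
    "A": [("10001", "New York"), ("94105", "San Francisco")],
    "B": [("10001", "Manhattan")],
    "C": [("60601", "Chicago"), ("02139", "Boston")],
}
costs = {"A": 2.0, "B": 1.0, "C": 1.0}

chosen, detected = greedy_select(d, sources, costs, budget=3.0)
print(chosen, sorted(detected))
```

In this made-up example, source B only flags a row that A already covers, so after A is chosen its marginal gain is zero and the remaining budget goes to C. Under the assumed coverage-style gain, marginal gains can only shrink as more sources are added, which is the property that makes greedy selection a natural heuristic here.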



Published In

Database Systems for Advanced Applications: 23rd International Conference, DASFAA 2018, Gold Coast, QLD, Australia, May 21-24, 2018, Proceedings, Part II
May 2018
844 pages
ISBN: 978-3-319-91457-2
DOI: 10.1007/978-3-319-91458-9
Editors: Jian Pei, Yannis Manolopoulos, Shazia Sadiq, Jianxin Li

Publisher

Springer-Verlag

Berlin, Heidelberg


Qualifiers

  • Article
