More Web Proxy on the site http://driver.im/

research-article

Open access

Data X-Ray: A Diagnostic Tool for Data Errors

Authors:

Alexandra MeliouAuthors Info & Claims

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 1231 - 1245

https://doi.org/10.1145/2723372.2750549

Published: 27 May 2015 Publication History

Abstract

A lot of systems and applications are data-driven, and the correctness of their operation relies heavily on the correctness of their data. While existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that a lot of errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, in this paper we focus on data diagnosis: explaining where and how the errors happen in a data generative process.

We develop a large-scale diagnostic framework called DATA X-RAY. Our contributions are three-fold. First, we transform the diagnosis problem to the problem of finding common properties among erroneous elements, with minimal domain-specific assumptions. Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we design an efficient, highly-parallelizable algorithm for performing data diagnosis on large-scale data. We evaluate our cost model and algorithm using both real-world and synthetic data, and show that our diagnostic framework produces better diagnoses and is orders of magnitude more efficient than existing techniques.

References

[1]

S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Siméon, and S. Zohar. Tools for data translation and integration. IEEE Data Engineering Bulletin, 22(1):3--8, 1999.

[2]

Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting lipstick on pig: Enabling database-style workflow provenance. PVLDB, 5(4):346--357, Dec. 2011.

Digital Library

[3]

P. Arabie, J. D. Carroll, W. S. DeSarbo, and J. Wind. Overlapping clustering: A new method for product positioning. Journal of Marketing Research, 18:310--317, 1981.

[4]

A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and R. J. Mooney. Model-based overlapping clustering. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD, pages 532--537, New York, NY, USA, 2005. ACM.

Digital Library

[5]

C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Comput. Surv., 18(4):323--364, Dec. 1986.

Digital Library

[6]

G. Bender, L. Kot, and J. Gehrke. Explainable security for relational databases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 1411--1422, New York, NY, USA, 2014. ACM.

Digital Library

[7]

G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, Sept. 2010.

Digital Library

[8]

J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In International Conference on Machine Learning (ICML), Bellevue, Washington, June 2011.

[9]

P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001.

Digital Library

[10]

R. D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. In In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 345--353, 2000.

Digital Library

[11]

J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379--474, 2009.

Digital Library

[12]

X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469. IEEE Computer Society, 2013.

Digital Library

[13]

V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233--235, 1979.

Digital Library

[14]

G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 315--326. VLDB Endowment, 2007.

Digital Library

[15]

Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems, 25(2):179--227, 2000.

Digital Library

[16]

M. Cygan, L. Kowalik, and M. Wykurz. Exponential-time approximation of weighted set cover. Information Processing Letters, 109(16):957--961, July 2009.

Digital Library

[17]

M. Das, S. Amer-Yahia, G. Das, and C. Yu. Mri: Meaningful interpretations of collaborative ratings. PVLDB, 4(11):1063--1074, 2011.

Digital Library

[18]

A. Das Sarma, A. Jain, and D. Srivastava. I4e: Interactive investigation of iterative information extraction. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 795--806, New York, NY, USA, 2010. ACM.

Digital Library

[19]

S. B. Davidson and J. Freire. Provenance and scientific workflows: Challenges and opportunities. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 1345--1350, New York, NY, USA, 2008. ACM.

Digital Library

[20]

X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2014.

Digital Library

[21]

X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang. From data fusion to knowledge fusion. PVLDB, 2014.

Digital Library

[22]

X. L. Dong and F. Naumann. Data fusion--resolving data conflicts for integration. PVLDB, 2009.

Digital Library

[23]

D. Fabbri and K. LeFevre. Explanation-based auditing. PVLDB, 5(1):1--12, Sept. 2011.

Digital Library

[24]

A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.

Digital Library

[25]

W. Fan, F. Geerts, and X. Jia. A revival of integrity constraints for data cleaning. Proc. VLDB Endow., 1(2):1522--1523, Aug. 2008.

Digital Library

[26]

W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems, 33(2):6:1--6:48, June 2008.

Digital Library

[27]

T. A. Feo and M. G. C. Resende. A probabilistic heuristic for a computationally difficult set covering problem. Operations Research Letters, 8(2):67--71, Apr. 1989.

Digital Library

[28]

H. Galhardas, D. Florescu, D. Shasha, and E. Simon. Ajax: An extensible data cleaning tool. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD, page 590, New York, NY, USA, 2000. ACM.

Digital Library

[29]

M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.

Digital Library

[30]

K. E. Gebaly, P. Agrawal, L. Golab, F. Korn, and D. Srivastava. Interpretable and informative explanations of outcomes. PVLDB, 8(1):61--72, 2014.

Digital Library

[31]

L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1):376--390, Aug. 2008.

Digital Library

[32]

L. Golab, H. J. Karloff, F. Korn, and D. Srivastava. Data auditor: Exploring data quality and semantics using pattern tableaux. PVLDB, 3(2):1641--1644, 2010.

Digital Library

[33]

T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31--40, 2007.

Digital Library

[34]

A. Gruenheid, X. L. Dong, and D. Srivastava. Incremental record linkage. PVLDB, 7(9):697--708, 2014.

Digital Library

[35]

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: An update. SIGKDD Explor. Newsl., 11(1):10--18, Nov. 2009.

Digital Library

[36]

J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. 2012.

[37]

D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2):716--767, June 2006.

Digital Library

[38]

N. Khoussainova, M. Balazinska, and D. Suciu. Perfxplain: debugging mapreduce job performance. Proc. VLDB Endow., 5(7):598--609, Mar. 2012.

Digital Library

[39]

N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 802--803, New York, NY, USA, 2006. ACM.

Digital Library

[40]

J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.

[41]

Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed graphlab: A framework for machine learning and data mining in the cloud. PVLDB, 5(8):716--727, Apr. 2012.

Digital Library

[42]

Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.

Digital Library

[43]

A. Meliou, W. Gatterbauer, J. Halpern, C. Koch, K. F. Moore, and D. Suciu. Causality in databases. IEEE Data Engineering Bulletin, Sept. 2010.

[44]

A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB, 4(1):34--45, 2010.

Digital Library

[45]

A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. In SIGMOD Conference, pages 505--516, 2011.

Digital Library

[46]

A. Meliou, S. Roy, and D. Suciu. Causality and explanations in databases. PVLDB, 7(13):1715--1716, 2014.

Digital Library

[47]

K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

Digital Library

[48]

A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In In ICML, 2004.

Digital Library

[49]

C. Parent and S. Spaccapietra. Issues and approaches of database integration. Commununications of the ACM, 41(5):166--178, 1998.

Digital Library

[50]

D. Peleg. Approximation algorithms for the label-cover_max and red-blue set cover problems. Journal of Discrete Algorithms, (1):55--64, March 2007.

Digital Library

[51]

J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81--106, Mar. 1986.

[52]

E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334--350, Dec. 2001.

Digital Library

[53]

E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3--13, 2000.

[54]

V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 381--390, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

Digital Library

[55]

S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 1579--1590, New York, NY, USA, 2014. ACM.

Digital Library

[56]

S. Thirumuruganathan, M. Das, S. Desai, S. Amer-Yahia, G. Das, and C. Yu. Maprat: meaningful explanation, interactive exploration and geo-visualization of collaborative ratings. PVLDB, 5(12):1986--1989, Aug. 2012.

Digital Library

[57]

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267--288, 1996.

[58]

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, NAACL, pages 173--180, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.

Digital Library

[59]

US Department of Transportation, Federal Highway Administration. Freeway incident and weather data. http://portal.its.pdx.edu/Portal/index.php/fhwa.

[60]

E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, June 2013.

Digital Library

[61]

M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279--289, Feb. 2011.

Digital Library

[62]

X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In KDD, pages 1048--1052, New York, NY, USA, 2007. ACM.

Digital Library

[63]

B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han. A Bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550--561, 2012.

Digital Library

Cited By

Chen CMa WGao CZhang WZeng KYe TChen YDu X(2025)GaussDB-AISQL: a composable cloud-native SQL system with AI capabilitiesFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-024-40624-219:9Online publication date: 1-Sep-2025
https://dl.acm.org/doi/10.1007/s11704-024-40624-2
Gao HCai SDinh THuang ZOoi B(2024)CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine LearningProceedings of the ACM on Management of Data10.1145/36988312:6(1-27)Online publication date: 20-Dec-2024
https://dl.acm.org/doi/10.1145/3698831
Xue XZhang CWang YSong HXue XNiu Z(2024)Heterogeneous Views and Spatial Structure Enhancement for triple error detectionExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.124938256:COnline publication date: 5-Dec-2024
https://dl.acm.org/doi/10.1016/j.eswa.2024.124938
Show More Cited By

Index Terms

Data X-Ray: A Diagnostic Tool for Data Errors
1. Information systems
  1. Data management systems

Recommendations

An Enhanced Technique to Clean Data in the Data Warehouse
DESE '11: Proceedings of the 2011 Developments in E-systems Engineering

Data quality is a critical factor for the success of data warehousing projects. Improving the quality of data is important in data warehouse, because it is used in the process of decision support, which requires accurate data. There are many errors and ...
Dirty Data in the Newsroom: Comparing Data Preparation in Journalism and Data Science
CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems

The work involved in gathering, wrangling, cleaning, and otherwise preparing data for analysis is often the most time consuming and tedious aspect of data work. Although many studies describe data preparation within the context of data science workflows,...
ETDC: An Efficient Technique to Cleanse Data in the Data Warehouse
ICAIP '17: Proceedings of the International Conference on Advances in Image Processing

Data cleansing can be considered to be an activity that is performed on the data sets of the data warehouse. The cleansing is done in order to enhance and collectively maintain data consistency and quality. The quality of data has a strong impact on a ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN:9781450327589

DOI:10.1145/2723372

General Chair:
Timos Sellis
RMIT University, Australia
,
Program Chairs:
Susan B. Davidson
University of Pennsylvania, USA
,
Zack Ives
University of Pennsylvania, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2015

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Victoria, Melbourne, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

59
Total Citations
View Citations
1,693
Total Downloads

Downloads (Last 12 months)235
Downloads (Last 6 weeks)41

Reflects downloads up to 03 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Chen CMa WGao CZhang WZeng KYe TChen YDu X(2025)GaussDB-AISQL: a composable cloud-native SQL system with AI capabilitiesFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-024-40624-219:9Online publication date: 1-Sep-2025
https://dl.acm.org/doi/10.1007/s11704-024-40624-2
Gao HCai SDinh THuang ZOoi B(2024)CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine LearningProceedings of the ACM on Management of Data10.1145/36988312:6(1-27)Online publication date: 20-Dec-2024
https://dl.acm.org/doi/10.1145/3698831
Xue XZhang CWang YSong HXue XNiu Z(2024)Heterogeneous Views and Spatial Structure Enhancement for triple error detectionExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.124938256:COnline publication date: 5-Dec-2024
https://dl.acm.org/doi/10.1016/j.eswa.2024.124938
AlOmeir OLai EMilani MPottinger R(2023)Summarizing Provenance of Aggregate Query Results in Relational DatabasesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.326584035:10(10695-10709)Online publication date: 1-Oct-2023
https://doi.org/10.1109/TKDE.2023.3265840
Chen YHuang S(2023)TSExplain: Explaining Aggregated Time Series by Surfacing Evolving Contributors2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00060(708-720)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00060
Shaowang TZhao JSintos SKrishnan S(2022)Towards causal physical error discovery in video analytics systemsProceedings of the Workshop on Human-In-the-Loop Data Analytics10.1145/3546930.3547495(1-6)Online publication date: 12-Jun-2022
https://dl.acm.org/doi/10.1145/3546930.3547495
Galhotra SFariha ALourenço RFreire JMeliou ASrivastava DIves ZBonifati AEl Abbadi A(2022)DataPrism: Exposing Disconnect between Data and SystemsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517864(217-231)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517864
Flokas LWu WLiu YWang JVerma NWu EIves ZBonifati AEl Abbadi A(2022)Complaint-Driven Training Data Debugging at Interactive SpeedsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517849(369-383)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517849
Xue BZou L(2022)Knowledge Graph Quality Management: a Comprehensive SurveyIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3150080(1-1)Online publication date: 2022
https://doi.org/10.1109/TKDE.2022.3150080
Chai CWang JLuo YNiu ZLi G(2022)Data Management for Machine Learning: A SurveyIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3148237(1-1)Online publication date: 2022
https://doi.org/10.1109/TKDE.2022.3148237
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten