[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2723372.2750549acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

Data X-Ray: A Diagnostic Tool for Data Errors

Published: 27 May 2015 Publication History

Abstract

A lot of systems and applications are data-driven, and the correctness of their operation relies heavily on the correctness of their data. While existing data cleaning techniques can be quite effective at purging datasets of errors, they disregard the fact that a lot of errors are systematic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, in this paper we focus on data diagnosis: explaining where and how the errors happen in a data generative process.
We develop a large-scale diagnostic framework called DATA X-RAY. Our contributions are three-fold. First, we transform the diagnosis problem to the problem of finding common properties among erroneous elements, with minimal domain-specific assumptions. Second, we use Bayesian analysis to derive a cost model that implements three intuitive principles of good diagnoses. Third, we design an efficient, highly-parallelizable algorithm for performing data diagnosis on large-scale data. We evaluate our cost model and algorithm using both real-world and synthetic data, and show that our diagnostic framework produces better diagnoses and is orders of magnitude more efficient than existing techniques.

References

[1]
S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Siméon, and S. Zohar. Tools for data translation and integration. IEEE Data Engineering Bulletin, 22(1):3--8, 1999.
[2]
Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting lipstick on pig: Enabling database-style workflow provenance. PVLDB, 5(4):346--357, Dec. 2011.
[3]
P. Arabie, J. D. Carroll, W. S. DeSarbo, and J. Wind. Overlapping clustering: A new method for product positioning. Journal of Marketing Research, 18:310--317, 1981.
[4]
A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and R. J. Mooney. Model-based overlapping clustering. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD, pages 532--537, New York, NY, USA, 2005. ACM.
[5]
C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Comput. Surv., 18(4):323--364, Dec. 1986.
[6]
G. Bender, L. Kot, and J. Gehrke. Explainable security for relational databases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 1411--1422, New York, NY, USA, 2014. ACM.
[7]
G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, Sept. 2010.
[8]
J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In International Conference on Machine Learning (ICML), Bellevue, Washington, June 2011.
[9]
P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316--330, 2001.
[10]
R. D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. In In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 345--353, 2000.
[11]
J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379--474, 2009.
[12]
X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469. IEEE Computer Society, 2013.
[13]
V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233--235, 1979.
[14]
G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 315--326. VLDB Endowment, 2007.
[15]
Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems, 25(2):179--227, 2000.
[16]
M. Cygan, L. Kowalik, and M. Wykurz. Exponential-time approximation of weighted set cover. Information Processing Letters, 109(16):957--961, July 2009.
[17]
M. Das, S. Amer-Yahia, G. Das, and C. Yu. Mri: Meaningful interpretations of collaborative ratings. PVLDB, 4(11):1063--1074, 2011.
[18]
A. Das Sarma, A. Jain, and D. Srivastava. I4e: Interactive investigation of iterative information extraction. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 795--806, New York, NY, USA, 2010. ACM.
[19]
S. B. Davidson and J. Freire. Provenance and scientific workflows: Challenges and opportunities. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 1345--1350, New York, NY, USA, 2008. ACM.
[20]
X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2014.
[21]
X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang. From data fusion to knowledge fusion. PVLDB, 2014.
[22]
X. L. Dong and F. Naumann. Data fusion--resolving data conflicts for integration. PVLDB, 2009.
[23]
D. Fabbri and K. LeFevre. Explanation-based auditing. PVLDB, 5(1):1--12, Sept. 2011.
[24]
A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
[25]
W. Fan, F. Geerts, and X. Jia. A revival of integrity constraints for data cleaning. Proc. VLDB Endow., 1(2):1522--1523, Aug. 2008.
[26]
W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems, 33(2):6:1--6:48, June 2008.
[27]
T. A. Feo and M. G. C. Resende. A probabilistic heuristic for a computationally difficult set covering problem. Operations Research Letters, 8(2):67--71, Apr. 1989.
[28]
H. Galhardas, D. Florescu, D. Shasha, and E. Simon. Ajax: An extensible data cleaning tool. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD, page 590, New York, NY, USA, 2000. ACM.
[29]
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.
[30]
K. E. Gebaly, P. Agrawal, L. Golab, F. Korn, and D. Srivastava. Interpretable and informative explanations of outcomes. PVLDB, 8(1):61--72, 2014.
[31]
L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1):376--390, Aug. 2008.
[32]
L. Golab, H. J. Karloff, F. Korn, and D. Srivastava. Data auditor: Exploring data quality and semantics using pattern tableaux. PVLDB, 3(2):1641--1644, 2010.
[33]
T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31--40, 2007.
[34]
A. Gruenheid, X. L. Dong, and D. Srivastava. Incremental record linkage. PVLDB, 7(9):697--708, 2014.
[35]
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: An update. SIGKDD Explor. Newsl., 11(1):10--18, Nov. 2009.
[36]
J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. 2012.
[37]
D. V. Kalashnikov and S. Mehrotra. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems, 31(2):716--767, June 2006.
[38]
N. Khoussainova, M. Balazinska, and D. Suciu. Perfxplain: debugging mapreduce job performance. Proc. VLDB Endow., 5(7):598--609, Mar. 2012.
[39]
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: Similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 802--803, New York, NY, USA, 2006. ACM.
[40]
J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[41]
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed graphlab: A framework for machine learning and data mining in the cloud. PVLDB, 5(8):716--727, Apr. 2012.
[42]
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.
[43]
A. Meliou, W. Gatterbauer, J. Halpern, C. Koch, K. F. Moore, and D. Suciu. Causality in databases. IEEE Data Engineering Bulletin, Sept. 2010.
[44]
A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB, 4(1):34--45, 2010.
[45]
A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. In SIGMOD Conference, pages 505--516, 2011.
[46]
A. Meliou, S. Roy, and D. Suciu. Causality and explanations in databases. PVLDB, 7(13):1715--1716, 2014.
[47]
K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[48]
A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In In ICML, 2004.
[49]
C. Parent and S. Spaccapietra. Issues and approaches of database integration. Commununications of the ACM, 41(5):166--178, 1998.
[50]
D. Peleg. Approximation algorithms for the label-covermax and red-blue set cover problems. Journal of Discrete Algorithms, (1):55--64, March 2007.
[51]
J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81--106, Mar. 1986.
[52]
E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334--350, Dec. 2001.
[53]
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3--13, 2000.
[54]
V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 381--390, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[55]
S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD, pages 1579--1590, New York, NY, USA, 2014. ACM.
[56]
S. Thirumuruganathan, M. Das, S. Desai, S. Amer-Yahia, G. Das, and C. Yu. Maprat: meaningful explanation, interactive exploration and geo-visualization of collaborative ratings. PVLDB, 5(12):1986--1989, Aug. 2012.
[57]
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267--288, 1996.
[58]
K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, NAACL, pages 173--180, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
[59]
US Department of Transportation, Federal Highway Administration. Freeway incident and weather data. http://portal.its.pdx.edu/Portal/index.php/fhwa.
[60]
E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, June 2013.
[61]
M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279--289, Feb. 2011.
[62]
X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the web. In KDD, pages 1048--1052, New York, NY, USA, 2007. ACM.
[63]
B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han. A Bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550--561, 2012.

Cited By

View all
  • (2024)Heterogeneous Views and Spatial Structure Enhancement for triple error detectionExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.124938256:COnline publication date: 5-Dec-2024
  • (2023)Summarizing Provenance of Aggregate Query Results in Relational DatabasesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.326584035:10(10695-10709)Online publication date: 1-Oct-2023
  • (2023)TSExplain: Explaining Aggregated Time Series by Surfacing Evolving Contributors2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00060(708-720)Online publication date: Apr-2023
  • Show More Cited By

Index Terms

  1. Data X-Ray: A Diagnostic Tool for Data Errors

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
    May 2015
    2110 pages
    ISBN:9781450327589
    DOI:10.1145/2723372
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 May 2015

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. data cleaning
    2. data profiling
    3. error diagnosis

    Qualifiers

    • Research-article

    Funding Sources

    Conference

    SIGMOD/PODS'15
    Sponsor:
    SIGMOD/PODS'15: International Conference on Management of Data
    May 31 - June 4, 2015
    Victoria, Melbourne, Australia

    Acceptance Rates

    SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)233
    • Downloads (Last 6 weeks)34
    Reflects downloads up to 13 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Heterogeneous Views and Spatial Structure Enhancement for triple error detectionExpert Systems with Applications: An International Journal10.1016/j.eswa.2024.124938256:COnline publication date: 5-Dec-2024
    • (2023)Summarizing Provenance of Aggregate Query Results in Relational DatabasesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.326584035:10(10695-10709)Online publication date: 1-Oct-2023
    • (2023)TSExplain: Explaining Aggregated Time Series by Surfacing Evolving Contributors2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00060(708-720)Online publication date: Apr-2023
    • (2022)Towards causal physical error discovery in video analytics systemsProceedings of the Workshop on Human-In-the-Loop Data Analytics10.1145/3546930.3547495(1-6)Online publication date: 12-Jun-2022
    • (2022)DataPrism: Exposing Disconnect between Data and SystemsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517864(217-231)Online publication date: 10-Jun-2022
    • (2022)Complaint-Driven Training Data Debugging at Interactive SpeedsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517849(369-383)Online publication date: 10-Jun-2022
    • (2022)Knowledge Graph Quality Management: a Comprehensive SurveyIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3150080(1-1)Online publication date: 2022
    • (2022)Data Management for Machine Learning: A SurveyIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3148237(1-1)Online publication date: 2022
    • (2022)BugDocThe VLDB Journal10.1007/s00778-022-00733-532:1(75-101)Online publication date: 23-Feb-2022
    • (2022)CIRS: A Confidence Interval Radius Slope Method for Time Series Points Based on Unsupervised LearningData Science10.1007/978-981-19-5194-7_23(310-325)Online publication date: 10-Aug-2022
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media