HoloDetect: Few-Shot Learning for Error Detection

Authors:

Alireza Heidari,

Joshua McGrath,

Theodoros Rekatsinas

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

Pages 829 - 846

https://doi.org/10.1145/3299869.3319888

Published: 25 June 2019

Abstract

We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches.
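The data augmentation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a handful of hypothetical (clean, dirty) value pairs are available, extracts substring edits from them with Python's standard difflib module, treats the empirical frequencies of those edits as a simple augmentation policy, and applies sampled edits to clean seed values to synthesize examples labeled as errors. The function names extract_transformations, learn_policy, and augment, and all example values, are assumptions made for illustration.

```python
import random
from collections import Counter
from difflib import SequenceMatcher


def extract_transformations(clean, dirty):
    """Return the (old, new) substring edits that turn `clean` into `dirty`."""
    matcher = SequenceMatcher(None, clean, dirty, autojunk=False)
    return [
        (clean[i1:i2], dirty[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def learn_policy(pairs):
    """Estimate an empirical distribution over edits from (clean, dirty) pairs."""
    counts = Counter(
        edit
        for clean, dirty in pairs
        for edit in extract_transformations(clean, dirty)
    )
    total = sum(counts.values())
    return {edit: count / total for edit, count in counts.items()}


def augment(clean_values, policy, n, seed=0):
    """Draw n (value, edit) samples and return the resulting synthetic errors.

    Each returned item is (corrupted_value, True). Samples whose edit does not
    apply to the chosen value are skipped, so fewer than n items may come back.
    """
    rng = random.Random(seed)
    edits, weights = zip(*policy.items())
    examples = []
    for _ in range(n):
        value = rng.choice(clean_values)
        old, new = rng.choices(edits, weights=weights, k=1)[0]
        if old == "":
            # Pure insertion: place the new text at a random position.
            pos = rng.randrange(len(value) + 1)
            corrupted = value[:pos] + new + value[pos:]
        elif old in value:
            # Replacement or deletion of the first occurrence.
            corrupted = value.replace(old, new, 1)
        else:
            continue
        if corrupted != value:
            examples.append((corrupted, True))
    return examples


# Hypothetical observed errors and clean seed values.
observed_pairs = [("Chicago", "Chicag0"), ("60612", "6O612")]
policy = learn_policy(observed_pairs)
synthetic_errors = augment(["Boston", "10001"], policy, n=20, seed=1)
print(policy)
print(synthetic_errors)
```

In a full pipeline, the synthetic errors would be combined with the clean seed values (labeled as correct) to train a classifier; that step is omitted here.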

Published In

SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

June 2019

2106 pages

ISBN: 9781450356435

DOI: 10.1145/3299869

General Chairs: Peter Boncz (CWI & Vrije Universiteit Amsterdam, The Netherlands), Stefan Manegold (CWI & Universiteit Leiden, The Netherlands)

Program Chairs: Anastasia Ailamaki (EPFL, Switzerland), Amol Deshpande (University of Maryland, USA), Tim Kraska (MIT, USA)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '19

Sponsor:

SIGMOD

SIGMOD/PODS '19: International Conference on Management of Data

June 30 - July 5, 2019

Amsterdam, Netherlands

Acceptance Rates

SIGMOD '19 Paper Acceptance Rate 88 of 430 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Article Metrics

Total Citations: 78

Total Downloads: 2,198

Downloads (Last 12 months): 437

Downloads (Last 6 weeks): 59

Reflects downloads up to 05 Mar 2025
