research-article

Temporal rules discovery for web data cleaning

Published: 01 December 2015

Abstract

Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, have a duration in which they are valid, and this duration should be part of the rules; otherwise the cleaning process would simply fail.
In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data: extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with errors in the values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data, with an increase of the average precision in the cleaning process from 0.37 to 0.84 and a 40% relative increase in the average F-measure.
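To make the role of the duration concrete, the following is a minimal sketch of how a rule such as "a person is in only one location at a time" behaves with and without a time window. It is an illustration only, not the discovery or repair algorithm of the paper: the `Fact` type, the `temporal_violations` helper, the sample facts, and the 6-hour window are all assumptions introduced here.

```python
from datetime import datetime, timedelta
from itertools import combinations
from typing import NamedTuple


class Fact(NamedTuple):
    """One extracted fact: a person reported in a location at some time (hypothetical schema)."""
    person: str
    location: str
    reported_at: datetime


def temporal_violations(facts, window):
    """Return pairs of facts that place the same person in two different
    locations less than `window` apart."""
    violations = []
    ordered = sorted(facts, key=lambda f: f.reported_at)
    for a, b in combinations(ordered, 2):
        same_person = a.person == b.person
        different_location = a.location != b.location
        close_in_time = b.reported_at - a.reported_at < window
        if same_person and different_location and close_in_time:
            violations.append((a, b))
    return violations


facts = [
    Fact("Alice", "Paris", datetime(2015, 3, 1, 9, 0)),
    Fact("Alice", "Tokyo", datetime(2015, 3, 1, 10, 0)),  # one hour after Paris
    Fact("Alice", "Rome", datetime(2015, 3, 20, 9, 0)),   # weeks later
    Fact("Bob", "Oslo", datetime(2015, 3, 1, 9, 0)),
]

# Without a duration, every pair of differing locations for a person is flagged.
plain = temporal_violations(facts, timedelta.max)

# With an assumed 6-hour validity window, only near-simultaneous reports conflict.
temporal = temporal_violations(facts, timedelta(hours=6))

print(len(plain), len(temporal))
```

In this sketch the untimed rule flags all three pairs of Alice's locations, while the windowed rule flags only the Paris/Tokyo pair, which is the kind of false alarm a duration-aware rule is meant to avoid. Choosing the window by hand, as done here, is exactly the step that rule discovery would aim to automate.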


    Published In

    Proceedings of the VLDB Endowment  Volume 9, Issue 4
    December 2015
    156 pages
    ISSN:2150-8097

    Publisher

    VLDB Endowment
