research-article

Temporal rules discovery for web data cleaning

Editors: Surajit Chaudhuri, Jayant Haritsa

Authors: Ziawasch Abedjan, Cuneyt G. Akcora, Mourad Ouzzani, Michael Stonebraker

Proceedings of the VLDB Endowment, Volume 9, Issue 4

Pages 336 - 347

https://doi.org/10.14778/2856318.2856328

Published: 01 December 2015

Abstract

Declarative rules, such as functional dependencies, are widely used for cleaning data. Several systems take them as input for detecting errors and computing a "clean" version of the data. To support domain experts in specifying these rules, several tools have been proposed to profile the data and mine rules. However, existing discovery techniques have traditionally ignored the time dimension. Recurrent events, such as persons reported in locations, are valid only for a certain duration, and this duration should be part of the rules; otherwise the cleaning process would simply fail.
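To make the role of the duration concrete, the following is a minimal sketch and not the paper's algorithm. It assumes hypothetical facts of the form (person, location, timestamp) and a hypothetical one-day validity window, and contrasts a plain functional dependency person -> location with a time-aware variant that only flags conflicting reports that fall close together in time.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import combinations


def fd_violations(facts):
    """Pairs of facts that violate the plain FD person -> location (time ignored)."""
    by_person = defaultdict(list)
    for fact in facts:
        by_person[fact[0]].append(fact)
    return [
        (a, b)
        for group in by_person.values()
        for a, b in combinations(group, 2)
        if a[1] != b[1]
    ]


def temporal_violations(facts, window):
    """Pairs that disagree on location AND were reported within `window` of each other."""
    return [(a, b) for a, b in fd_violations(facts) if abs(a[2] - b[2]) < window]


# Hypothetical example data: (person, location, time of the report).
facts = [
    ("alice", "Paris", datetime(2015, 1, 1, 9)),
    ("alice", "Tokyo", datetime(2015, 1, 1, 10)),   # one hour later: suspicious
    ("alice", "Boston", datetime(2015, 3, 1, 9)),   # two months later: plausible move
]

plain = fd_violations(facts)
temporal = temporal_violations(facts, timedelta(days=1))
print(len(plain), "violations without time,", len(temporal), "with a one-day window")
```

Under these assumptions the plain dependency treats every change of location as an error, while the time-aware rule keeps only the Paris/Tokyo conflict. The window is hard-coded here for illustration; choosing it from the data is the discovery problem the paper addresses.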

In this work, we study the rule discovery problem for temporal web data. Such a discovery process is challenging because of the nature of web data; extracted facts are (i) sparse over time, (ii) reported with delays, and (iii) often reported with erroneous values because of inaccurate sources or non-robust extractors. We handle these challenges with a new discovery approach that is more robust to noise. Our solution uses machine learning methods, such as association measures and outlier detection, for the discovery of the rules, together with an aggressive repair of the data in the mining step itself. Our experimental evaluation over real-world data from Recorded Future, an intelligence company that monitors over 700K Web sources, shows that temporal rules improve the quality of the data, increasing the average precision of the cleaning process from 0.37 to 0.84 and yielding a 40% relative increase in the average F-measure.
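The abstract does not say which association measure is used, so the sketch below is only an illustration of the general idea. It computes normalized pointwise mutual information (NPMI), one common association measure, over hypothetical (role, company) value pairs; the function name, the data, and the choice of NPMI are all assumptions.

```python
import math
from collections import Counter


def npmi(pairs, x, y):
    """Normalized pointwise mutual information of values x and y, in [-1, 1].

    `pairs` is a list of co-occurring (left_value, right_value) observations.
    Returns -1.0 when x and y never co-occur and 1.0 when they always do.
    """
    n = len(pairs)
    joint = Counter(pairs)[(x, y)] / n
    if joint == 0:
        return -1.0
    if joint == 1.0:
        return 1.0
    p_x = sum(1 for a, _ in pairs if a == x) / n
    p_y = sum(1 for _, b in pairs if b == y) / n
    return math.log(joint / (p_x * p_y)) / -math.log(joint)


# Hypothetical extracted facts: (role, company).
pairs = (
    [("ceo", "acme")] * 3
    + [("ceo", "globex")]          # a single, possibly erroneous report
    + [("intern", "globex")] * 4
)

strong = npmi(pairs, "ceo", "acme")
weak = npmi(pairs, "ceo", "globex")
print(f"ceo/acme: {strong:.2f}, ceo/globex: {weak:.2f}")
```

A positive score suggests that two values co-occur more often than chance and could support a candidate rule, while a negative score marks a pairing that may be noise. How such scores are thresholded or combined with outlier detection is not specified in the abstract.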

Temporal rules discovery for web data cleaning
1. Information systems
  1. Information systems applications

Published In

Proceedings of the VLDB Endowment, Volume 9, Issue 4

December 2015

156 pages

ISSN: 2150-8097

Editors: Surajit Chaudhuri (Microsoft Research), Jayant Haritsa (I.I.Sc. Bangalore)

Publisher

VLDB Endowment

Bibliometrics

Total Citations: 24
Total Downloads: 320
Downloads (Last 12 months): 20
Downloads (Last 6 weeks): 1

Reflects downloads up to 09 Jan 2025

Cited By

Yan M, Wang Y, Wang Y, Miao X, Li J (2024). GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models. Proceedings of the ACM on Management of Data 2:6 (1-29). Online publication date: 20-Dec-2024.
https://dl.acm.org/doi/10.1145/3698811
Zylinski A, Qahtan A (2024). SynODC: Utilizing the Syntactic Structure for Outlier Detection in Categorical Attributes. Machine Learning and Knowledge Discovery in Databases. Research Track (213-229). Online publication date: 8-Sep-2024.
https://dl.acm.org/doi/10.1007/978-3-031-70359-1_13
Shrestha R, Habibelahian O, Termehchy A, Papotti P (2023). Exploratory Training: When Annotators Learn About Data. Proceedings of the ACM on Management of Data 1:2 (1-25). Online publication date: 20-Jun-2023.
https://dl.acm.org/doi/10.1145/3589280
Zhang A, Hu S, Zong C, Li J, Xia X (2023). Computing Maximal Likelihood Subset Repair for Inconsistent Data. Web and Big Data (1-15). Online publication date: 6-Oct-2023.
https://dl.acm.org/doi/10.1007/978-981-97-2390-4_1
Song S, Gao F, Zhang A, Wang J, Yu P (2021). Stream Data Cleaning under Speed and Acceleration Constraints. ACM Transactions on Database Systems 46:3 (1-44). Online publication date: 28-Sep-2021.
https://dl.acm.org/doi/10.1145/3465740
Miao Z, Li Y, Wang X, Li G, Li Z, Idreos S, Srivastava D (2021). Rotom. Proceedings of the 2021 International Conference on Management of Data (1303-1316). Online publication date: 9-Jun-2021.
https://dl.acm.org/doi/10.1145/3448016.3457258
Alipourlangouri M, Li G, Li Z, Idreos S, Srivastava D (2021). Temporal Dependencies for Graphs. Proceedings of the 2021 International Conference on Management of Data (2881-2883). Online publication date: 9-Jun-2021.
https://dl.acm.org/doi/10.1145/3448016.3450586
Qahtan A, Tang N, Ouzzani M, Cao Y, Stonebraker M (2020). Pattern functional dependencies for data cleaning. Proceedings of the VLDB Endowment 13:5 (684-697). Online publication date: 19-Feb-2020.
https://dl.acm.org/doi/10.14778/3377369.3377377
Ahmadi N, Truong T, Dao L, Ortona S, Papotti P (2020). RuleHub. Journal of Data and Information Quality 12:4 (1-22). Online publication date: 15-Oct-2020.
https://dl.acm.org/doi/10.1145/3409384
Visengeriyeva L, Abedjan Z (2020). Anatomy of Metadata for Data Curation. Journal of Data and Information Quality 12:3 (1-30). Online publication date: 13-Jun-2020.
https://dl.acm.org/doi/10.1145/3371925
