DOI: 10.1145/3428757.3429129
short-paper

Unsupervised Evaluation of Data Integration Processes

Published: 27 January 2021

Abstract

Evaluating the quality of data integration processes usually requires onerous manual data inspections. This task is particularly heavy in real business scenarios, where the large amount of data makes checking all the tuples infeasible and frequent updates, i.e. changes in the sources and/or new sources, require the evaluation to be repeated over and over. Our idea is to address this issue by providing experts with an unsupervised measure, based on word frequencies, which quantifies how representative one dataset is of another, indicating how good the integration process is and whether deviations are occurring that call for a manual inspection. We also conducted preliminary experiments on shared datasets that show the effectiveness of the proposed measures in typical data integration scenarios.
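
To make the idea of a word-frequency-based representativeness score concrete, the following is a minimal sketch only. The abstract does not specify the actual measure, so the cosine similarity used here is a stand-in assumption, and the names `word_frequencies` and `representativeness` as well as the toy records are hypothetical.

```python
from collections import Counter
import math
import re


def word_frequencies(records):
    """Count lowercase word tokens over all field values of all records."""
    counts = Counter()
    for record in records:
        for value in record.values():
            counts.update(re.findall(r"\w+", str(value).lower()))
    return counts


def representativeness(source, integrated):
    """Cosine similarity of the two word-frequency vectors, in [0, 1].

    Assumption: cosine similarity stands in for the paper's measure,
    which the abstract does not define.
    """
    f_src = word_frequencies(source)
    f_int = word_frequencies(integrated)
    dot = sum(f_src[w] * f_int[w] for w in f_src.keys() & f_int.keys())
    norm = math.sqrt(sum(c * c for c in f_src.values())) * math.sqrt(
        sum(c * c for c in f_int.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy data: a source table and two candidate integration outputs.
source = [
    {"name": "Acme Corp", "city": "Trento"},
    {"name": "Beta Labs", "city": "Modena"},
]
faithful = [
    {"company": "Acme Corp", "location": "Trento"},
    {"company": "Beta Labs", "location": "Modena"},
]
lossy = [{"company": "Acme Corp", "location": "Trento"}]

score_faithful = representativeness(source, faithful)
score_lossy = representativeness(source, lossy)
```

Under these assumptions, a score that drops after a source update (as with `lossy` compared to `faithful`) would be the signal that a manual inspection is worth scheduling. Attribute names are ignored on purpose, so a schema change alone does not lower the score.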

Cited By

  • (2021) Efficient Discovery of Functional Dependencies from Incremental Databases. The 23rd International Conference on Information Integration and Web Intelligence, 400-409. DOI: 10.1145/3487664.3487719. Online publication date: 29-Nov-2021

Published In

ACM Other conferences
iiWAS '20: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services
November 2020
492 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

  • Johannes Kepler University, Linz, Austria

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

iiWAS '20


Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 7
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 12 Dec 2024
