research-article

Data Preparation: A Technological Perspective and Review

Authors:

Alvaro A. A. Fernandes,

Martin Koehler,

Nikolaos Konstantinou,

Norman W. Paton,

Rizos Sakellariou

SN Computer Science, Volume 4, Issue 4

https://doi.org/10.1007/s42979-023-01828-8

Published: 02 June 2023

Abstract

Data analysis often uses data sets that were collected for different purposes. Indeed, new insights are often obtained by combining data sets that were produced independently of each other, for example by combining data from outside an organization with internal data resources. As a result, there is a need to discover, clean, integrate and restructure data into a form that is suitable for an intended analysis. Data preparation, also known as data wrangling, is the process by which data are transformed from their existing representation into a form that is suitable for analysis. In this paper, we review the state of the art in data preparation by: (i) describing functionalities that are central to data preparation pipelines, specifically profiling, matching, mapping, format transformation and data repair; and (ii) presenting how these capabilities surface in different approaches to data preparation, which involve programming, writing workflows, interacting with individual data sets as tables, and automating aspects of the process. These functionalities and approaches are illustrated with reference to a running example that combines open government data with web-extracted real estate data.
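
To make the five functionalities named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' method or any tool surveyed in the paper. The toy listings, the reference table, the column names, the name-similarity matcher and the 0.8 repair cutoff are all assumptions chosen for illustration; real systems use far richer evidence at every step.

```python
import re
from difflib import SequenceMatcher, get_close_matches

# Hypothetical toy data (not from the paper): web-extracted listings and
# an open-government reference table keyed by postcode.
listings = [
    {"Price": "£250,000", "Post Code": "m1 1ae", "Beds": "3"},
    {"Price": "£180,500", "Post Code": "M2 5BO", "Beds": None},  # typo: O for Q
]
reference = {"M1 1AE": {"deprivation_rank": 120},
             "M2 5BQ": {"deprivation_rank": 4510}}
TARGET_COLUMNS = ["price", "postcode", "bedrooms"]  # assumed target schema


def profile(rows):
    """Profiling: null fraction and distinct-value count per column."""
    return {
        col: {
            "null_fraction": sum(r[col] is None for r in rows) / len(rows),
            "distinct": len({r[col] for r in rows if r[col] is not None}),
        }
        for col in rows[0]
    }


def match(source_columns, target_columns):
    """Matching: pair each target column with the most similar source name."""
    def norm(name):
        return re.sub(r"[^a-z]", "", name.lower())

    def similarity(a, b):
        return SequenceMatcher(None, norm(a), norm(b)).ratio()

    return {t: max(source_columns, key=lambda s: similarity(s, t))
            for t in target_columns}


def apply_mapping(rows, correspondences):
    """Mapping: restructure source rows into the target schema."""
    return [{t: r[s] for t, s in correspondences.items()} for r in rows]


def transform_formats(row):
    """Format transformation: numeric price and bedrooms, upper-case postcode."""
    out = dict(row)
    out["price"] = int(re.sub(r"[^0-9]", "", row["price"]))
    out["postcode"] = row["postcode"].upper()
    out["bedrooms"] = int(row["bedrooms"]) if row["bedrooms"] is not None else None
    return out


def repair_postcode(row, valid_postcodes):
    """Repair: replace an unknown postcode with its closest valid one, if any."""
    if row["postcode"] in valid_postcodes:
        return row
    close = get_close_matches(row["postcode"], list(valid_postcodes), n=1, cutoff=0.8)
    return {**row, "postcode": close[0]} if close else row


def prepare(rows, reference):
    """Run matching, mapping, format transformation and repair, then combine."""
    correspondences = match(list(rows[0]), TARGET_COLUMNS)
    prepared = []
    for row in apply_mapping(rows, correspondences):
        row = repair_postcode(transform_formats(row), reference.keys())
        # Combine with the reference table; unmatched rows gain no extra fields.
        prepared.append({**row, **reference.get(row["postcode"], {})})
    return prepared
```

Calling `profile(listings)` before `prepare(listings, reference)` mirrors the usual order of work: profiling surfaces the missing `Beds` value and the inconsistent postcode casing, which then motivates the transformation and repair steps. The missing bedroom count is deliberately left as `None` rather than imputed, since this sketch has no evidence from which to repair it.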

Cited By

Walch A, Szabo A, Steinlechner H, Ortner T, Gröller E, Schmidt J (2024) BEMTrace: Visualization-Driven Approach for Deriving Building Energy Models from BIM. IEEE Transactions on Visualization and Computer Graphics 31:1 (240-250). 10.1109/TVCG.2024.3456315. Online publication date: 23-Sep-2024
https://dl.acm.org/doi/10.1109/TVCG.2024.3456315
Thakur A, Kumar A, Mishra S, Behera S, Sethi J, Sahu S, Swain S (2024) Product Length Predictions with Machine Learning: An Integrated Approach Using Extreme Gradient Boosting. SN Computer Science 5:6. 10.1007/s42979-024-02999-8. Online publication date: 18-Jun-2024
https://dl.acm.org/doi/10.1007/s42979-024-02999-8

Published In

SN Computer Science Volume 4, Issue 4

Apr 2023

1389 pages

EISSN:2661-8907

© The Author(s) 2023.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 02 June 2023

Accepted: 10 April 2023

Received: 16 May 2022
