research-article

DataRinse: Semantic Transforms for Data Preparation Based on Code Mining

Published: 01 August 2023

Abstract

Data preparation is a crucial first step in any data analysis problem. The task is largely manual and is performed by a person familiar with the data domain. DataRinse is a system designed to extract relevant transforms through large-scale static analysis of code repositories. Our motivation is that in any large enterprise, multiple personas, such as data engineers and data scientists, work on similar datasets. However, sharing or reusing their code is neither obvious nor easy to do. In this paper, we demonstrate how DataRinse handles data preparation: the system recommends code designed to help prepare a column for data analysis more generally. We show that DataRinse does not simply shard expressions observed in code but also uses analysis to group expressions applied to the same field, so that related transforms appear coherently to a user. It is a human-in-the-loop system in which users select relevant code snippets produced by DataRinse to apply to their dataset.
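The idea of grouping mined expressions by the field they are applied to can be illustrated with a small sketch. This is not DataRinse's implementation, which the abstract does not describe in detail. It is a minimal illustration, assuming Python 3.9 or later, that the mined code is Python, and that fields are written as string-keyed subscripts such as df["age"]. The sample script and the function name group_transforms_by_field are hypothetical.

```python
import ast
from collections import defaultdict

# Hypothetical snippet standing in for code found in a repository.
SAMPLE = '''
import pandas as pd
df = pd.read_csv("people.csv")
df["age"] = df["age"].fillna(0)
df["name"] = df["name"].str.strip()
df["age"] = df["age"].astype(int)
'''


def group_transforms_by_field(source: str) -> dict[str, list[str]]:
    """Group right-hand-side expressions by the field they are assigned to.

    Only assignments whose target is a string-keyed subscript, such as
    df["age"] = ..., are considered. This is a simplifying assumption.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    tree = ast.parse(source)  # static analysis only: the code is never run
    for node in ast.walk(tree):
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if (
                isinstance(target, ast.Subscript)
                and isinstance(target.slice, ast.Constant)
                and isinstance(target.slice.value, str)
            ):
                field = target.slice.value
                # ast.unparse turns the expression back into source text.
                groups[field].append(ast.unparse(node.value))
    return dict(groups)


if __name__ == "__main__":
    for field, expressions in group_transforms_by_field(SAMPLE).items():
        print(field)
        for expression in expressions:
            print("   ", expression)
```

In this sketch, the two transforms written to the "age" field end up in one group and the single transform on "name" in another, so a user browsing suggestions for a column would see related expressions together.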

Cited By

  • (2024) Empirical Evidence on Conversational Control of GUI in Semantic Automation. Proceedings of the 29th International Conference on Intelligent User Interfaces, 869-885. 10.1145/3640543.3645172. Online publication date: 18-Mar-2024

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment, Volume 16, Issue 12
August 2023
685 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 August 2023
Published in PVLDB Volume 16, Issue 12

Article Metrics

  • Downloads (last 12 months): 48
  • Downloads (last 6 weeks): 8
Reflects downloads up to 14 Dec 2024
