DOI: 10.1145/3691620.3695267
Research article · Open access

AutoDW: Automatic Data Wrangling Leveraging Large Language Models

Published: 27 October 2024

Abstract

Data wrangling is a critical yet often labor-intensive process, essential for transforming raw data into formats suitable for downstream tasks such as machine learning or data analysis. Traditional data wrangling methods can be time-consuming, resource-intensive, and error-prone, limiting the efficiency and effectiveness of subsequent downstream tasks. In this paper, we introduce AutoDW, an end-to-end solution for automatic data wrangling that leverages the power of Large Language Models (LLMs) to enhance automation and intelligence in data preparation. AutoDW distinguishes itself through several innovative features: comprehensive automation that minimizes human intervention, integration of LLMs to enable advanced data processing capabilities, and generation of source code for the entire wrangling process, ensuring transparency and reproducibility. These advancements position AutoDW as a superior alternative to existing data wrangling tools, offering significant improvements in efficiency, accuracy, and flexibility. Through detailed performance evaluations, we demonstrate the effectiveness of AutoDW for data wrangling. We also discuss our experience and lessons learned from AutoDW's industrial deployment, showcasing its potential to transform the landscape of automated data preparation.
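The paper's implementation is not reproduced on this page, but the pattern the abstract describes (profile a raw table, prompt an LLM for wrangling steps, and keep the generated pandas code so the pipeline stays transparent and reproducible) can be sketched briefly. The sketch below is a hypothetical illustration only: the names profile_columns, call_llm, and suggest_wrangling_code, as well as the prompt wording, are assumptions invented for this example, not AutoDW's actual interface.

```python
# Hypothetical sketch of LLM-driven data wrangling in the spirit of the
# abstract: summarize a raw table, ask an LLM to propose cleaning steps,
# and retain the generated pandas script for reproducibility.
# All names and the prompt text are illustrative assumptions, not
# AutoDW's real interface.
import pandas as pd


def profile_columns(df: pd.DataFrame) -> str:
    """Summarize each column (dtype, missing ratio, sample values) as text."""
    lines = []
    for col in df.columns:
        missing = df[col].isna().mean()
        sample = df[col].dropna().head(3).tolist()
        lines.append(
            f"- {col}: dtype={df[col].dtype}, missing={missing:.0%}, sample={sample}"
        )
    return "\n".join(lines)


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to whichever LLM you deploy."""
    raise NotImplementedError("Wire this to your LLM provider.")


def suggest_wrangling_code(df: pd.DataFrame) -> str:
    """Ask the LLM for a self-contained pandas script that cleans the table."""
    prompt = (
        "You are a data-wrangling assistant. Given this column profile, "
        "return an executable pandas script that imputes missing values, "
        "parses date columns, and encodes low-cardinality categoricals:\n\n"
        + profile_columns(df)
    )
    # Persisting the returned script is what provides the transparency
    # and reproducibility the abstract emphasizes.
    return call_llm(prompt)


if __name__ == "__main__":
    raw = pd.DataFrame(
        {"age": [34, None, 51], "joined": ["2021-01-05", "2022-07-19", None]}
    )
    print(profile_columns(raw))
```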




Published In

ASE '24: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering
October 2024, 2587 pages
ISBN: 9798400712487
DOI: 10.1145/3691620

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2024


Author Tags

  1. data wrangling
  2. machine learning
  3. large language models

Qualifiers

  • Research-article

Conference

ASE '24

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

