

Data-centric AI: Techniques and Future Perspectives

Authors: Kwei-Herng Lai, Xia Hu

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 5839 - 5840

https://doi.org/10.1145/3580305.3599553

Published: 04 August 2023

Abstract

The role of data in AI has been significantly magnified by the emerging concept of data-centric AI. In contrast to the traditional model-centric paradigm, which focuses on developing more effective models given fixed datasets, data-centric AI emphasizes the systematic engineering of data in building AI systems. However, because the concept is new, many of its critical aspects remain ambiguous, including its definitions, associated tasks, algorithms, challenges, and benchmarks. This tutorial reviews and discusses this emerging field, with a particular focus on the three general data-centric AI goals: training data development, inference data development, and data maintenance. The objective of this tutorial is threefold: (1) to formally categorize the field of data-centric AI using a goal-driven taxonomy and discuss the needs and challenges of each goal, (2) to comprehensively review the state-of-the-art techniques, and (3) to discuss future perspectives and open research directions to inspire further innovation in this field.



Index Terms

Computing methodologies → Machine learning


Published In

KDD '23 proceedings, August 2023, 5996 pages

ISBN: 9798400701030

DOI: 10.1145/3580305

General Chairs: Ambuj Singh (UC Santa Barbara, USA), Yizhou Sun (UC Los Angeles, USA)

Program Chairs: Leman Akoglu (Carnegie Mellon University, USA), Dimitrios Gunopulos (University of Athens, Greece), Xifeng Yan (UC Santa Barbara, USA), Ravi Kumar (Google, USA), Fatma Ozcan (Google, USA), Jieping Ye (Alibaba DAMO Academy)

Copyright © 2023 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers: Abstract

Conference

KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 6 - 10, 2023

Long Beach, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%



Cited By

Majeed A, Hwang S (2024) A Data-Centric AI Paradigm for Socio-Industrial and Global Challenges. Electronics 13:11 (2156). Online publication date: 1-Jun-2024. https://doi.org/10.3390/electronics13112156

Shantharam R, Schwenker F (2024) ML-Based Pain Recognition Model Using Mixup Data Augmentation. Applied System Innovation 7:6 (124). Online publication date: 9-Dec-2024. https://doi.org/10.3390/asi7060124

Jin W, Wang H, Zha D, Tan Q, Ma Y, Li S, Lee S, Chua T, Ngo C, Kumar R, Lauw H, Ka-Wei Lee R (2024) DCAI: Data-centric Artificial Intelligence. Companion Proceedings of the ACM Web Conference 2024 (1482-1485). Online publication date: 13-May-2024. https://dl.acm.org/doi/10.1145/3589335.3641297

Hasei J, Nakahara R, Otsuka Y, Nakamura Y, Hironari T, Kahara N, Miwa S, Ohshika S, Nishimura S, Ikuta K, Osaki S, Yoshida A, Fujiwara T, Nakata E, Kunisada T, Ozaki T (2024) High-quality expert annotations enhance artificial intelligence model accuracy for osteosarcoma X-ray diagnosis. Cancer Science 115:11 (3695-3704). Online publication date: 2-Sep-2024. https://doi.org/10.1111/cas.16330

Di Cicco N, Ibrahimi M, Musumeci F, Bruschetta F, Milano M, Passera C, Tornatore M (2024) Machine Learning for Failure Management in Microwave Networks: A Data-Centric Approach. IEEE Transactions on Network and Service Management 21:5 (5420-5431). Online publication date: Oct-2024. https://doi.org/10.1109/TNSM.2024.3406934

Volkov E, Averkin A (2024) Fundus Image Quality Assessment: a Brief Review of Techniques. 2024 International Conference on Information Processes and Systems Development and Quality Assurance (IPS) (50-54). Online publication date: 19-Mar-2024. https://doi.org/10.1109/IPS62349.2024.10499480

Nieberl M, Zeiser A, Timinger H (2024) A Review of Data-Centric Artificial Intelligence (DCAI) and its Impact on manufacturing Industry: Challenges, Limitations, and Future Directions. 2024 IEEE Conference on Artificial Intelligence (CAI) (44-51). Online publication date: 25-Jun-2024. https://doi.org/10.1109/CAI59869.2024.00018

Liu S, Feng J, Yang Z, Luo Y, Wan Q, Shen X, Sun J (2024) COMET: "cone of experience" enhanced large multimodal model for mathematical problem generation. Science China Information Sciences 67:12. Online publication date: 11-Dec-2024. https://doi.org/10.1007/s11432-024-4242-0

Kermorvant C, Bardou E, Blanco M, Abadie B (2024) Callico: A Versatile Open-Source Document Image Annotation Platform. Document Analysis and Recognition - ICDAR 2024 (338-353). Online publication date: 30-Aug-2024. https://dl.acm.org/doi/10.1007/978-3-031-70543-4_20
