DOI: 10.1145/3580305.3599553

Data-centric AI: Techniques and Future Perspectives

Published: 04 August 2023

Abstract

The role of data in AI has been significantly magnified by the emerging concept of data-centric AI. In contrast to the traditional model-centric paradigm, which focuses on developing more effective models given fixed datasets, data-centric AI emphasizes the systematic engineering of data in building AI systems. However, because the concept is new, many of its critical aspects remain ambiguous, such as its definitions, associated tasks, algorithms, challenges, and benchmarks. This tutorial reviews and discusses this emerging field, with a particular focus on the three general data-centric AI goals: training data development, inference data development, and data maintenance. The objective of this tutorial is threefold: (1) to formally categorize the field of data-centric AI using a goal-driven taxonomy and discuss the needs and challenges of each goal, (2) to comprehensively review the state-of-the-art techniques, and (3) to discuss future perspectives and open research directions to inspire further innovation in this field.


    Published In

    KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2023
    5996 pages
    ISBN:9798400701030
    DOI:10.1145/3580305
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ai
    2. data-centric ai

    Qualifiers

    • Abstract

    Conference

    KDD '23

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
