Demonstrating SubStrat: A Subset-Based Strategy for Faster AutoML on Large Datasets

Published: 17 October 2022
DOI: 10.1145/3511808.3557160

Abstract

Automated machine learning (AutoML) frameworks are gaining popularity among data scientists, as they dramatically reduce the manual work devoted to constructing ML pipelines while obtaining results similar to, and sometimes even better than, those of manually built models. Such frameworks intelligently search among millions of possible ML pipeline configurations to retrieve a pipeline that is optimal in terms of predictive accuracy. However, when the training dataset is large, the construction and evaluation of each single ML pipeline take longer, which inflates the overall AutoML running time.
To this end, we demonstrate SubStrat, an AutoML optimization strategy that tackles the dataset size rather than the configuration search space. SubStrat wraps existing AutoML tools; instead of executing them directly on the large dataset, it uses a genetic algorithm to find a small yet representative data subset that preserves the characteristics of the original. SubStrat then employs the AutoML tool on the generated subset, yielding an intermediate ML pipeline, which is later refined by executing a restricted, much shorter AutoML process on the full dataset. We demonstrate SubStrat on AutoSklearn, TPOT, and H2O, three popular AutoML frameworks, using several real-life datasets.
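
To make the subset-then-refine flow concrete, below is a minimal sketch using auto-sklearn as the wrapped AutoML tool. It is an illustration under stated assumptions, not the authors' implementation: select_subset is a hypothetical stand-in that samples uniformly at random, whereas SubStrat's actual genetic algorithm evolves row and column index sets that preserve characteristics of the full dataset (see the paper [11] and repository [15]); the sampling fractions and time budgets are likewise illustrative.

```python
# Sketch of a SubStrat-style subset-then-refine AutoML run.
# Assumptions: X is a 2-D NumPy array, y a 1-D label array, and
# auto-sklearn is installed; select_subset and all fractions/budgets
# are illustrative stand-ins, not SubStrat's real genetic search.
import numpy as np
from autosklearn.classification import AutoSklearnClassifier


def select_subset(X, y, row_frac=0.1, col_frac=0.2, seed=0):
    """Hypothetical stand-in for SubStrat's genetic subset search.

    Here rows and columns are sampled uniformly at random; the real
    algorithm evolves index sets whose fitness reflects how well the
    subset preserves properties (e.g., entropy-style measures) of the
    original dataset.
    """
    rng = np.random.default_rng(seed)
    n_rows = max(1, int(len(X) * row_frac))
    n_cols = max(1, int(X.shape[1] * col_frac))
    rows = rng.choice(len(X), size=n_rows, replace=False)
    cols = rng.choice(X.shape[1], size=n_cols, replace=False)
    return X[np.ix_(rows, cols)], y[rows]


def subset_then_refine(X, y, budget=3600):
    # Stage 1: find a small yet representative subset of the large dataset.
    X_sub, y_sub = select_subset(X, y)

    # Stage 2: run the full AutoML configuration search on the subset only,
    # where each candidate pipeline is cheap to train and evaluate.
    intermediate = AutoSklearnClassifier(time_left_for_this_task=budget // 2)
    intermediate.fit(X_sub, y_sub)

    # Stage 3: a restricted, much shorter AutoML pass on the full dataset.
    # SubStrat narrows this search around the intermediate pipeline; the
    # public auto-sklearn API exposes no direct hook for that, so this
    # sketch simply allots the refinement a small fraction of the budget.
    refined = AutoSklearnClassifier(time_left_for_this_task=budget // 10)
    refined.fit(X, y)
    return refined
```

The intended saving is that the expensive configuration search runs almost entirely on the small subset, so only the short refinement pass ever trains on the full dataset.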

Supplementary Material

MP4 File (CIKM22-demo15.mp4)
Presentation video

References

[1] Kaggle Flight Delays Dataset. 2022. https://www.kaggle.com/usdot/flight-delays.
[2] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. Auto-sklearn 2.0: Hands-free AutoML via meta-learning. arXiv preprint arXiv:2007.04074 (2020).
[3] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. 2019. Auto-sklearn: Efficient and Robust Automated Machine Learning. In Automated Machine Learning: Methods, Systems, Challenges. Springer.
[4] Xin He, Kaiyong Zhao, and Xiaowen Chu. 2021. AutoML: A Survey of the State-of-the-Art. Knowledge-Based Systems, Vol. 212 (2021), 106622.
[5] Yuval Heffetz, Roman Vainshtein, Gilad Katz, and Lior Rokach. 2020. DeepLine: AutoML tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2103--2113.
[6] John H. Holland. 1992. Genetic Algorithms. Scientific American, Vol. 267, 1 (1992), 66--73.
[7] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507--523.
[8] Haifeng Jin, Qingquan Song, and Xia Hu. 2019. Auto-Keras: An efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1946--1956.
[9] Shubhra Kanti Karmaker, Md Mahadi Hassan, Micah J. Smith, Lei Xu, Chengxiang Zhai, and Kalyan Veeramachaneni. 2021. AutoML to Date and Beyond: Challenges and Opportunities. ACM Computing Surveys (CSUR), Vol. 54, 8 (2021), 1--36.
[10] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, and Carol Willing. 2016. Jupyter Notebooks -- a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, F. Loizides and B. Schmidt (Eds.). IOS Press, 87--90.
[11] Teddy Lazebnik, Amit Somech, and Abraham Itzhak Weinberg. 2022. SubStrat: A Subset-Based Strategy for Faster AutoML. arXiv preprint arXiv:2206.03070 (2022).
[12] Erin LeDell and Sebastien Poirier. 2020. H2O AutoML: Scalable automatic machine learning. In Proceedings of the AutoML Workshop at ICML, Vol. 2020.
[13] Randal S. Olson and Jason H. Moore. 2016. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Workshop on Automatic Machine Learning (JMLR: Workshop and Conference Proceedings, Vol. 64). PMLR, 66--74.
[14] Esteban Real, Chen Liang, David So, and Quoc Le. 2020. AutoML-Zero: Evolving machine learning algorithms from scratch. In International Conference on Machine Learning. PMLR, 8007--8019.
[15] SubStrat GitHub Repository. 2022. https://github.com/teddy4445/SubStrat.
[16] Chi Wang, Qingyun Wu, Markus Weimer, and Erkang Zhu. 2021. FLAML: A fast and lightweight AutoML library. Proceedings of Machine Learning and Systems, Vol. 3 (2021), 434--447.
[17] Włodzisław Duch, Tadeusz Wieczorek, Jacek Biesiada, and Marcin Blachnik. 2004. Comparison of feature ranking methods based on information entropy. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), Vol. 2. 1415--1419. https://doi.org/10.1109/IJCNN.2004.1380157
[18] Qingyun Wu, Chi Wang, and Silu Huang. 2021. Frugal optimization for cost-related hyperparameters. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10347--10354.
[19] Anatoly Yakovlev, Hesam Fathi Moghadam, Ali Moharrer, Jingxiao Cai, Nikan Chavoshi, Venkatanathan Varadarajan, Sandeep R. Agrawal, Sam Idicula, Tomas Karnagel, Sanjay Jinturkar, et al. 2020. Oracle AutoML: A fast and predictive AutoML pipeline. PVLDB, Vol. 13, 12 (2020), 3166--3180.
[20] Xiao Zhang, Changlin Mei, Degang Chen, and Jinhai Li. 2016. Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy. Pattern Recognition, Vol. 56 (2016), 1--15.



Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022, 5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. automated machine learning (AutoML)
        2. data reduction

        Qualifiers

        • Short-paper

        Conference

        CIKM '22

        Acceptance Rates

CIKM '22 paper acceptance rate: 621 of 2,257 submissions, 28%.
Overall acceptance rate: 1,861 of 8,427 submissions, 22%.
