DOI: 10.1145/3626246.3653389
research-article
Open access

The Hopsworks Feature Store for Machine Learning

Published: 09 June 2024

Abstract

Data management is the most challenging aspect of building Machine Learning (ML) systems. ML systems can read large volumes of historical data when training models, but inference workloads are more varied, depending on whether the system is a batch or an online ML system. The feature store for ML has recently emerged as a single data platform for managing ML data throughout the ML lifecycle, from feature engineering to model training to inference.
In this paper, we present the Hopsworks feature store for machine learning as a highly available platform for managing feature data, with API support for columnar, row-oriented, and similarity search query workloads. We introduce and address the challenges that feature stores solve: feature reuse, how to organize data transformations, and how to keep data correct and consistent across feature engineering, model training, and model inference. We present the engineering challenges in building high-performance query services for a feature store and show how Hopsworks outperforms existing cloud feature stores for training and online inference query workloads.

Published In

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
June 2024
694 pages
ISBN: 9798400704222
DOI: 10.1145/3626246
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. arrow flight
  2. duckdb
  3. feature store
  4. mlops
  5. rondb

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '24

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
