DOI: 10.1145/3415959.3415996
research-article

GPU Accelerated Feature Engineering and Training for Recommender Systems

Published: 26 September 2020

Abstract

In this paper we present our first-place solution to the RecSys Challenge 2020, which focused on predicting user behavior, specifically interaction with content, on this year’s dataset from competition host Twitter. Our approach achieved the highest score in seven of the eight metrics used to calculate the final leaderboard position. The 200-million-tweet dataset required significant computation for feature engineering and for preparing the data for modelling, and our winning solution leveraged several key tools to accelerate our training pipeline. In the paper we describe our exploratory data analysis (EDA) and training, the final features and models used, and the acceleration of the pipeline. Our final implementation runs entirely on GPU, including feature engineering, preprocessing, and model training. Starting from our initial single-threaded efforts in Pandas, which took over ten hours, we accelerated feature engineering, preprocessing, and training to two minutes and eighteen seconds, an end-to-end speedup of over 280x, using a combination of RAPIDS cuDF, Dask, UCX, and XGBoost on a single machine with four NVIDIA V100 GPUs. Even compared to heavily optimized code written later using Dask and Pandas on a 20-core CPU, our solution was still 25x faster. The acceleration of our pipeline was critical to our ability to perform EDA quickly, which led to the discovery of a range of effective features used in the final solution, which is provided as open source [16].
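
The timing figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is only a reader's sanity check, not code from the paper: the variable names are illustrative, and the two baseline runtimes are derived from the quoted speedups rather than reported by the authors.

```python
# Figures quoted in the abstract.
gpu_seconds = 2 * 60 + 18       # "two minutes and eighteen seconds"
speedup_vs_initial = 280        # "over 280x" versus single-threaded Pandas
speedup_vs_optimized = 25       # "25x" versus Dask + Pandas on a 20-core CPU

# Baseline runtimes implied by those speedups (derived, not reported).
initial_hours = gpu_seconds * speedup_vs_initial / 3600
optimized_minutes = gpu_seconds * speedup_vs_optimized / 60

print(f"GPU pipeline: {gpu_seconds} s")
print(f"Implied initial Pandas runtime: at least {initial_hours:.1f} h")
print(f"Implied optimized CPU runtime: about {optimized_minutes:.1f} min")
```

Under these assumptions the initial run comes out at roughly 10.7 hours or more, which agrees with the abstract's "over ten hours", and the optimized 20-core CPU pipeline at roughly 57.5 minutes.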

References

[1] adv1 2016. Adversarial validation, part one. Retrieved August 10th, 2020 from http://fastml.com/adversarial-validation-part-one/
[2] adv2 2018. Adversarial validation, part one. Retrieved August 10th, 2020 from https://www.kaggle.com/tunguz/elo-adversarial-validation
[3] Luca Belli, Sofia Ira Ktena, Alykhan Tejani, Alexandre Lung-Yut-Fon, Frank Portman, Xiao Zhu, Yuanpu Xie, Akshay Gupta, Michael Bronstein, Amra Delić, Gabriele Sottocornola, Walter Anelli, Nazareno Andrade, Jessie Smith, and Wenzhe Shi. 2020. Privacy-Preserving Recommender Systems Challenge on Twitter’s Home Timeline. (2020).
[4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
[5] cudf 2020. cuDF - GPU DataFrames. Retrieved July 6th, 2020 from https://github.com/rapidsai/cudf
[6] dask 2020. Dask: Scalable analytics in Python. Retrieved July 6th, 2020 from https://dask.org/
[7] daydatascience 2018. RAPIDS Accelerates Data Science End-to-End. Retrieved July 6th, 2020 from https://medium.com/rapids-ai/rapids-accelerates-data-science-end-to-end-afda1973b65d
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[9] Chuong B. Do, Quoc V. Le, and Chuan-Sheng Foo. 2009. Proximal Regularization for Online and Batch Learning. In Proceedings of the 26th Annual International Conference on Machine Learning (Montreal, Quebec, Canada) (ICML ’09). Association for Computing Machinery, New York, NY, USA, 257–264. https://doi.org/10.1145/1553374.1553407
[10] Paweł Jankiewicz, Liudmyla Kyrashchuk, Paweł Sienkowski, and Magdalena Wójcik. 2019. Boosting Algorithms for a Session-Based, Context-Aware Recommender System in an Online Travel Domain. In Proceedings of the Workshop on ACM Recommender Systems Challenge (Copenhagen, Denmark) (RecSys Challenge ’19). Association for Computing Machinery, New York, NY, USA, Article 1, 5 pages. https://doi.org/10.1145/3359555.3359557
[11] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 3149–3157.
[12] Liudmila Ostroumova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. In NeurIPS.
[13] pandas 2008. pandas - Python Data Analysis Library. Retrieved July 6th, 2020 from https://pandas.pydata.org/
[14] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv abs/1910.01108 (2019).
[15] Pavel Shamis, Manjunath Gorentla Venkata, M Graham Lopez, Matthew B Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L Graham, Liran Liss, 2015. UCX: an open source framework for HPC network APIs and beyond. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. IEEE, 40–43.
[16] solution 2020. NVIDIA RAPIDS.AI’s final model. Retrieved July 6th, 2020 from https://github.com/rapidsai/deeplearning/tree/master/RecSys2020
[17] Gábor Takács and Domonkos Tikk. 2012. Alternating least squares for personalized ranking. (09 2012). https://doi.org/10.1145/2365952.2365972
[18] Maksims Volkovs, Himanshu Rai, Zhaoyue Cheng, Ga Wu, Yichao Lu, and Scott Sanner. 2018. Two-Stage Model for Automatic Playlist Continuation at Scale. In Proceedings of the ACM Recommender Systems Challenge 2018 (Vancouver, BC, Canada) (RecSys Challenge ’18). Association for Computing Machinery, New York, NY, USA, Article 9, 6 pages. https://doi.org/10.1145/3267471.3267480
[19] Maksims Volkovs, Guang Wei Yu, and Tomi Poutanen. 2017. Content-Based Neighbor Models for Cold Start in Recommender Systems. In Proceedings of the Recommender Systems Challenge 2017 (Como, Italy) (RecSys Challenge ’17). Association for Computing Machinery, New York, NY, USA, Article 7, 6 pages. https://doi.org/10.1145/3124791.3124792

Published In

RecSysChallenge '20: Proceedings of the Recommender Systems Challenge 2020
September 2020
49 pages
ISBN: 9781450388351
DOI: 10.1145/3415959
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Boosting
  2. Feature Engineering
  3. GPU Acceleration
  4. RecSys Challenge
  5. Recommender Systems

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

RecSys Challenge '20

Acceptance Rates

Overall Acceptance Rate 11 of 15 submissions, 73%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 53
  • Downloads (last 6 weeks): 5
Reflects downloads up to 25 Dec 2024

Citations

Cited By

  • (2024) Flexible loss functions for binary classification in gradient-boosted decision trees: An application to credit scoring. Expert Systems with Applications 238 (121876). DOI: 10.1016/j.eswa.2023.121876. Online publication date: Mar-2024
  • (2024) User-Agnostic Model for Retweets Prediction Based on Graph-Embedding Representation of Social Neighborhood Information. Information Management and Big Data (107-120). DOI: 10.1007/978-3-031-63616-5_8. Online publication date: 29-Jun-2024
  • (2023) User-Agnostic Model for Prediction of Retweets Based on Social Neighborhood Information. Information Management and Big Data (18-31). DOI: 10.1007/978-3-031-35445-8_2. Online publication date: 11-Jun-2023
  • (2022) WISCANet: A Rapid Development Platform for Beyond 5G and 6G Radio System Prototyping. Signals 3:4 (682-707). DOI: 10.3390/signals3040041. Online publication date: 9-Oct-2022
  • (2022) Reducing the Friction for Building Recommender Systems with Merlin. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (4816-4817). DOI: 10.1145/3534678.3542633. Online publication date: 14-Aug-2022
  • (2022) Toward a Cognitive-Inspired Hashtag Recommendation for Twitter Data Analysis. IEEE Transactions on Computational Social Systems 9:6 (1748-1757). DOI: 10.1109/TCSS.2022.3169838. Online publication date: Dec-2022
  • (2022) Predicting User Engagements using Graph Neural Networks on Online Social Networks. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) (2293-2297). DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00339. Online publication date: Dec-2022
  • (2022) Understanding social engagements: A comparative analysis of user and text features in Twitter. Social Network Analysis and Mining 12:1. DOI: 10.1007/s13278-022-00872-1. Online publication date: 31-Mar-2022
  • (2021) Team JKU-AIWarriors in the ACM Recommender Systems Challenge 2021: Lightweight XGBoost Recommendation Approach Leveraging User Features. Proceedings of the Recommender Systems Challenge 2021 (39-43). DOI: 10.1145/3487572.3487874. Online publication date: 1-Oct-2021
  • (2021) GPU Accelerated Boosted Trees and Deep Neural Networks for Better Recommender Systems. Proceedings of the Recommender Systems Challenge 2021 (7-14). DOI: 10.1145/3487572.3487605. Online publication date: 1-Oct-2021
