Research Article · Open Access

User Response Models to Improve a REINFORCE Recommender System

Published: 08 March 2021

Abstract

Reinforcement Learning (RL) techniques have been sought after as next-generation tools to further advance the field of recommendation research. Different from classic applications of RL, recommender agents, especially those deployed on commercial recommendation platforms, have to operate in extremely large state and action spaces, serving a dynamic user base on the order of billions and a long-tail item corpus on the order of millions or billions. The (positive) user feedback available to train such agents is extremely scarce in retrospect. Improving the sample efficiency of RL algorithms is thus of paramount importance when developing RL agents for recommender systems. In this work, we present a general framework to augment the training of model-free RL agents with auxiliary tasks for improved sample efficiency. More specifically, we add auxiliary tasks that predict users' immediate responses (positive or negative) toward recommendations, i.e., user response modeling, to enhance the learning of the state and action representations for the recommender agents. We also introduce a tool based on gradient correlation analysis to guide the model design. We showcase the efficacy of our method in offline experiments, learning and evaluating agent policies over hundreds of millions of user trajectories. We also conduct live experiments on an industrial recommendation platform serving billions of users and tens of millions of items to verify its benefit.
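
To make the abstract's setup concrete, below is a minimal sketch (not the paper's code) of a REINFORCE recommender whose shared user-state encoder is also trained on an auxiliary task predicting the user's immediate response to the recommended item, plus a gradient-correlation diagnostic in the spirit of the design tool the abstract mentions. The GRU encoder, layer names, tensor shapes, and the plain (uncorrected) REINFORCE loss are all assumptions for exposition.

```python
# Illustrative sketch only; architecture and names are assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxAugmentedRecommender(nn.Module):
    def __init__(self, num_items, emb_dim=64, state_dim=128):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, emb_dim)
        self.encoder = nn.GRU(emb_dim, state_dim, batch_first=True)  # shared
        self.policy_head = nn.Linear(state_dim, num_items)           # pi(a|s)
        self.response_head = nn.Linear(state_dim + emb_dim, 1)       # p(click|s,a)

    def user_state(self, item_history):
        # item_history: (batch, seq_len) ids of items consumed so far.
        out, _ = self.encoder(self.item_emb(item_history))
        return out[:, -1]  # last hidden state summarizes the user

def losses(model, history, action, reward, response):
    # action: (batch,) recommended item ids; reward: (batch,) long-term reward;
    # response: (batch,) float in {0, 1}, immediate feedback (e.g., click/skip).
    state = model.user_state(history)
    log_pi = F.log_softmax(model.policy_head(state), dim=-1)
    chosen = log_pi.gather(1, action.unsqueeze(1)).squeeze(1)
    rl_loss = -(reward * chosen).mean()  # REINFORCE; off-policy correction omitted
    pair = torch.cat([state, model.item_emb(action)], dim=-1)
    aux_loss = F.binary_cross_entropy_with_logits(
        model.response_head(pair).squeeze(1), response)
    return rl_loss, aux_loss

def gradient_cosine(model, rl_loss, aux_loss):
    # Gradient correlation on the shared encoder: positive cosine similarity
    # suggests the auxiliary task reinforces, rather than fights, the RL task.
    shared = [p for p in model.encoder.parameters() if p.requires_grad]
    g_rl = torch.autograd.grad(rl_loss, shared, retain_graph=True)
    g_aux = torch.autograd.grad(aux_loss, shared, retain_graph=True)
    flat = lambda grads: torch.cat([g.reshape(-1) for g in grads])
    return F.cosine_similarity(flat(g_rl), flat(g_aux), dim=0)
```

In training, the two losses would typically be combined as rl_loss plus a weighted aux_loss; the cosine diagnostic can then inform where the auxiliary head attaches and how heavily it is weighted.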




Information

Published In

WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining
March 2021
1192 pages
ISBN:9781450382977
DOI:10.1145/3437963
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 March 2021


Author Tags

  1. auxiliary tasks
  2. recommender systems
  3. reinforcement learning
  4. user response models

Qualifiers

  • Research-article

Conference

WSDM '21

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%


Article Metrics

  • Downloads (last 12 months): 386
  • Downloads (last 6 weeks): 27
Reflects downloads up to 03 Jan 2025.


Cited By

  • (2024) On the unexpected effectiveness of reinforcement learning for sequential recommendation. Proceedings of the 41st International Conference on Machine Learning, 45432–45450. 10.5555/3692070.3693918. Online publication date: 21-Jul-2024.
  • (2024) Personalised Multi-modal Interactive Recommendation with Hierarchical State Representations. ACM Transactions on Recommender Systems 2(3), 1–25. 10.1145/3651169. Online publication date: 4-Mar-2024.
  • (2024) Δ-OPE: Off-Policy Estimation with Pairs of Policies. Proceedings of the 18th ACM Conference on Recommender Systems, 878–883. 10.1145/3640457.3688162. Online publication date: 8-Oct-2024.
  • (2024) Multi-Objective Recommendation via Multivariate Policy Learning. Proceedings of the 18th ACM Conference on Recommender Systems, 712–721. 10.1145/3640457.3688132. Online publication date: 8-Oct-2024.
  • (2024) Optimal Baseline Corrections for Off-Policy Contextual Bandits. Proceedings of the 18th ACM Conference on Recommender Systems, 722–732. 10.1145/3640457.3688105. Online publication date: 8-Oct-2024.
  • (2024) Future Impact Decomposition in Request-level Recommendations. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 5905–5916. 10.1145/3637528.3671506. Online publication date: 25-Aug-2024.
  • (2024) Explicitly Integrating Judgment Prediction with Legal Document Retrieval: A Law-Guided Generative Approach. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2210–2220. 10.1145/3626772.3657717. Online publication date: 10-Jul-2024.
  • (2024) User Response Modeling in Reinforcement Learning for Ads Allocation. Companion Proceedings of the ACM Web Conference 2024, 131–140. 10.1145/3589335.3648310. Online publication date: 13-May-2024.
  • (2024) A Survey on Reinforcement Learning for Recommender Systems. IEEE Transactions on Neural Networks and Learning Systems 35(10), 13164–13184. 10.1109/TNNLS.2023.3280161. Online publication date: Oct-2024.
  • (2024) CIPPO: Contrastive Imitation Proximal Policy Optimization for Recommendation Based on Reinforcement Learning. IEEE Transactions on Knowledge and Data Engineering 36(11), 5753–5767. 10.1109/TKDE.2024.3402649. Online publication date: Nov-2024.
