Computer Science > Machine Learning

arXiv:2305.17400 (cs)

[Submitted on 27 May 2023 (v1), last revised 5 Jul 2024 (this version, v3)]

Title: Query-Policy Misalignment in Preference-Based Reinforcement Learning

Authors: Xiao Hu, Jianxiong Li, Xianyuan Zhan, Qing-Shan Jia, Ya-Qin Zhang

Abstract: Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human-desired outcomes, but is often constrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries that maximally improve the overall quality of the reward model. Counter-intuitively, we find that this does not necessarily lead to improved performance. To explain this, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of the reward model may not align with the RL agent's interests, thus offering little help for policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy query and a specially designed hybrid experience replay, which together enforce bidirectional query-policy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. Comprehensive experiments show that our method achieves substantial gains in both human feedback efficiency and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.
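
The two components named in the abstract, near on-policy query and hybrid experience replay, can be illustrated with a minimal Python sketch. The abstract does not specify either mechanism in detail, so everything below is an assumption made for illustration: queries are taken to be pairs of fixed-length segments drawn only from the most recent trajectories, and the hybrid replay is taken to mix a fixed fraction of recent transitions with uniform samples from the whole buffer. The names `near_on_policy_queries`, `HybridReplayBuffer`, `recent_size`, and `recent_fraction` are hypothetical and are not taken from the authors' code.

```python
import random
from collections import deque


def near_on_policy_queries(trajectories, n_recent_traj, n_queries,
                           segment_len, rng=random):
    """Sample segment pairs only from the most recent trajectories.

    Assumption: "near on-policy" is approximated by restricting query
    selection to the last `n_recent_traj` trajectories collected.
    Each trajectory must contain at least `segment_len` steps.
    """
    recent = trajectories[-n_recent_traj:]

    def sample_segment():
        traj = rng.choice(recent)
        start = rng.randrange(len(traj) - segment_len + 1)
        return traj[start:start + segment_len]

    # Each query is a pair of segments to be compared by the human labeler.
    return [(sample_segment(), sample_segment()) for _ in range(n_queries)]


class HybridReplayBuffer:
    """Replay buffer mixing recent transitions with uniform samples.

    Assumption: a fixed fraction of every batch comes from the newest
    `recent_size` transitions; the rest is drawn from the full buffer.
    """

    def __init__(self, capacity, recent_size):
        self.buffer = deque(maxlen=capacity)
        self.recent_size = recent_size

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, recent_fraction=0.5, rng=random):
        # Requires a non-empty buffer; sampling is with replacement.
        all_items = list(self.buffer)
        recent_items = all_items[-self.recent_size:]
        n_recent = int(batch_size * recent_fraction)
        batch = [rng.choice(recent_items) for _ in range(n_recent)]
        batch += [rng.choice(all_items) for _ in range(batch_size - n_recent)]
        return batch
```

In this sketch the policy update would call `sample` in place of a uniform replay draw, and the feedback loop would call `near_on_policy_queries` in place of an uncertainty-based or disagreement-based selector. Whether this matches the authors' implementation is not stated in the abstract.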

Comments:	Accepted by ICLR 2024
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2305.17400 [cs.LG]
	(or arXiv:2305.17400v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2305.17400

Submission history

From: Xiao Hu
[v1] Sat, 27 May 2023 07:55:17 UTC (29,881 KB)
[v2] Thu, 23 Nov 2023 16:27:42 UTC (29,881 KB)
[v3] Fri, 5 Jul 2024 14:26:21 UTC (41,531 KB)
