Reinforcement Learning from Human Feedback (RLHF) is a leading technique for aligning large language models (LLMs) with human preferences. However, RLHF often suffers from overoptimization: the optimized LLM generates responses that achieve high reward scores but are ultimately misaligned with human preferences. To address this, we propose Uncertainty-Penalized RLHF (UP-RLHF), a novel framework that incorporates two forms of regularization: uncertainty from reward models and Kullback-Leibler (KL) divergence from the initial policy model. A common way to quantify uncertainty is to use an ensemble of models. Yet directly ensembling LLM-based reward models is parameter-inefficient, and the resulting ensembles often lack diversity among their members. To overcome these limitations, we introduce a diversified ensemble of low-rank adaptations (LoRA) for reward modeling, which provides a parameter-efficient and effective way to quantify reward uncertainty. We conducted extensive experiments on two human preference datasets and one mathematical task. Our evaluation of the reward models yields two key findings: encouraging diversity is crucial for LoRA ensembles, and our diversified LoRA ensembles effectively quantify uncertainty. This method improved the out-of-distribution (OOD) detection AUROC by 44% for OPT-330M and 31% for Llama-2-7B, compared to standard LoRA ensembles under identical settings. By integrating this uncertainty regularization, UP-RLHF prevents the LLM policy from producing overestimated, low-quality content, mitigating overoptimization and enhancing alignment performance. In evaluations, LLMs trained with UP-RLHF outperformed those trained with vanilla RLHF, achieving a 12% improvement on a summarization task and a 56% GPT-4-judged win rate on a helpful dialogue task.
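The core idea of combining reward-model uncertainty with a KL penalty can be illustrated with a minimal sketch. The snippet below assumes the penalized reward takes the form of an ensemble mean minus a weighted ensemble standard deviation minus a weighted KL term; the function name `uncertainty_penalized_reward` and the weights `lam` and `beta` are illustrative choices, not identifiers from the paper.

```python
import torch

def uncertainty_penalized_reward(ensemble_rewards, kl_div, lam=1.0, beta=0.1):
    """Combine LoRA-ensemble reward estimates into a single penalized reward.

    ensemble_rewards: tensor of shape (k,), one scalar reward per ensemble member
    kl_div: KL divergence between the current policy and the initial (SFT) policy
    lam: weight on the uncertainty (disagreement) penalty
    beta: weight on the KL penalty
    """
    mean_reward = ensemble_rewards.mean()
    uncertainty = ensemble_rewards.std()  # disagreement among ensemble members
    return mean_reward - lam * uncertainty - beta * kl_div

# Example: three LoRA reward heads disagree strongly on a response,
# so the uncertainty penalty pulls the effective reward down.
rewards = torch.tensor([2.1, 1.8, 0.4])
kl = torch.tensor(0.05)
print(uncertainty_penalized_reward(rewards, kl))
```

In this formulation, responses that only some ensemble members score highly receive a reduced effective reward, discouraging the policy from exploiting reward-model errors.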
