Improving Reward Models with Proximal Policy Exploration for Preference-Based Reinforcement Learning

Sep 26, 2025

Yiwen Zhu*, Jinyi Liu*, Pengjie Gu, Yifu Yuan, Zhenxing Ge, Wenya Wei, Zhou Fang, Yujing Hu, Bo An

Type. Conference paper

Publication. NeurIPS 2025

Overview

A reward-modeling approach for preference-based reinforcement learning built on proximal policy exploration.

Venue. NeurIPS 2025

Last updated on Sep 26, 2025

DRL, Exploration, Preference Learning

Jinyi Liu

Authors

Ph.D. Candidate Reinforcement Learning and LLM Systems
