ReMix brings off-policy reinforcement finetuning to LLM post-training by reusing rollout data from past policies, substantially reducing training cost while remaining competitive on math reasoning benchmarks.
Jan 5, 2026
DistRLVR is a distributional RL framework for LLM post-training with verifiable rewards: it models token-level return distributions and uses tail-aware advantages to improve sample efficiency and reasoning performance.
Jan 1, 2026