Squeeze the Soaked Sponge: Efficient Off-policy RFT for Large Language Model
ReMix brings off-policy reinforcement finetuning (RFT) to LLM post-training by reusing rollout data generated by past policies, which substantially reduces training cost while remaining competitive on math reasoning benchmarks.
Jan 5, 2026