Learning to Reason as Action Abstractions with Scalable Mid-Training RL • Paper • 2509.25810 • Published Sep 30, 2025
Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning • Paper • 2505.20561 • Published May 26, 2025
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer • Paper • 2405.16436 • Published May 26, 2024
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs • Paper • 2410.08067 • Published Oct 10, 2024
DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs • Paper • 2411.13611 • Published Nov 20, 2024
Offline Reinforcement Learning for LLM Multi-Step Reasoning • Paper • 2412.16145 • Published Dec 20, 2024
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment • Paper • 2405.19332 • Published May 29, 2024