GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization Paper • 2601.05242 • Published 1 day ago • 102
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits Paper • 2512.20578 • Published 17 days ago • 68
On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral Paper • 2512.04220 • Published Dec 3, 2025 • 13
Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models Paper • 2512.13607 • Published 25 days ago • 30
view article Article Apriel-1.6-15b-Thinker: Cost-efficient Frontier Multimodal Performance Dec 9, 2025 • 82
Nemotron-Post-Training-v3 Collection Collection of datasets used in the post-training phase of Nemotron Nano v3. • 7 items • Updated 17 days ago • 56
Tiny-A2D Collection Small diffusion language models adapted from AR models • 4 items • Updated Dec 6, 2025 • 13
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices Paper • 2512.01374 • Published Dec 1, 2025 • 96