Model Card for ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps
This checkpoint was obtained by fine-tuning Qwen/Qwen3-4B-Base with the MaxRL objective introduced in our paper, "Maximum Likelihood Reinforcement Learning". MaxRL is a framework for optimizing maximum likelihood in RL settings.
Model Details
Model Description
This model card describes a Qwen/Qwen3-4B-Base model fine-tuned with MaxRL.
- Fine-tuned from model: Qwen/Qwen3-4B-Base
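For readers who want to try the checkpoint, the following is a minimal loading sketch rather than an official usage recipe. It assumes the checkpoint is hosted under the repository ID `ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps` and loads with the standard `transformers` causal-LM classes, and that `torch`, `transformers`, and `accelerate` (needed for `device_map="auto"`) are installed. The `build_prompt` helper and its plain "Question/Answer" format are hypothetical; the prompt template used during training is defined in the training script and may differ.

```python
# Assumed repository ID for this checkpoint on the Hugging Face Hub.
MODEL_ID = "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps"


def build_prompt(question: str) -> str:
    """Hypothetical plain-text prompt; the training prompt format may differ."""
    return f"Question: {question.strip()}\nAnswer:"


def generate(question: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint and return the model's continuation for one question."""
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumption: bf16 is suitable for inference here
        device_map="auto",           # requires the accelerate package
    )

    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated continuation.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(generate("What is 12 * 13?"))
```

Because this is a base model that was post-trained with RL rather than a chat model, the sketch passes a raw text prompt instead of applying a chat template.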
Model Sources
- Repository: Official Code Release for the paper "Maximum Likelihood Reinforcement Learning"
- Paper: Maximum Likelihood Reinforcement Learning (https://arxiv.org/abs/2602.02710)
- Project Website: https://zanette-labs.github.io/MaxRL/
Training Details
Training Data
We train on the POLARIS-53K dataset to produce this checkpoint.
Training Procedure
To reproduce this checkpoint, use the provided training script or, more generally, the published codebase. Hyperparameters and other details are given in the training script.
Due to computational constraints, we trained for 1000 steps and released the final checkpoint.
Hardware
This model was fine-tuned on 32 NVIDIA H200 GPUs (4 nodes with 8 H200 GPUs each).
Citation
BibTeX:
@misc{tajwar2026maximumlikelihoodreinforcementlearning,
title={Maximum Likelihood Reinforcement Learning},
author={Fahim Tajwar and Guanning Zeng and Yueer Zhou and Yuda Song and Daman Arora and Yiding Jiang and Jeff Schneider and Ruslan Salakhutdinov and Haiwen Feng and Andrea Zanette},
year={2026},
eprint={2602.02710},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.02710},
}
Model Card Contact
Collection
This checkpoint (ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps) is part of the 4-item collection "Qwen3-Base post-trained checkpoints for our paper, Maximum Likelihood Reinforcement Learning" (https://zanette-labs.github.io/MaxRL/).