Model Card for ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps

This is a checkpoint saved from fine-tuning the Qwen/Qwen3-4B-Base model with the MaxRL objective from "Maximum Likelihood Reinforcement Learning". In our work, we introduce MaxRL, a framework for optimizing maximum likelihood in RL settings.
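To make "maximum likelihood in RL settings" concrete, the following is a generic illustration, not the MaxRL estimator itself, which is defined in the paper. It assumes a binary correctness reward r(x, y) in {0, 1}; the symbols p_θ, J_RL, and J_ML are notation introduced here for illustration and do not come from this card.

```latex
% Assumed setup: prompt x, response y sampled from the policy \pi_\theta,
% binary reward r(x, y) \in \{0, 1\}.
p_\theta(x) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]

% Expected-reward objective versus log-likelihood-of-success objective.
J_{\mathrm{RL}}(\theta) = \mathbb{E}_{x}\big[ p_\theta(x) \big],
\qquad
J_{\mathrm{ML}}(\theta) = \mathbb{E}_{x}\big[ \log p_\theta(x) \big]

% Chain rule: the two per-prompt gradients differ by a factor 1 / p_\theta(x).
\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta\, p_\theta(x)}{p_\theta(x)}
```

Under these assumptions, the likelihood gradient is the expected-reward gradient reweighted by 1 / p_θ(x), so prompts with a low success rate receive more weight.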

Model Details

Model Description

This model is Qwen/Qwen3-4B-Base fine-tuned with MaxRL.
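A minimal sketch of loading this checkpoint for inference with the Hugging Face `transformers` library, assuming the weights are hosted under the repository ID `ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps` and that `torch`, `transformers`, and `accelerate` are installed. The prompt format in `build_prompt` is hypothetical and is not specified by this card; adapt it to the format used in the training script.

```python
# Repository ID assumed from the hosting page for this checkpoint.
MODEL_ID = "ftajwar/qwen3_4B_Base_MaxRL_Polaris_1000_steps"


def build_prompt(question: str) -> str:
    """Wrap a question in a plain-text prompt (hypothetical format, not from the card)."""
    return f"Question: {question}\nAnswer:"


def generate(question: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint and greedily decode a completion for one question."""
    # Imported lazily so the helpers above can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumption: load in BF16 to match the stored weights
        device_map="auto",           # requires the `accelerate` package
    )

    inputs = tokenizer(build_prompt(question), return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("What is 12 * 7?")` downloads the weights on first use and returns the decoded completion. The function reloads the model on every call to keep the sketch short; for repeated queries, load the tokenizer and model once and reuse them.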

Model Sources

Paper: https://arxiv.org/abs/2602.02710

Training Details

Training Data

We train on the POLARIS-53K dataset to produce this checkpoint.

Training Procedure

To reproduce the training of this checkpoint, please use the provided training script or, more generally, the published codebase. Hyperparameters and other details are given in the training script.

Due to computational constraints, we trained for 1000 steps and release the final checkpoint.

Hardware

This model was fine-tuned on 32 NVIDIA H200 GPUs (4 nodes with 8 H200 GPUs each).

Citation

BibTeX:

@misc{tajwar2026maximumlikelihoodreinforcementlearning,
      title={Maximum Likelihood Reinforcement Learning}, 
      author={Fahim Tajwar and Guanning Zeng and Yueer Zhou and Yuda Song and Daman Arora and Yiding Jiang and Jeff Schneider and Ruslan Salakhutdinov and Haiwen Feng and Andrea Zanette},
      year={2026},
      eprint={2602.02710},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.02710}, 
}

Model Card Contact

Fahim Tajwar

Model size: 4B params. Tensor type: BF16. Weights format: Safetensors.