DASD-4B-Thinking-2507-stage2

DASD-4B-Thinking-2507-stage2 is the final model in a three-stage training pipeline built upon Qwen/Qwen3-4B-Thinking-2507. It combines Reinforcement Learning via GRPO with a two-stage Supervised Fine-Tuning (SFT) strategy inspired by the Distribution-Aligned Sequence Distillation (DASD) methodology introduced by Alibaba Cloud Apsara Lab, resulting in a compact 4B model with enhanced mathematical reasoning and long chain-of-thought capabilities.


🧬 Training Pipeline Overview

This model is the culmination of three sequential training stages:

```
Qwen/Qwen3-4B-Thinking-2507
         │
         ▼  Stage 0: GRPO (RL on Math & Reasoning)
DASD-4B-Thinking-2507-GRPO-v2
         │
         ▼  Stage 1: SFT with Low-Temperature (T=0.6) Distillation Data
DASD-4B-Thinking-2507-stage1
         │
         ▼  Stage 2: SFT with Default-Temperature (T=1.0) Distillation Data
DASD-4B-Thinking-2507-stage2  ← (this model)
```

πŸ“š Stage Details

Stage 0 β€” GRPO Reinforcement Learning: DASD-4B-Thinking-2507-GRPO-v2

Starting from the base model Qwen/Qwen3-4B-Thinking-2507, Group Relative Policy Optimization (GRPO) was applied using a high-quality mathematical reasoning dataset distilled from DeepSeek-R1. This stage significantly improved the model's:

  • Correctness on math problem solving
  • Step-by-step logical reasoning
  • Reward signal alignment for verifiable tasks
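The core idea of GRPO is to score each sampled completion relative to the other completions in its group, rather than against a learned value function. A minimal sketch of that group-relative advantage computation (illustrative only; the reward values and group size here are made up, and this is not the actual training code used for this model):

```python
# Sketch of GRPO's group-relative advantage: sample a group of answers to the
# same prompt, score each with a verifiable reward, then normalize each reward
# against the group's mean and standard deviation.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: 4 sampled answers to one math prompt; reward 1.0 if the final
# answer verifies as correct, 0.0 otherwise (hypothetical values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# Correct answers get positive advantage, incorrect ones negative.
```

Completions that beat their group's average are reinforced and those below it are discouraged, which is why verifiable rewards (e.g. checking a math answer) pair well with this algorithm.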

Stage 1 β€” Low-Temperature SFT: DASD-4B-Thinking-2507-stage1

Inspired by the Distribution-Aligned Sequence Distillation (DASD) pipeline from Alibaba-Apsara, Stage 1 SFT was performed using the low-temperature subset (T=0.6) of the Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b dataset.

πŸ’‘ Why Low-Temperature Distillation for Small Models?

Low-temperature sampling from the teacher model (gpt-oss-120b) produces sharper, more deterministic output distributions, which are significantly easier for small student models to imitate and internalize. This "cold-start" strategy:

  • Reduces distributional mismatch between teacher and student: the cleaner, more peaked distributions produced at low temperature align better with what a small model can currently express
  • Provides a stable foundation: the model first learns the most consistent and representative reasoning patterns before being exposed to more diverse trajectories
  • Boosts early performance: low-temperature data gives an efficient jump-start on math and scientific reasoning benchmarks
  • Mitigates exposure bias: by introducing complexity gradually, the model avoids overfitting to noisy or outlier reasoning traces

This is the key insight behind DASD's temperature-scheduled learning: start cold for stability, then warm up for diversity.
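The cold-then-warm intuition can be seen directly in how temperature reshapes a softmax distribution: dividing logits by T < 1 sharpens the peak, while T = 1 keeps the broader mode coverage. A toy sketch with made-up logits:

```python
# Toy illustration of temperature's effect on a teacher's next-token
# distribution. The logit values are invented for demonstration.
import math


def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by temperature T (numerically stable)."""
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


logits = [2.0, 1.0, 0.5, 0.1]
cold = softmax_with_temperature(logits, 0.6)  # sharper, more deterministic
warm = softmax_with_temperature(logits, 1.0)  # flatter, broader coverage
# The top token's probability is higher under T=0.6 than under T=1.0.
```

Sequences sampled from the sharper T=0.6 distribution are more consistent and thus easier for a 4B student to imitate; the T=1.0 data then restores diversity once the foundation is in place.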

Dataset used: Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 1, T=0.6 subset)


Stage 2 β€” Default-Temperature SFT: DASD-4B-Thinking-2507-stage2 (this model)

Building on DASD-4B-Thinking-2507-stage1, Stage 2 SFT was performed using the default-temperature subset (T=1.0) of the same dataset. Higher-temperature data introduces greater lexical diversity and broader mode coverage, enabling the model to generalize better across diverse reasoning patterns and problem domains.

Dataset used: Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 2, T=1.0 subset)


πŸ—‚οΈ All Datasets Used

| Stage | Dataset | Purpose |
|---|---|---|
| GRPO (RL) | a-m-team/AM-DeepSeek-R1-Distilled-1.4M | Math & reasoning RL training via GRPO |
| SFT Stage 1 | Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 1, T=0.6) | Low-temp distillation, stable cold-start |
| SFT Stage 2 | Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b (Stage 2, T=1.0) | High-temp distillation, diversity & generalization |

The Superior-Reasoning-SFT-gpt-oss-120b dataset is itself built from several upstream question sources; see its dataset card for the full list.


πŸƒ Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jackrong/DASD-4B-Thinking-2507-stage2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Solve: find all real solutions to x^3 - 6x^2 + 11x - 6 = 0."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

Tip: This model naturally generates <think>...</think> reasoning traces before the final answer. You can parse these to inspect the chain-of-thought.
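A small regex-based parser is enough to separate the reasoning trace from the final answer. A minimal sketch (assumes the response contains at most one <think>...</think> block; the sample string below is made up):

```python
# Split a generated response into its <think> reasoning trace and the
# final answer that follows the closing </think> tag.
import re


def split_thinking(response: str):
    """Return (thinking, answer); thinking is empty if no <think> block."""
    m = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if m is None:
        return "", response.strip()
    thinking = m.group(1).strip()
    answer = response[m.end():].strip()
    return thinking, answer


# Hypothetical model output for the cubic in the quickstart above.
sample = "<think>Factor: (x-1)(x-2)(x-3).</think>\nx = 1, 2, 3."
trace, answer = split_thinking(sample)
```

This is convenient for logging the chain-of-thought separately while showing users only the final answer.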


πŸ“‹ Model Details

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Architecture | Qwen3 (4B Dense) |
| License | Apache 2.0 |
| Language(s) | English, Chinese |
| Training Framework | Unsloth + Hugging Face TRL |
| RL Algorithm | GRPO (Group Relative Policy Optimization) |
| Fine-tuning Method | SFT (two-stage temperature-scheduled distillation) |
| Developed by | Jackrong |

⚠️ Limitations & Intended Use

  • This model is intended for research and educational purposes related to reasoning and mathematical problem-solving.
  • While mathematical and logical reasoning capabilities have been enhanced, the model may still produce incorrect answers; always verify outputs on critical tasks.
  • The model inherits the capabilities and limitations of the underlying Qwen3-4B-Thinking-2507 architecture.
  • Not intended for deployment in high-stakes applications without additional safety evaluation.

πŸ“Ž Related Models

| Model | Description |
|---|---|
| Qwen/Qwen3-4B-Thinking-2507 | Base model |
| Jackrong/DASD-4B-Thinking-2507-GRPO-v2 | After GRPO RL training |
| Jackrong/DASD-4B-Thinking-2507-stage1 | After low-temperature SFT |
| Jackrong/DASD-4B-Thinking-2507-stage2 | This model (final stage) |

πŸ™ Acknowledgements

  • Alibaba Cloud Apsara Lab for the DASD methodology and the Superior-Reasoning-SFT-gpt-oss-120b dataset
  • AM-Team for the DeepSeek-R1 distilled dataset
  • NVIDIA for open reasoning datasets
  • Unsloth for efficient fine-tuning infrastructure
  • Qwen Team for the excellent base model