RB-QWEN-3B-16DS-LORA

ReasonBorn v3 is a specialized LoRA (Low-Rank Adaptation) for Qwen2.5-3B, fine-tuned for structured multi-step reasoning. It is trained to decompose complex problems into a specific XML-based reasoning chain (Plan → Step-by-Step Reasoning → Verified Conclusion).

This version was trained using the "MI300X Bulletproof Edition v3" pipeline, optimized for high-throughput compute on AMD Instinct MI300X (192GB) hardware.

🚀 Model Details

  • Developed by: Soham (Phase-Technologies / Xerv-AI)
  • Model Type: PeftAdapter (LoRA)
  • Base Model: Qwen/Qwen2.5-3B
  • Training Architecture: Causal Language Modeling with structured CoT (Chain of Thought).
  • Format: ChatML with custom XML reasoning tags.

🧠 Reasoning Format

The model is trained to strictly follow the ReasonBorn protocol and wraps its output in the following structure:

<|im_start|>system
You are ReasonBorn. Output only: <plan>,<reasoning><step>...</step></reasoning>,<conclusion>\boxed{}.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
<plan>Decompose→reason→verify→conclude.</plan>
<reasoning>
<step index="1">Observation and initial approach.</step>
<step index="2">Calculation or logical derivation.</step>
...
<step index="n">Final verification.<verify>ok</verify></step>
</reasoning>
<conclusion>\boxed{result}</conclusion><|im_end|>
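A minimal sketch of a parser for this structure (hypothetical helper, not part of the released code; it assumes the model emitted well-formed tags):

```python
import re

def parse_reasonborn(text: str) -> dict:
    """Split a ReasonBorn-formatted completion into plan, steps, and conclusion.

    Hypothetical helper for illustration only -- assumes well-formed
    <plan>/<reasoning>/<conclusion> tags as shown in the template above.
    """
    plan = re.search(r"<plan>(.*?)</plan>", text, re.DOTALL)
    steps = re.findall(r'<step index="\d+">(.*?)</step>', text, re.DOTALL)
    conclusion = re.search(r"<conclusion>(.*?)</conclusion>", text, re.DOTALL)
    return {
        "plan": plan.group(1).strip() if plan else None,
        # Note: a final step keeps its inline <verify> tag in the captured text.
        "steps": [s.strip() for s in steps],
        "conclusion": conclusion.group(1).strip() if conclusion else None,
    }

sample = (
    "<plan>Decompose.</plan>\n<reasoning>\n"
    '<step index="1">2+2=4.<verify>ok</verify></step>\n'
    "</reasoning>\n<conclusion>\\boxed{4}</conclusion>"
)
parsed = parse_reasonborn(sample)
```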

📊 Training Data Mix (16DS Mixture)

The model was trained on a curated mix of ~200k samples across 11 high-quality datasets to balance math, science, and general logic:

| Dataset | Samples | Focus |
| --- | --- | --- |
| NuminaMath-CoT | 60,000 | Advanced Math Reasoning |
| OrcaMath | 60,000 | Word Problems |
| UltraMath-Conv | 50,000 | Synthetic Conversation Math |
| SciQ / OpenBookQA | ~16,000 | General Science |
| GSM8K | 7,473 | Grade School Math |
| AI2_ARC (Challenge) | 7,500 | Hard Science Questions |
| Xerv-AI/GRAD | 1,933 | Graduate-Level Maths |
| GPQA / HLE / ChemQA | ~7,000 | Expert-Level Logic & Chemistry |
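The per-dataset caps in the training script's `DATA_MIX` sum to roughly 210k raw samples before formatting and filtering drops; a quick check:

```python
# Per-dataset sample caps, copied from the DATA_MIX dict in the training
# script below (raw counts, before format_example filtering).
sample_caps = {
    "NuminaMath": 60_000, "OrcaMath": 60_000, "UltraMath-Conv": 50_000,
    "GSM8K": 7_473, "AI2_ARC": 7_500, "SciQ": 11_679, "OpenBookQA": 4_957,
    "GPQA": 198, "ChemistryQA": 4_000, "HLE": 2_700, "GRAD": 1_933,
}
total = sum(sample_caps.values())
print(total)  # 210440 raw samples across 11 datasets
```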

βš™οΈ Training Hyperparameters

Optimized for the AMD MI300X for sub-6-hour completion:

  • GPU: 1x AMD MI300X 192GB
  • Precision: bf16 (BFloat16)
  • Optimizer: adamw_torch_fused
  • Learning Rate: 2.5e-4 (Cosine schedule)
  • Epochs: 1.15
  • Batch Size: 48 (Global Batch: 96 with Grad Accum 2)
  • Max Context: 512 tokens
  • LoRA Config:
      • Rank (R): 16
      • Alpha: 32
      • Target Modules: All Linear Layers (q, k, v, o, gate, up, down_proj)
      • Dropout: 0.05
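As a sanity check on the numbers above, the effective global batch size and the standard LoRA scaling factor (alpha / r) work out as:

```python
# Effective batch: per-device batch times gradient accumulation steps.
per_device_batch = 48
grad_accum = 2
global_batch = per_device_batch * grad_accum  # 96, matching the config above

# LoRA weight updates are scaled by alpha / r in standard PEFT implementations.
lora_r, lora_alpha = 16, 32
scaling = lora_alpha / lora_r  # 2.0
```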

πŸ› οΈ Usage (Inference)

Since this is a LoRA adapter, you must load the base model first:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# 1. Setup IDs
base_model_id = "Qwen/Qwen2.5-3B"
adapter_id = "Xerv-AI/rb-qwen3b-16ds-lora"

# 2. Load Tokenizer and Base Model
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16, # Or float16 if no BF16 support
    device_map="auto"
)

# 3. Load your LoRA adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# 4. Prepare the prompt (matching your specific ReasonBorn template)
question = "Solve x^3 - 6x^2 + 11x - 6 = 0 for real roots."
prompt = (
    "<|im_start|>system\n"
    "You are ReasonBorn. Output only: <plan>,<reasoning><step>...</step></reasoning>,<conclusion>\\boxed{}.\n"
    "<|im_end|>\n"
    f"<|im_start|>user\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 5. Generate
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# 6. Decode
result = tokenizer.decode(output[0], skip_special_tokens=True)
print(result.split("assistant\n")[-1])
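To pull just the final answer out of the generated text, a small extraction helper (hypothetical, assuming the model kept to the ReasonBorn format) can look for the \boxed{} conclusion:

```python
import re

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in the output, or None.

    Non-greedy match: does not handle nested braces inside the box.
    Illustrative helper only, not part of the released code.
    """
    matches = re.findall(r"\\boxed\{(.*?)\}", text, re.DOTALL)
    return matches[-1].strip() if matches else None

answer = extract_boxed("<conclusion>\\boxed{x = 1, 2, 3}</conclusion>")
```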

Training Script

import os
import gc
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

import torch
from huggingface_hub import login
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model

os.environ["TOKENIZERS_PARALLELISM"] = "false"

MODEL_ID   = "Qwen/Qwen2.5-3B"
REPO_NAME  = "rb-qwen3b-16ds-lora"
SAVE_DIR   = "./rb-qwen-16ds-lora-final"

MAX_CTX    = 512
EPOCHS     = 1.15
LR         = 2.5e-4
LORA_R     = 16
LORA_ALPHA = 32
BATCH_SIZE = 48
GRAD_ACCUM = 2
WORKERS    = 12

DATA_MIX = {
    "NuminaMath":     {"path": "AI-MO/NuminaMath-CoT",                                                        "max_samples": 60000, "split": "train"},
    "OrcaMath":       {"path": "microsoft/orca-math-word-problems-200k",                                      "max_samples": 60000, "split": "train"},
    "UltraMath-Conv": {"path": "openbmb/UltraData-Math", "config": "UltraData-Math-L3-Conversation-Synthetic","max_samples": 50000, "split": "train"},
    "GSM8K":          {"path": "openai/gsm8k",           "config": "main",                                    "max_samples": 7473,  "split": "train"},
    "AI2_ARC":        {"path": "allenai/ai2_arc",        "config": "ARC-Challenge",                           "max_samples": 7500,  "split": "train"},
    "SciQ":           {"path": "sciq",                                                                        "max_samples": 11679, "split": "train"},
    "OpenBookQA":     {"path": "openbookqa",                                                                  "max_samples": 4957,  "split": "train"},
    "GPQA":           {"path": "Idavidrein/gpqa",        "config": "gpqa_diamond",                            "max_samples": 198,   "split": "train"},
    "ChemistryQA":    {"path": "avaliev/ChemistryQA",                                                         "max_samples": 4000,  "split": "train"},
    "HLE":            {"path": "cais/hle",                                                                    "max_samples": 2700,  "split": "test"},
    "GRAD":           {"path": "Xerv-AI/GRAD",                                                                "max_samples": 1933,  "split": "train"},
}

def format_example(ex):
    try:
        q = str(ex.get("question") or ex.get("problem") or ex.get("prompt") or "").strip()
        s = str(ex.get("answer")   or ex.get("solution") or ex.get("response") or "").strip()
        if len(q) < 5 or len(s) < 5:
            return None
        boxed     = re.search(r'\\boxed\{(.*?)\}', s, re.DOTALL)
        ans       = boxed.group(1).strip() if boxed else s[:80]
        reasoning = re.sub(r'\\boxed\{.*?\}', '', s, flags=re.DOTALL).strip()
        steps     = [l.strip() for l in reasoning.split('\n') if len(l.strip()) > 8][:5]
        xml = "<plan>Decompose→reason→verify→conclude.</plan>\n<reasoning>\n"
        for i, step in enumerate(steps, 1):
            v = "<verify>ok</verify>" if i == len(steps) else ""
            xml += f'<step index="{i}">{step}{v}</step>\n'
        xml += f"</reasoning>\n<conclusion>\\boxed{{{ans}}}</conclusion>"
        sys_p = "You are ReasonBorn. Output only: <plan>,<reasoning><step>...</step></reasoning>,<conclusion>\\boxed{}."
        return {"text": (
            f"<|im_start|>system\n{sys_p}<|im_end|>\n"
            f"<|im_start|>user\n{q}<|im_end|>\n"
            f"<|im_start|>assistant\n{xml}<|im_end|>"
        )}
    except Exception:
        return None

def load_one(name, cfg):
    examples = []
    kwargs   = {"split": cfg["split"], "trust_remote_code": True}
    if "config" in cfg:
        kwargs["name"] = cfg["config"]
    try:
        ds = load_dataset(cfg["path"], **kwargs)
        if len(ds) > cfg["max_samples"]:
            ds = ds.select(range(cfg["max_samples"]))
        for ex in ds:
            r = format_example(ex)
            if r:
                examples.append(r)
        return name, examples, "ok"
    except Exception:
        pass
    try:
        ds = load_dataset(cfg["path"], streaming=True, **kwargs)
        for ex in ds:
            if len(examples) >= cfg["max_samples"]:
                break
            r = format_example(ex)
            if r:
                examples.append(r)
        return name, examples, "stream"
    except Exception:
        return name, [], "failed"

login()  # interactive Hugging Face Hub login

all_ex = []
with ThreadPoolExecutor(max_workers=6) as pool:
    futs = {pool.submit(load_one, n, c): n for n, c in DATA_MIX.items()}
    for fut in as_completed(futs):
        n, exs, status = fut.result()
        all_ex.extend(exs)

train_ds = Dataset.from_list(all_ex).shuffle(seed=42)
del all_ex
gc.collect()

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.pad_token    = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenized = train_ds.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=MAX_CTX, padding=False),
    batched=True, batch_size=4000, num_proc=16,
    remove_columns=["text"],
)
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) >= 8, num_proc=16)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    attn_implementation="eager",
)
model = model.to("cuda")
torch.cuda.synchronize()

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
model.enable_input_require_grads()

model = get_peft_model(model, LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
))

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir                  = "./chk",
    num_train_epochs            = EPOCHS,
    per_device_train_batch_size = BATCH_SIZE,
    gradient_accumulation_steps = GRAD_ACCUM,
    gradient_checkpointing      = True,
    optim                       = "adamw_torch_fused",
    learning_rate               = LR,
    bf16                        = True,
    fp16                        = False,
    logging_steps               = 25,
    save_strategy               = "steps",
    save_steps                  = 500,
    save_total_limit            = 2,
    warmup_ratio                = 0.05,
    lr_scheduler_type           = "cosine",
    weight_decay                = 0.01,
    max_grad_norm               = 0.5,
    dataloader_num_workers      = WORKERS,
    dataloader_pin_memory       = True,
    dataloader_prefetch_factor  = 4,
    report_to                   = "none",
    remove_unused_columns       = True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)

trainer.train()

os.makedirs(SAVE_DIR, exist_ok=True)
trainer.save_model(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
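As an illustration of what `format_example` produces, here is the same transformation applied to a toy GSM8K-style record (a standalone re-implementation of the script's logic, for demonstration only):

```python
import re

# Toy record in the question/answer shape format_example expects.
ex = {
    "question": "Tom has 3 boxes of 4 apples. How many apples?",
    "answer": "Each box has 4 apples.\nWith 3 boxes, 3 * 4 = 12 apples.\n\\boxed{12}",
}
q, s = ex["question"], ex["answer"]

# Extract the final answer from \boxed{}, then strip it from the reasoning.
boxed = re.search(r"\\boxed\{(.*?)\}", s, re.DOTALL)
ans = boxed.group(1).strip() if boxed else s[:80]
reasoning = re.sub(r"\\boxed\{.*?\}", "", s, flags=re.DOTALL).strip()

# Split reasoning into at most 5 non-trivial lines, one <step> each.
steps = [l.strip() for l in reasoning.split("\n") if len(l.strip()) > 8][:5]

xml = "<plan>Decompose→reason→verify→conclude.</plan>\n<reasoning>\n"
for i, step in enumerate(steps, 1):
    v = "<verify>ok</verify>" if i == len(steps) else ""  # tag the last step
    xml += f'<step index="{i}">{step}{v}</step>\n'
xml += f"</reasoning>\n<conclusion>\\boxed{{{ans}}}</conclusion>"
```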

⚠️ Limitations

This is a 3B-parameter model and may hallucinate on extremely complex mathematical proofs. It is strictly optimized for the XML reasoning format; using standard chat templates may result in lower performance.


💡 Technical Note: Early Stopping & Convergence

This model was originally scheduled for a full 2-epoch run. However, due to a kernel disconnection in the MI300X training environment, training was interrupted at Step 1000 (approximately 1.6 epochs).

In internal benchmarking and inference testing, this early-stopped checkpoint (Step 1000) demonstrated:

  • Generalization: the model avoided the "parroting" phase often seen in later epochs of LoRA fine-tuning.
  • Format Adherence: despite the interruption, the model fully converged on the ReasonBorn XML tag structure.
  • Mathematical Integrity: high-fidelity reasoning on multi-step calculations.

We have elected to release this Step 1000 checkpoint as the final version.

