rapidfire-ai-inc/Mistral-7B-Instruct-v0.3-bnb-4bit

4-bit quantized (bitsandbytes) instruct model based on mistralai/Mistral-7B-Instruct-v0.3, trained with QLoRA supervised fine-tuning (SFT) on a 10% sample of HuggingFaceH4/ultrachat_200k.

TL;DR

  • Base model: mistralai/Mistral-7B-Instruct-v0.3
  • Quantization: 4-bit bitsandbytes (NF4 + double quant; bfloat16 compute)
  • PEFT: QLoRA; LoRA applied to attention & MLP: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Training: SFT on UltraChat 200k (10% sample) for 5 epochs
  • Seq length: 2048
  • Optimizer: adamw_8bit, cosine LR, warmup 10%
  • Effective batch size: 8 (per-device 2 × grad-accum 4)
  • Precision: bf16 compute

Intended use & limitations

Use cases. General assistant/chat and instruction following in English. The model is intended to produce helpful, concise responses for everyday tasks.

Limitations. May produce inaccurate or biased content and lacks built-in moderation. Do not deploy in high-risk domains without additional safety layers or human review.


Quickstart (Transformers + bitsandbytes)

Requires transformers, accelerate, bitsandbytes, and a recent CUDA build for 4-bit inference.
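
pip install -U transformers accelerate bitsandbytes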

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "rapidfire-ai-inc/Mistral-7B-Instruct-v0.3-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Explain diffusion models in simple terms."}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)  # the chat template already inserts special tokens
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # temperature/top_p only take effect when sampling is enabled
    temperature=0.7,
    top_p=0.9,
)
print(tok.decode(out[0], skip_special_tokens=True))
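
To check how much memory the 4-bit weights actually occupy, you can query the loaded model (roughly 4–5 GB is expected for a 7B model in 4-bit):

print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")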

Training details

Data

  • Dataset: HuggingFaceH4/ultrachat_200k
  • Sampling: a 10% subset used for SFT; this checkpoint includes no DPO or other preference alignment (see the sketch below).
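
The exact sampling procedure is not published; a minimal sketch using the datasets library's split slicing (the train_sft split name comes from the upstream dataset card):

from datasets import load_dataset

# Hypothetical reproduction of the 10% sample: take the first 10% of the SFT split.
# Whether the released subset was contiguous or shuffled is not stated in this card.
train_ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:10%]")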

Method

  • Approach: QLoRA (parameter-efficient fine-tuning on a 4-bit base)
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Hyperparameters

max_length = 2048
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
learning_rate = 2e-5
warmup_ratio = 0.1
weight_decay = 0.001
lr_scheduler_type = "cosine"
optim = "adamw_8bit"
bf16 = True
num_train_epochs = 5
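
The training script itself is not included in this card. As a minimal sketch, the values above map onto transformers.TrainingArguments as follows; output_dir is a placeholder, and max_length would be applied when tokenizing/packing the dataset rather than here:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft-output",    # placeholder, not from the original run
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.001,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",         # bitsandbytes 8-bit AdamW
    bf16=True,
    num_train_epochs=5,
)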

LoRA configuration

LoraConfig(
    task_type="CAUSAL_LM",
    r=64,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none",
)
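
A minimal sketch of attaching this configuration to the 4-bit base with PEFT, assuming the standard prepare-then-wrap flow (the actual training code is not published):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# lora_config is the LoraConfig instance shown above.
model = prepare_model_for_kbit_training(model)  # casts norms/embeddings for stable k-bit training
model = get_peft_model(model, lora_config)      # wraps the base with trainable LoRA adapters
model.print_trainable_parameters()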

BitsAndBytes (4-bit) config

BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

Inference tips

  • Keep torch_dtype=torch.bfloat16 with 4-bit to balance speed/quality.
  • Start with: max_new_tokens=256, temperature=0.6–0.9, top_p=0.9, repetition_penalty=1.1–1.2 (see the snippet after this list).
  • Use the tokenizer’s chat template (apply_chat_template) to ensure proper formatting.
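
For example, those starting values plugged into generate (do_sample=True is required for temperature and top_p to take effect):

out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
)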

Responsible AI & safety

This model can generate incorrect or harmful text. Add safety filters and human oversight for production deployments. Please report issues via the model repo.


License

Apache-2.0. Also comply with the base model’s license and usage terms.


Acknowledgements

  • Base model: Mistral-7B-Instruct-v0.3 by Mistral AI.
  • Dataset: UltraChat 200k by Hugging Face H4.

Citation

@misc{rapidfireai_mistral7b_bnb4bit_2025,
  title        = {Mistral-7B-Instruct-v0.3-bnb-4bit (RapidFire AI)},
  author       = {RapidFire AI, Inc.},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/rapidfire-ai-inc/Mistral-7B-Instruct-v0.3-bnb-4bit}}
}

Changelog

  • v1.0 — Initial release: 4-bit quantized checkpoint with QLoRA SFT on UltraChat 200k (10% sample).