rapidfire-ai-inc/Llama-3.1-8B-bnb-4bit

4-bit NF4 quantized version of Llama 3.1 8B for convenient QLoRA training and efficient inference.

TL;DR

  • Base model: meta-llama/Llama-3.1-8B
  • Quantization: 4-bit bitsandbytes (NF4 + double quant; bfloat16 compute)
  • Purpose: Ready-to-use base for QLoRA fine-tuning; also suitable for lightweight inference
  • Suggested dtype: torch.bfloat16 compute with 4-bit weights

Quickstart (Transformers + bitsandbytes)
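
The snippet below assumes a CUDA-capable GPU and recent releases of the usual libraries; the package names are the standard ones, and exact version pins are left to you:

pip install -U transformers accelerate bitsandbytes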

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "rapidfire-ai-inc/Llama-3.1-8B-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Write a haiku about GPUs."}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,   # enable sampling so temperature/top_p actually take effect
    temperature=0.7,
    top_p=0.9,
)
print(tok.decode(out[0], skip_special_tokens=True))
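
As a quick sanity check that the 4-bit weights fit your memory budget, you can print the model's in-memory footprint (on the order of 5–6 GiB for an 8B model in NF4; activations and KV cache are extra):

# Rough size of the quantized weights resident on the device.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")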

BitsAndBytes (4-bit) config

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
)
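
Since this checkpoint is positioned as QLoRA-ready, here is a minimal sketch of attaching a LoRA adapter with peft on top of the 4-bit model loaded above. The rank, alpha, dropout, and target module names are common choices for Llama-style models, not values prescribed by this repository.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit model loaded in the quickstart above.
model = prepare_model_for_kbit_training(model)  # cast norms/embeddings, enable input grads

lora_config = LoraConfig(
    r=16,             # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights require gradients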

Intended use & limitations

Use cases. A compact, QLoRA-ready starting point for supervised fine-tuning (SFT) or preference tuning, plus low-memory inference.
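
For illustration, a minimal SFT loop under those assumptions could look like the sketch below. The dataset, prompt formatting, and hyperparameters are placeholders; `model` and `tok` are the 4-bit model and tokenizer from the sections above, with the LoRA adapter already attached.

from datasets import load_dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

tok.pad_token = tok.eos_token  # Llama tokenizers ship without a pad token

# Example public instruction dataset; swap in your own data.
ds = load_dataset("yahma/alpaca-cleaned", split="train[:1%]")

def to_features(batch):
    texts = [f"{p}\n{r}" for p, r in zip(batch["instruction"], batch["output"])]
    return tok(texts, truncation=True, max_length=512)

ds = ds.map(to_features, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,  # 4-bit base + LoRA adapter
    args=TrainingArguments(
        output_dir="llama31-qlora-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()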

Limitations. Inherits all behaviors and restrictions of meta-llama/Llama-3.1-8B. May produce inaccurate or biased content. Do not deploy in high-risk settings without safeguards.

License. This repository is distributed under the Llama 3.1 license terms and inherits the upstream model's license and acceptable-use policies.


Notes

  • Trained weights are unchanged aside from quantization; no additional fine-tuning was performed.
  • Use apply_chat_template if the upstream tokenizer provides a chat template (a small guard is sketched below).
  • For best throughput on a single GPU, keep torch_dtype=torch.bfloat16 and load_in_4bit=True.
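
A small guard like the following (reusing `tok` and `messages` from the quickstart) falls back to a plain prompt when no template is defined:

if tok.chat_template is not None:
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
else:
    # Base (non-Instruct) checkpoints may ship without a chat template;
    # fall back to plain-text prompting in that case.
    prompt = messages[-1]["content"]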

Citation

@misc{rapidfireai_Llama_3_1_8B_bnb_4bit_2025,
  title        = {Llama-3.1-8B-bnb-4bit (RapidFire AI)},
  author       = {RapidFire AI, Inc.},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/rapidfire-ai-inc/Llama-3.1-8B-bnb-4bit}}
}