rapidfire-ai-inc/Llama-3.1-8B-bnb-4bit

4-bit NF4 quantized version of Llama 3.1 8B for convenient QLoRA training and efficient inference.

TL;DR

  • Base model: meta-llama/Llama-3.1-8B
  • Quantization: 4-bit bitsandbytes (NF4 + double quant; bfloat16 compute)
  • Purpose: Ready-to-use base for QLoRA fine-tuning; also suitable for lightweight inference
  • Suggested dtype: torch.bfloat16 compute with 4-bit weights

Quickstart (Transformers + bitsandbytes)
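
The snippet below assumes a CUDA-capable GPU and recent releases of the usual libraries; the package names are the standard ones, and exact version pins are left to you:

pip install -U transformers accelerate bitsandbytes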

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "rapidfire-ai-inc/Llama-3.1-8B-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Write a haiku about GPUs."}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,   # enable sampling so temperature/top_p actually take effect
    temperature=0.7,
    top_p=0.9,
)
print(tok.decode(out[0], skip_special_tokens=True))
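
As a quick sanity check that the 4-bit weights fit your memory budget, you can print the model's in-memory footprint (on the order of 5–6 GiB for an 8B model in NF4; activations and KV cache are extra):

# Rough size of the quantized weights resident on the device.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")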

BitsAndBytes (4-bit) config

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
)
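
Since this checkpoint is positioned as QLoRA-ready, here is a minimal sketch of attaching a LoRA adapter with peft on top of the 4-bit model loaded above. The rank, alpha, dropout, and target module names are common choices for Llama-style models, not values prescribed by this repository.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit model loaded in the quickstart above.
model = prepare_model_for_kbit_training(model)  # cast norms/embeddings, enable input grads

lora_config = LoraConfig(
    r=16,             # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights require gradients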

Intended use & limitations

Use cases. A compact, QLoRA-ready starting point for supervised fine-tuning (SFT) or preference tuning, plus low-memory inference.
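
For illustration, a minimal SFT loop under those assumptions could look like the sketch below. The dataset, prompt formatting, and hyperparameters are placeholders; `model` and `tok` are the 4-bit model and tokenizer from the sections above, with the LoRA adapter already attached.

from datasets import load_dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

tok.pad_token = tok.eos_token  # Llama tokenizers ship without a pad token

# Example public instruction dataset; swap in your own data.
ds = load_dataset("yahma/alpaca-cleaned", split="train[:1%]")

def to_features(batch):
    texts = [f"{p}\n{r}" for p, r in zip(batch["instruction"], batch["output"])]
    return tok(texts, truncation=True, max_length=512)

ds = ds.map(to_features, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,  # 4-bit base + LoRA adapter
    args=TrainingArguments(
        output_dir="llama31-qlora-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()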

Limitations. Inherits all behaviors and restrictions of meta-llama/Llama-3.1-8B. May produce inaccurate or biased content. Do not deploy in high-risk settings without safeguards.

License. This repository is distributed under the Llama 3.1 license terms and inherits the upstream model's license and acceptable-use policies.


Notes

  • Trained weights are unchanged aside from quantization; no additional fine-tuning was performed.
  • Use apply_chat_template if the upstream tokenizer provides a chat template (a small guard is sketched below).
  • For best throughput on a single GPU, keep torch_dtype=torch.bfloat16 and load_in_4bit=True.
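
A small guard like the following (reusing `tok` and `messages` from the quickstart) falls back to a plain prompt when no template is defined:

if tok.chat_template is not None:
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
else:
    # Base (non-Instruct) checkpoints may ship without a chat template;
    # fall back to plain-text prompting in that case.
    prompt = messages[-1]["content"]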

Citation

@misc{rapidfireai_Llama_3_1_8B_bnb_4bit_2025,
  title        = {Llama-3.1-8B-bnb-4bit (RapidFire AI)},
  author       = {RapidFire AI, Inc.},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/rapidfire-ai-inc/Llama-3.1-8B-bnb-4bit}}
}