Llama-3.2-1B-Instruct - ADPQ 4-bit Quantized

This work is part of a master's thesis. The library used for quantization is available as auto-adpq and can be installed with:

pip install auto-adpq

Model Description

This is a compressed version of meta-llama/Llama-3.2-1B-Instruct created using 4-bit quantization.

This model was quantized to reduce VRAM usage and increase inference speed while preserving most of the original model's performance.

Quantization Details

  • Original Model: meta-llama/Llama-3.2-1B-Instruct
  • Quantization Method: ADPQ (Adaptive Quantization with data-free calibration)
  • Precision: 4-bit
  • Simulated: Yes (see the sketch below for what this means in practice)
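
Simulated (fake) quantization means the weights are rounded to a 4-bit grid but stored dequantized in BF16: the checkpoint loads like a regular BF16 model, so there is no memory saving on disk, and only the accuracy impact of 4-bit rounding is reproduced. Below is a minimal, generic sketch of symmetric group-wise quantize-dequantize; it illustrates the idea only, not the exact ADPQ algorithm, and the group size is a placeholder.

import torch

def fake_quantize_symmetric(w: torch.Tensor, q_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    """Round weights to a symmetric q_bit grid per group, then dequantize back."""
    orig_shape = w.shape
    groups = w.reshape(-1, group_size).float()                   # one row per quantization group
    qmax = 2 ** (q_bit - 1) - 1                                  # 7 for 4-bit symmetric
    scale = (groups.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(orig_shape).to(w.dtype)           # weights stay in the original dtype

w_sim = fake_quantize_symmetric(torch.randn(256, 256, dtype=torch.bfloat16))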

How to Use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tfloow/Llama-3.2-1B-Instruct-adpq-4bit-sim"

# The checkpoint is stored in BF16 (simulated quantization), so it loads like any regular model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Move the inputs to whichever device the model was dispatched to
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
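
Since this is an instruct-tuned model, prompts are normally wrapped in the chat template rather than passed as raw text. A short sketch, continuing from the snippet above (the question is just an example):

messages = [
    {"role": "user", "content": "Explain 4-bit quantization in one sentence."},
]
# apply_chat_template formats the conversation in the Llama 3 instruct format
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))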

Performance

| Model Variant                    | Quantization Method | PPL (Perplexity) |
|----------------------------------|---------------------|------------------|
| meta-llama/Llama-3.1-8B          | Baseline            | 4.8693           |
|                                  | BNB                 | 5.0733           |
|                                  | AdpQ                | 5.3671           |
| meta-llama/Llama-3.1-8B-Instruct | Baseline            | 4.9080           |
|                                  | BNB                 | 4.9993           |
|                                  | AdpQ                | 5.0069           |
|                                  | AWQ                 | 5.0440           |
|                                  | GPTQ                | nan              |
| meta-llama/Llama-3.2-1B          | Baseline            | 6.5546           |
|                                  | AdpQ 9%             | 6.9491           |
|                                  | BNB                 | 6.9971           |
|                                  | AdpQ 2%             | 7.0380           |
| meta-llama/Llama-3.2-3B-Instruct | Baseline            | 5.7864           |
|                                  | AWQ                 | 5.8339           |
|                                  | AdpQ                | 5.9040           |
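
PPL is the standard autoregressive perplexity, i.e. the exponential of the mean next-token cross-entropy. The evaluation corpus and context length behind the numbers above are not recorded in this card; the sketch below only shows how such a number is typically computed over non-overlapping windows, as a reference for the metric itself.

import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str, ctx_len: int = 2048) -> float:
    """Exponentiated average next-token cross-entropy over non-overlapping windows."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nll, n_tokens = 0.0, 0
    for start in range(0, ids.shape[1] - 1, ctx_len):
        chunk = ids[:, start : start + ctx_len]
        out = model(chunk, labels=chunk)   # Hugging Face shifts the labels internally
        n = chunk.shape[1] - 1             # number of predicted tokens in this window
        nll += out.loss.item() * n
        n_tokens += n
    return math.exp(nll / n_tokens)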

How was the model quantized?


import torch
from transformers import AutoModelForCausalLM

from auto_adpq import Auto_AdpQ, AutoAdpQConfig

model_name = "meta-llama/Llama-3.2-1B-Instruct"
group_size = 128  # placeholder: the exact group size used for this checkpoint is not recorded here

# Setup Auto-AdpQ configuration
adpq_config = AutoAdpQConfig(
    group_size=group_size,
    n_iters=250,  # throws a UserWarning if set too low
    alpha=0.05,   # higher alpha reduces the perplexity degradation but increases the quantization overhead
    device="cpu",
    q_bit=4,
    data_packing=False,
    symmetrical_quantization=True,
)

user = "Tfloow"
adpq_model_name = f"{user}/{model_name.split('/')[-1]}-adpq-4bit-sim"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# virtual quantization
quantized = Auto_AdpQ.apply_quantization(model, adpq_config, multi_threaded=16)

model.push_to_hub(adpq_model_name)
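
Because the quantization is simulated, the pushed checkpoint is ordinary BF16 safetensors. A quick sanity check is to count the distinct values inside one quantization group, which should not exceed 2**4 = 16 for 4-bit weights. The snippet below assumes the weights were modified in place and that groups run contiguously along the flattened weight rows, and it reuses the placeholder group_size from above; adapt the layer path if needed.

# Inspect the first group of one Llama attention projection
w = model.model.layers[0].self_attn.q_proj.weight.detach()
first_group = w.flatten()[:group_size]
print(first_group.unique().numel(), "distinct values (expected <= 16)")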