ADPQ
This work is part of a master's thesis. The library used for quantization is available as the auto-adpq package:
```bash
pip install auto-adpq
```
This is a compressed version of meta-llama/Llama-3.2-1B-Instruct created using 4-bit quantization. The model was quantized to reduce VRAM usage and increase inference speed while preserving most of the original model's performance.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tfloow/Llama-3.2-1B-Instruct-adpq-4bit-sim"

# Load the quantized checkpoint like any other transformers model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Run a quick generation
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```
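Since this is an Instruct model, prompts are typically wrapped in the tokenizer's chat template before generation. A minimal sketch, reusing the `model` and `tokenizer` loaded above (the prompt text and generation settings are illustrative):

```python
# Build a chat-formatted prompt with the tokenizer's chat template
messages = [{"role": "user", "content": "Explain 4-bit quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=50)
# Decode only the newly generated tokens
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```

Perplexity comparison with other quantization methods (lower is better):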
| Model Variant | Quantization Method | PPL (Perplexity) |
|---|---|---|
| meta-llama/Llama-3.1-8B | Baseline | 4.8693 |
| | BNB | 5.0733 |
| | AdpQ | 5.3671 |
| meta-llama/Llama-3.1-8B-Instruct | Baseline | 4.9080 |
| | BNB | 4.9993 |
| | AdpQ | 5.0069 |
| | AWQ | 5.0440 |
| | GPTQ | nan |
| meta-llama/Llama-3.2-1B | Baseline | 6.5546 |
| | AdpQ 9% | 6.9491 |
| | BNB | 6.9971 |
| | AdpQ 2% | 7.0380 |
| meta-llama/Llama-3.2-3B-Instruct | Baseline | 5.7864 |
| | AWQ | 5.8339 |
| | AdpQ | 5.9040 |
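Perplexity is the exponential of the average next-token negative log-likelihood over a held-out corpus. The sketch below shows one common way to measure it; the corpus (WikiText-2) and the 2048-token window are assumptions, not necessarily the exact setup behind the table above.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed evaluation setup: WikiText-2 test split, non-overlapping 2048-token windows
model_id = "Tfloow/Llama-3.2-1B-Instruct-adpq-4bit-sim"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids
stride = 2048

nlls = []
for start in range(0, ids.size(1) - 1, stride):
    chunk = ids[:, start : start + stride].to(model.device)
    with torch.no_grad():
        # labels=chunk makes the model return the mean next-token NLL for this window
        loss = model(chunk, labels=chunk).loss
    nlls.append(loss.float())

ppl = torch.exp(torch.stack(nlls).mean())
print(f"Perplexity: {ppl.item():.4f}")
```

The quantized checkpoints themselves were produced with the auto-adpq library: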
```python
import torch
from transformers import AutoModelForCausalLM
from auto_adpq import Auto_AdpQ, AutoAdpQConfig

model_name = "meta-llama/Llama-3.2-1B-Instruct"

# Set up the Auto-AdpQ configuration
group_size = 128  # quantization group size (assumed typical value; adjust as needed)
adpq_config = AutoAdpQConfig(
    group_size=group_size,
    n_iters=250,    # raises a UserWarning if set too low
    alpha=0.05,     # higher alpha lowers the perplexity loss but increases overhead
    device="cpu",
    q_bit=4,
    data_packing=False,
    symmetrical_quantization=True,
)

user = "Tfloow"
adpq_model_name = f"{user}/{model_name.split('/')[-1]}-adpq-4bit-sim"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Apply virtual (simulated) quantization to the model weights
quantized = Auto_AdpQ.apply_quantization(model, adpq_config, multi_threaded=16)

# Upload the quantized checkpoint to the Hub
model.push_to_hub(adpq_model_name)
```
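Once pushed, the resulting checkpoint can be loaded with the plain `from_pretrained` call shown in the usage example above.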
Base model: meta-llama/Llama-3.2-1B-Instruct