KVzap

KVzap is a fast, adaptive, and faithful KV cache pruning method designed to accelerate LLM inference during both prefilling and decoding. It applies a lightweight model to the hidden states to predict an importance score for every KV pair and prunes the pairs whose score falls below a given threshold, following the Dynamic Memory Sparsification (DMS) inference strategy.
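
Conceptually, the pruning step reduces to scoring each cached KV pair from its hidden state and dropping the pairs that score below the threshold. The snippet below is a simplified sketch of that idea, not the kvpress implementation; the scorer, tensor shapes, and threshold value are illustrative placeholders.

import torch

# Simplified sketch of threshold-based KV pruning (illustrative, not the kvpress code).
# hidden_states: (seq_len, hidden_dim); keys/values: (seq_len, head_dim)
def prune_kv(hidden_states, keys, values, scorer, threshold=-4.0):
    scores = scorer(hidden_states).squeeze(-1)  # one importance score per KV pair
    keep = scores >= threshold                  # keep pairs at or above the threshold
    return keys[keep], values[keep]

# Example with a placeholder linear scorer
scorer = torch.nn.Linear(4096, 1)
hidden_states = torch.randn(1024, 4096)
keys, values = torch.randn(1024, 128), torch.randn(1024, 128)
pruned_keys, pruned_values = prune_kv(hidden_states, keys, values, scorer)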

The method was introduced in the paper KVzap: Fast, Adaptive, and Faithful KV Cache Pruning.

KVzap is trained as a fast approximation of KVzip+, using 1.2M samples from Nemotron-Pretraining-Dataset-sample. Training code is available in the kvpress repository.
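
The approximation can be pictured as a regression problem: a lightweight scorer is fit to reproduce the importance scores produced by the slower KVzip+ method. The snippet below is only a hypothetical illustration of such an objective; the actual objective, scorer architecture, and data processing are defined in the kvpress training code.

import torch
import torch.nn as nn

# Hypothetical illustration: fit a lightweight scorer to teacher (KVzip+) importance scores.
hidden_dim, num_kv_heads = 4096, 8
scorer = nn.Linear(hidden_dim, num_kv_heads)          # lightweight: ~1M parameters

hidden_states = torch.randn(4, 1024, hidden_dim)      # (batch, seq_len, hidden_dim)
teacher_scores = torch.randn(4, 1024, num_kv_heads)   # placeholder KVzip+ scores

optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(scorer(hidden_states), teacher_scores)
loss.backward()
optimizer.step()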

Usage

KVzap can be used with the kvpress library through the custom KVPressTextGenerationPipeline, which is automatically registered as a transformers pipeline under the name kv-press-text-generation when kvpress is imported:

import requests
from transformers import pipeline
from kvpress import KVzapPress, DMSPress

# Importing kvpress registers the kv-press-text-generation pipeline with transformers
model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")

# KV pairs whose predicted importance score falls below the threshold are pruned
# ("linear" matches this checkpoint, nvidia/KVzap-linear-Qwen3-8B)
press = DMSPress(KVzapPress(model_type="linear"), threshold=-4)

# Prefilling compression only, thinking disabled
press.decoding = False
context = requests.get("https://arxiv.org/abs/2601.07891").text
question = "\nWhat is this article about in 2 sentences?"
answer = pipe(context, question=question, press=press)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")

# Prefilling and decoding compression, thinking enabled
press.decoding = True
prompt = "What is the best hardware to run LLMs and why?"
answer = pipe(prompt, press=press, enable_thinking=True, max_new_tokens=2000)["answer"]
print(f"Compression ratio: {press.compression_ratio:.2%}\nAnswer: {answer}")

Citation

If you use KVzap in your research, please cite the following paper:

@article{jegou2025kvzap,
  title={KVzap: Fast, Adaptive, and Faithful KV Cache Pruning},
  author={Jegou, Simon and Jeblick, Maximilian},
  journal={arXiv preprint arXiv:2601.07891},
  year={2025},
  url={https://arxiv.org/abs/2601.07891}
}