---
license: apache-2.0
---

# ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations
[![arXiv](https://img.shields.io/badge/arXiv-2505.02819-b31b1b.svg)](https://arxiv.org/abs/2505.02819)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)


![ReplaceMe Logo](./figs/logo2.jpg)

## Model Description
ReplaceMe is a novel method for transformer model compression that enables **training-free** block/layer pruning while maintaining model performance through linear transformations (LTs). The approach:

- Identifies and removes contiguous blocks of layers
- Applies mathematically derived linear transformations to preserve information flow
- Requires no fine-tuning or retraining
- Works with standard transformer architectures, since the LTs are merged into the original model weights (see the sketch below)
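
To make the idea concrete, here is a minimal, hypothetical sketch of the least-squares (LSTSQ) variant: given calibration activations entering a pruned block (`X`) and leaving it (`Y`), a single linear map `T` is fit to mimic the removed block. The function name, shapes, and random stand-in data below are illustrative only, not the package's API:

```python
import torch

def estimate_replacement(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Fit T minimizing ||X @ T - Y||_F with ordinary least squares."""
    # X, Y: (num_tokens, hidden_dim) calibration activations collected
    # at the input and output of the block slated for removal.
    return torch.linalg.lstsq(X, Y).solution  # (hidden_dim, hidden_dim)

# Toy stand-in for real calibration activations
X = torch.randn(4096, 512)
Y = torch.randn(4096, 512)
T = estimate_replacement(X, Y)
print(T.shape)  # torch.Size([512, 512])
```

Because `T` is linear, it can be folded into the weights of an adjacent layer, which is why the pruned model keeps the standard transformer architecture and loads without any custom code.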

## Key Features
- 🚀 **Zero-Training Pruning**: Remove layers without any fine-tuning
- 🧠 **Performance Preservation**: <8% accuracy drop in most cases
- ⚡ **Instant Speedup**: fewer blocks mean faster inference and lower memory use
- 🔌 **Plug-and-Play**: Works with existing HuggingFace models

## 🔥 Performance Comparison of Pruning Methods (Llama 2 7B, 25% Compression)

| Method           | Train-Free? |  C3  | CMNLI | CHID (test) | WSC  | HellaSwag | PIQA | Race-M | Race-H | MMLU | CMMLU | **AVG** | **RP**  |
|------------------|-------------|------|-------|-------------|------|-----------|------|--------|--------|------|-------|---------|---------|
| **Llama 2 7B** (baseline) |  | 43.8 | 33.0  | 41.6        | 37.5 | 71.3      | 78.1 | 33.1   | 35.5   | 46.8 | 31.8  | 45.3    | 100.0%  |
| **LLM-Streamline*** | ❌ | 🏆 43.3 | 33.0  | 24.1        | 36.5 | 🏆 61.1   | 🏆 71.5 | 34.8   | 37.0   | 45.5 | 29.4  | 41.6    | 92.0%   |
| **LLMPruner***      | ❌ | 29.7 | 33.4  | 28.4        | 40.4 | 54.6      | 72.0 | 22.9   | 22.0   | 25.3 | 25.0  | 35.4    | 78.2%   |
| **SliceGPT***       | ❌ | 31.5 | 31.6  | 18.5        | 43.3 | 47.5      | 68.3 | 27.0   | 29.4   | 28.8 | 24.8  | 35.1    | 77.5%   |
| **LaCo***           | ❌ | 39.7 | 🏆 34.4 | 🏆 36.1     | 40.4 | 55.7      | 69.8 | 23.6   | 22.6   | 26.5 | 25.2  | 37.4    | 82.7%   |
| **UIDL***           | ❌ | 40.2 | 🏆 34.4 | 21.5        | 40.4 | 59.7      | 69.0 | 35.2   | 34.7   | 44.6 | 28.9  | 40.9    | 90.3%   |
|                  |             |      |       |             |      |           |      |        |        |      |       |         |         |
| **ReplaceMe (this model)**  | ✅ | 42.5 | 33.0  | 25.2        | 38.5 | 59.4      | 71.1 | 35.4   | 🏆 36.7 | 🏆 46.4 | 🏆 30.4 | 🏆 **41.9** | 🏆 **92.5%** |

**Key:**
- 🏆 Best performance in column
- ✅ Training-free (our methods)
- ❌ Requires training
- *Numbers taken from the LLM-Streamline paper

**Metrics Explained:**
- **RP**: Relative Performance (% of baseline)
- **Bold**: Best training-free results
- All numbers are accuracy scores
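
The RP figure reported for this model can be reproduced directly from the averages in the table:

```python
# Relative Performance (RP) = method average / baseline average, in %.
baseline_avg = 45.3    # Llama 2 7B average accuracy (table above)
replaceme_avg = 41.9   # ReplaceMe average accuracy (table above)

rp = 100 * replaceme_avg / baseline_avg
print(f"RP = {rp:.1f}%")  # RP = 92.5%
```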

> 🔥 **Our training-free methods achieve 92.5% of baseline performance while other approaches require expensive retraining!**

## Installation
```bash
pip install replaceme
# or
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```
## Basic Usage
```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```
There are many parameters you can tune; visit our repo and discover them 🔥🔥
## Load Model
Since the LTs are merged into the original transformer weights, you load the pruned model just like any other HuggingFace model:
```python
# Example: load the pruned model and run chat-style generation
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama2-5B-ReplaceMe"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the generated answer is decoded
generated = output[0][model_inputs.input_ids.shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)

```
## Citation
If you use ReplaceMe in your research, please cite our paper:

```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```