---
license: apache-2.0
---

# ReplaceMe: Training-Free Transformer Pruning via Layer Removal & Linear Transformations

[![arXiv](https://img.shields.io/badge/arXiv-2505.02819-b31b1b.svg)](https://arxiv.org/abs/2505.02819)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

![ReplaceMe Logo](./figs/logo2.jpg)

## Model Description

ReplaceMe is a novel method for transformer model compression that enables **training-free** block/layer pruning while maintaining model performance through linear transformations (LTs). The approach:

- Identifies and removes a block of layers
- Applies mathematically derived transformations to preserve information flow
- Requires no fine-tuning or retraining
- Works with standard transformer architectures (the LTs are merged into the original model weights)

## Key Features

- 🚀 **Zero-Training Pruning**: Remove layers without any fine-tuning
- 🧠 **Performance Preservation**: <8% accuracy drop in most cases
- ⚡ **Instant Speedup**: Fewer blocks mean faster inference and lower memory use
- 🔌 **Plug-and-Play**: Works with existing HuggingFace models

## 🔥 Performance Comparison of Pruning Methods (Llama 2 7B, 25% Compression)

| Method | Train-Free? | C3 | CMNLI | CHID (test) | WSC | HellaSwag | PIQA | Race-M | Race-H | MMLU | CMMLU | **AVG** | **RP** |
|------------------|-------------|------|-------|-------------|------|-----------|------|--------|--------|------|-------|---------|---------|
| **Llama 2 7B** (baseline) | | 43.8 | 33.0 | 41.6 | 37.5 | 71.3 | 78.1 | 33.1 | 35.5 | 46.8 | 31.8 | 45.3 | 100.0% |
| **LLM-Streamline*** | ❌ | 🏆 43.3 | 33.0 | 24.1 | 36.5 | 🏆 61.1 | 🏆 71.5 | 34.8 | 37.0 | 45.5 | 29.4 | 41.6 | 92.0% |
| **LLMPruner*** | ❌ | 29.7 | 33.4 | 28.4 | 40.4 | 54.6 | 72.0 | 22.9 | 22.0 | 25.3 | 25.0 | 35.4 | 78.2% |
| **SliceGPT*** | ❌ | 31.5 | 31.6 | 18.5 | 43.3 | 47.5 | 68.3 | 27.0 | 29.4 | 28.8 | 24.8 | 35.1 | 77.5% |
| **LaCo*** | ❌ | 39.7 | 🏆 34.4 | 🏆 36.1 | 40.4 | 55.7 | 69.8 | 23.6 | 22.6 | 26.5 | 25.2 | 37.4 | 82.7% |
| **UIDL*** | ❌ | 40.2 | 🏆 34.4 | 21.5 | 40.4 | 59.7 | 69.0 | 35.2 | 34.7 | 44.6 | 28.9 | 40.9 | 90.3% |
| | | | | | | | | | | | | | |
| **ReplaceMe (this model)** | ✅ | 42.5 | 33.0 | 25.2 | 38.5 | 59.4 | 71.1 | 35.4 | 🏆 36.7 | 🏆 46.4 | 🏆 30.4 | 🏆 **41.9** | 🏆 **92.5%** |

**Key:**

- 🏆 Best performance in column
- ✅ Training-free (our method)
- ❌ Requires training
- \* Numbers taken from the LLM-Streamline paper

**Metrics Explained:**

- **RP**: Relative Performance (% of baseline)
- **Bold**: Best training-free results
- All numbers are accuracy scores

> 🔥 **Our training-free method achieves 92.5% of baseline performance, while the other approaches require expensive retraining!**

## Installation

```bash
pip install replaceme
# or
git clone https://github.com/mts-ai/ReplaceMe
cd ReplaceMe
pip install -e .
```

## Basic Usage

```bash
# LSTSQ method (recommended)
run_replaceme --config ./reproduce/Replace_Me_pipeline_lstsq.yaml

# Cosine similarity method
run_replaceme --config ./reproduce/Replace_Me_pipeline_cosine.yaml
```

There are many more parameters to experiment with; visit our repo to discover them 🔥🔥

## Load Model

Because the LTs are merged into the original transformer weights, the pruned model loads like any other HuggingFace model:

```python
## EXAMPLE
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MTSAIR/Llama2-5B-ReplaceMe"

# Load the pruned checkpoint exactly like a regular causal LM.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is the ReplaceMe pruning method?"
messages = [
    {"role": "user", "content": prompt}
]

# Render the chat messages into a single prompt string.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# The decoded text contains the prompt followed by the generated continuation.
response = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(response)
```

## Citation

If you use ReplaceMe in your research, please cite our paper:

```bibtex
@article{shopkhoev2025replaceme0,
  title   = {ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations},
  author  = {Dmitriy Shopkhoev and Ammar Ali and Magauiya Zhussip and Valentin Malykh and Stamatios Lefkimmiatis and Nikos Komodakis and Sergey Zagoruyko},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.02819}
}
```
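
## Worked Example: Checking the AVG and RP Columns

The comparison table above reports **AVG** and **RP** without spelling out the arithmetic. The sketch below assumes AVG is the plain mean of the ten benchmark accuracies and RP is a method's mean divided by the baseline's mean. This definition is inferred from the table, not stated by the authors, and the helper names `average` and `relative_performance` are made up for illustration. Under that assumption the reported 41.9 and 92.5% for ReplaceMe are reproduced up to rounding.

```python
# Accuracy scores copied from the comparison table (ten benchmarks per row).
scores = {
    "Llama 2 7B (baseline)": [43.8, 33.0, 41.6, 37.5, 71.3, 78.1, 33.1, 35.5, 46.8, 31.8],
    "LLM-Streamline": [43.3, 33.0, 24.1, 36.5, 61.1, 71.5, 34.8, 37.0, 45.5, 29.4],
    "ReplaceMe": [42.5, 33.0, 25.2, 38.5, 59.4, 71.1, 35.4, 36.7, 46.4, 30.4],
}


def average(values):
    """Plain arithmetic mean of a list of accuracy scores."""
    return sum(values) / len(values)


def relative_performance(method, baseline="Llama 2 7B (baseline)"):
    """Assumed definition: mean accuracy of `method` as a percentage of the baseline mean."""
    return 100.0 * average(scores[method]) / average(scores[baseline])


for name in scores:
    print(f"{name}: AVG={average(scores[name]):.2f}  RP={relative_performance(name):.2f}%")
```

The same function can be applied to any other row of the table by adding its ten scores to `scores`.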
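
## Sketch: Estimating and Merging a Linear Transformation

The model description says that removed blocks are replaced by linear transformations that are then merged into the existing weights, and the usage section names an LSTSQ variant. The toy example below illustrates that general idea with NumPy on synthetic data. It is not the ReplaceMe implementation: the names `H`, `W`, `X`, `Y`, `T` and the choice of which weight matrix absorbs the transformation are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_model = 512, 16, 8

# Hypothetical calibration data: H feeds a linear layer W that sits just
# before the block we pretend to remove.
H = rng.standard_normal((n_tokens, d_in))
W = rng.standard_normal((d_in, d_model))
X = H @ W  # activations entering the removed block

# Synthetic stand-in for the removed block: a nearly linear map plus small noise.
A_true = np.eye(d_model) + 0.1 * rng.standard_normal((d_model, d_model))
Y = X @ A_true + 0.01 * rng.standard_normal((n_tokens, d_model))

# Least-squares estimate of T minimizing ||X T - Y||_F.
T, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Fold T into the preceding weights so inference needs no extra layer.
W_merged = W @ T

rel_err = np.linalg.norm(H @ W_merged - Y) / np.linalg.norm(Y)
print(f"relative approximation error: {rel_err:.4f}")
```

Because `(H @ W) @ T` equals `H @ (W @ T)`, the merged matrix has the same shape as the original `W`, which is consistent with the pruned model loading through the standard HuggingFace API without custom layers.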