---
base_model:
- meta-llama/Llama-3.2-3B-Instruct
datasets:
- JunxiongWang/sftdatasetv3
- HuggingFaceH4/ultrafeedback_binarized
- HuggingFaceH4/orca_dpo_pairs
- JunxiongWang/llama3-ultrafeedback-armorm
model-index:
- name: X-EcoMLA-3B3B-fixed-kv816-DPO
  results: []
tags:
- alignment-handbook
- generated_from_trainer
license: apache-2.0
---

# X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

X-EcoMLA is an efficient KV cache compression technique for large language models (LLMs), proposed by AMD, that upcycles the attention blocks of a pre-trained transformer into Multi-head Latent Attention (MLA) for extreme KV cache compression and computational efficiency. Instead of training an MLA model from scratch, X-EcoMLA first initializes the MLA weights from a Singular Value Decomposition (SVD) of the existing transformer weights, followed by lightweight pre-training or post-training distillation.

This model, `X-EcoMLA-3B3B-fixed-kv816-DPO`, was created by efficiently adapting the pre-trained `Llama-3.2-3B-Instruct` model through post-training on AMD Instinct™ MI300X GPUs, bypassing the need for costly pre-training from scratch.

## Key Takeaways

- Announcing X-EcoMLA, an efficient approach for upcycling existing transformer blocks into MLA.
- Extreme KV Cache Compression: X-EcoMLA reduces the KV cache size by 6.4x-10.6x with only 3.6B-7B training tokens, while preserving almost 100% of the base model's average zero-shot performance on LM Harness tasks.
- Novel SVD Initialization: X-EcoMLA employs an efficient SVD-based weight initialization that dramatically improves training efficiency and model performance.

## Model Composition Pipeline

The X-EcoMLA models are not trained from scratch. Instead, they are composed from powerful pre-trained Transformers through a lightweight and efficient pipeline. The creation of this model followed the stages below (stages 2-4 are sketched in code after the table):

| Stage | Action | Description |
|-------------------|--------------------------------------|-------------|
| 1. Base Model | Llama-3.2-3B-Instruct | The starting point is a high-quality, pre-trained Transformer model. |
| 2. Initialization | Structured Weight Mapping | MLA modules are initialized from the base model's weights using SVD. |
| 3. SFT | End-to-End Knowledge Distillation | The initialized model is fine-tuned via knowledge distillation from a teacher model. |
| 4. Alignment | Direct Preference Optimization (DPO) | In the final stage, DPO is used to align the model's preferences, with the distilled student model itself serving as the reference model for stability. |
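Stage 2 is key to the training efficiency: rather than initializing the new MLA projections randomly, X-EcoMLA initializes them from a truncated SVD of the pre-trained attention weights, so the upcycled model starts close to the base model. Below is a minimal sketch of SVD-based low-rank initialization on a single projection matrix; the actual mapping into MLA's joint KV down-projection and per-head up-projections is more involved (see the paper and the `AMD-Hybrid-Models` repo), and the shapes here are purely illustrative.

```python
import torch

def svd_low_rank_init(W: torch.Tensor, rank: int):
    """Factor a pretrained projection W (out_dim x in_dim) into an
    up-projection (out_dim x rank) and a down-projection (rank x in_dim)
    via truncated SVD, so that W_up @ W_down approximates W."""
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
    sqrt_S = torch.sqrt(S[:rank])
    W_up = U[:, :rank] * sqrt_S            # scale columns by sqrt(singular values)
    W_down = sqrt_S[:, None] * Vh[:rank]   # scale rows by sqrt(singular values)
    return W_up, W_down

# Illustrative shapes only: compress a square projection to rank 816
W = torch.randn(3072, 3072)
W_up, W_down = svd_low_rank_init(W, rank=816)
rel_err = torch.linalg.matrix_norm(W - W_up @ W_down) / torch.linalg.matrix_norm(W)
print(f"relative reconstruction error: {rel_err:.3f}")
```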
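Stage 3 then fine-tunes the SVD-initialized student end-to-end against a frozen teacher (for this model, `Llama-3.2-3B-Instruct` itself; other variants in the table below use larger teachers). A minimal sketch of a standard logit-distillation loss of this general kind; the exact objective and temperature are choices of the training recipe and are not specified here:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the frozen teacher's and the student's
    next-token distributions; logits are (batch, seq_len, vocab)."""
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    # batchmean averages over token positions; T^2 keeps the gradient scale stable
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```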
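Stage 4 runs standard DPO, with one notable design choice called out above: the frozen reference model is the distilled student itself rather than the original base model, which stabilizes training. A minimal sketch of the DPO objective follows; the inputs are hypothetical summed per-response log-probabilities, and `beta` is an illustrative default:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer chosen over rejected responses,
    measured relative to the frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```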
## Training Data

| Stage | Dataset | License |
|-------|---------|---------|
| SFT | https://huggingface.co/datasets/teknium/OpenHermes-2.5 | Refer to source materials |
| SFT | https://huggingface.co/datasets/tomg-group-umd/GenQA | CC BY-NC 4.0 |
| SFT | https://huggingface.co/datasets/BAAI/Infinity-Instruct | CC BY-SA 4.0 |
| DPO | https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized | MIT |
| DPO | https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs | MIT |
| DPO | https://huggingface.co/datasets/JunxiongWang/llama3-ultrafeedback-armorm | MIT |

## Getting Started

### Installation

```bash
git clone https://github.com/AMD-AIG-AIMA/AMD-Hybrid-Models.git
cd AMD-Hybrid-Models/X-EcoMLA
```

Then follow the installation instructions in the `AMD-AIG-AIMA/AMD-Hybrid-Models` repo.

### Example Usage

Once the installation is complete, you can run the following code for a quick test:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from mla.hybrid_wrapper import MLATransformerHybridModelWrapper

checkpoint = "amd/X-EcoMLA-3B3B-fixed-kv816-DPO"

model = MLATransformerHybridModelWrapper.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model.eval()

# Format the prompt using the chat template
prompt = [{"role": "user", "content": "What are the benefits of hybrid language models?"}]
input_ids = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
).cuda()

# Generate a response
tokens = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```

### Model Evaluation

```bash
python benchmark/llm_eval/lm_harness_eval.py \
    --model mla_hybrid \
    --model_args pretrained="amd/X-EcoMLA-3B3B-fixed-kv816-DPO" \
    --tasks mmlu,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa,pubmedqa,race \
    --num_fewshot 0 --device cuda --batch_size 16
```

### Model details

| Model | KV Size | Target Model | Teacher Model | Training Tokens | Pre-/Post-Training | r_kv | r_q | d_rope | d_nope |
|-------|--------:|--------------|---------------|----------------:|--------------------|-----:|-----:|-------:|-------:|
| X-EcoMLA-1B1B-fixed-kv512-DPO | 53.1% | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | 7B | Post | 512 | 864 | 32 | 32 |
| X-EcoMLA-1B1B-dynamic-0.95-DPO | 54.7% | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | 7B | Post | 0.95 | 0.95 | 32 | 32 |
| X-EcoMLA-1B8B-fixed-kv64-DPO | 9.4% | Llama-3.2-1B-Instruct | Llama-3.1-8B-Instruct | 7B | Post | 64 | 1424 | 32 | 32 |
| X-EcoMLA-3B3B-fixed-kv816-DPO | 43% | Llama-3.2-3B-Instruct | Llama-3.2-3B-Instruct | 7B | Post | 816 | 1536 | 64 | 64 |
| X-EcoMLA-3B3B-dynamic-0.95-DPO | 43% | Llama-3.2-3B-Instruct | Llama-3.2-3B-Instruct | 7B | Post | 0.95 | 0.95 | 64 | 64 |
| X-EcoMLA-SmolLM-1.7B-fixed-kv480-Pretrain | 12.5% | SmolLM-1.7B | - | 6B | Pre | 480 | 2048 | 32 | 32 |
| X-EcoMLA-SmolLM-1.7B1.7B-fixed-kv480-Pretrain | 12.5% | SmolLM-1.7B | SmolLM-1.7B | 6B | Pre | 480 | 2048 | 32 | 32 |
| X-EcoMLA-SmolLM-1.7B1.7B-fixed-kv480-DPO | 12.5% | SmolLM-1.7B-Instruct | SmolLM-1.7B-Instruct | 7B | Post | 480 | 2048 | 32 | 32 |
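For the fixed-rank variants, the `KV Size` column can be reproduced from these hyperparameters, assuming the MLA cache stores a compressed latent of dimension `r_kv` plus `d_rope` decoupled-RoPE dimensions per token per layer, versus `2 * n_kv_heads * head_dim` for the baseline's full keys and values. A quick sanity check (head counts and dimensions are the published Llama-3.2 configurations):

```python
def mla_kv_fraction(r_kv: int, d_rope: int, n_kv_heads: int, head_dim: int) -> float:
    """Per-token, per-layer MLA cache size relative to a standard KV cache."""
    return (r_kv + d_rope) / (2 * n_kv_heads * head_dim)

# Llama-3.2-3B-Instruct: 8 KV heads, head_dim 128
print(f"{mla_kv_fraction(816, 64, 8, 128):.1%}")  # 43.0% -> the "43%" row
# Llama-3.2-1B-Instruct: 8 KV heads, head_dim 64
print(f"{mla_kv_fraction(512, 32, 8, 64):.1%}")   # 53.1%
print(f"{mla_kv_fraction(64, 32, 8, 64):.1%}")    # 9.4% (1B8B, kv64)
```

The `dynamic-0.95` variants instead choose ranks per layer (presumably by retaining 95% of the SVD energy), so their KV Size is a measured aggregate rather than a closed-form value.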
### Benchmark results

X-EcoMLA was evaluated on zero-shot tasks from the LM Evaluation Harness and compared against its base model. The results demonstrate that X-EcoMLA provides a superior balance of performance and efficiency.

| Tasks | Metric | Llama-3.2-3B-Instruct | X-EcoMLA-3B3B-fixed-kv816-DPO | X-EcoMLA-3B3B-dynamic-0.95-DPO |
|-------------------|----------|----------------:|----------------:|----------------:|
| arc_challenge | acc | 0.4369±0.0145 | 0.4753±0.0146 | 0.4710±0.0146 |
| | acc_norm | 0.4590±0.0146 | 0.4821±0.0146 | 0.4846±0.0146 |
| arc_easy | acc | 0.7428±0.0090 | 0.7660±0.0087 | 0.7580±0.0088 |
| | acc_norm | 0.6776±0.0096 | 0.7045±0.0094 | 0.6999±0.0094 |
| hellaswag | acc | 0.5222±0.0050 | 0.5288±0.0050 | 0.5320±0.0050 |
| | acc_norm | 0.7036±0.0046 | 0.7224±0.0045 | 0.7226±0.0045 |
| mmlu | acc | 0.6046±0.1057 | 0.5742±0.1014 | 0.5773±0.1028 |
| - humanities | acc | 0.5926±0.0826 | 0.5507±0.0843 | 0.5518±0.0851 |
| - other | acc | 0.6598±0.1118 | 0.6312±0.1011 | 0.6344±0.1070 |
| - social_sciences | acc | 0.6701±0.0712 | 0.6383±0.0741 | 0.6422±0.0765 |
| - stem | acc | 0.5043±0.1122 | 0.4906±0.1089 | 0.4960±0.1071 |
| openbookqa | acc | 0.2740±0.0200 | 0.2920±0.0204 | 0.3000±0.0205 |
| | acc_norm | 0.3620±0.0215 | 0.3840±0.0218 | 0.3940±0.0219 |
| piqa | acc | 0.7606±0.0100 | 0.7573±0.0100 | 0.7579±0.0100 |
| | acc_norm | 0.7557±0.0100 | 0.7655±0.0099 | 0.7579±0.0100 |
| pubmedqa | acc | 0.6960±0.0206 | 0.6680±0.0211 | 0.6840±0.0208 |
| race | acc | 0.4077±0.0152 | 0.4622±0.0154 | 0.4632±0.0154 |
| winogrande | acc | 0.6717±0.0132 | 0.6859±0.0130 | 0.6590±0.0133 |

## Conclusion

X-EcoMLA demonstrates an efficient technique for upcycling pre-trained Transformers into MLA modules to compress the KV cache. This work highlights the viability of post-training hybridization as a cost-effective and environmentally sustainable alternative to full retraining, paving the way for the deployment of powerful LLMs in resource-constrained environments.

## Bias, Risks, and Limitations

- This model is a research artifact and has not been evaluated for safety in production use cases.
- The model's performance depends on the quality of its pre-trained base model and the teacher model used during distillation; its capabilities and biases are inherited from these sources.
- The model may generate content that is factually inaccurate, biased, or otherwise objectionable. Users should be aware of these risks and implement appropriate safeguards for their applications.
- One limitation of this work is the reliance on a strong teacher model for knowledge transfer, which may not always be available. Distillation from a teacher also adds to the resource requirements of the post-training phase.

## Citation

If you find this model useful, please consider citing the original paper:

```
@article{li2025x,
  title={X-EcoMLA: Upcycling pre-trained attention into MLA for efficient and extreme KV compression},
  author={Li, Guihong and Rezagholizadeh, Mehdi and Yang, Mingyu and Appia, Vikram and Barsoum, Emad},
  journal={arXiv preprint arXiv:2503.11132},
  year={2025}
}
```