---
inference: false
base_model: c4ai/command-a-03-2025
pipeline_tag: text-generation
model_type: command-a
tags:
- quantization
- onebit
- compression
- command-a
- text-generation
library_name: transformers
language:
- en
- ja
license: cc-by-nc-4.0
extra_gated_prompt: "By submitting this form, you agree to the [License Agreement](https://cohere.com/c4ai-cc-by-nc-license) and acknowledge that the information you provide will be collected, used, and shared in accordance with Cohere’s [Privacy Policy]( https://cohere.com/privacy). You’ll receive email updates about Cohere Labs and Cohere research, events, products and services. You can unsubscribe at any time."
extra_gated_fields:
Name: text
Affiliation: text
Country: country
I agree to use this model for non-commercial use ONLY: checkbox
---
# **Model Card for qep-1bit-extreme**
🚨 **This model is a 1-bit quantized version of Cohere Labs Command A, produced with QEP.** You can find the unquantized version of Cohere Labs Command A [here](https://huggingface.co/CohereLabs/c4ai-command-a-03-2025).
## **Model Summary**
An optimized 1-bit quantized version of [c4ai/command-a-03-2025](https://huggingface.co/CohereLabs/c4ai-command-a-03-2025) achieving **6.7× compression** while retaining strong performance through advanced quantization optimization techniques.
## Key Features
- **Extreme Compression**: 6.7× smaller (207GB → 30.2GB, an 85% reduction); the 30.2GB model fits on a single GPU (e.g., one A100 80GB).
- **Enhanced Performance**: [OneBit](https://arxiv.org/abs/2402.11295) quantization, enhanced by Fujitsu [QEP](https://arxiv.org/abs/2504.09629) & [QQA](https://iclr.cc/virtual/2025/poster/30713).
- **Inference Speedup**: Faster inference via BitLinear (1-bit linear layer) computation.
## Model Details
- **Base Model**: c4ai/command-a-03-2025
- **Quantization Method**: [OneBit](https://openreview.net/forum?id=ZwiG9KjfHV) with Fujitsu [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713) optimization
- **Quantization Bits**: 1-bit for layers 0-61, FP16 for the last 2 layers (62-63)
- **Optimization Techniques**: Fujitsu [QEP](https://arxiv.org/abs/2504.09629), [QQA](https://iclr.cc/virtual/2025/poster/30713)
- **Compatible Hardware**: Single GPU (recommended: >= 40GB VRAM)
* Developed by: [Fujitsu](https://fujitsu.com/), [Cohere](https://cohere.com/), and [Cohere Labs](https://cohere.for.ai/)
* Point of Contact: [Contact form](https://contactline.jp.fujitsu.com/customform/csque04802/873532/) or [Email]([email protected])
* License: [CC-BY-NC](https://cohere.com/cohere-labs-cc-by-nc-license); use also requires adherence to [Cohere Labs' Acceptable Use Policy](https://docs.cohere.com/docs/cohere-labs-acceptable-use-policy)

For more details on how this model was developed, check out our [Press Release (English)](https://global.fujitsu/-/media/Project/Fujitsu/Fujitsu-HQ/pr/news/2025/09/08-01-en.pdf), [Press Release (Japanese)](https://global.fujitsu/ja-jp/pr/news/2025/09/08-01), Fujitsu's [Tech Report](https://arxiv.org/abs/2504.09629), and Cohere's [Tech Report](https://arxiv.org/abs/2504.00698).
## Usage
The base architecture of this model is **Command-A**. To load and use the model, please use the **CommandA model class**:
1. Load `model.safetensors`, which contains the quantized weights.
2. Replace all layers **except the last two** with **bitlinear implementations**.
3. Keep the **last two layers with non-quantized weights** for optimal performance.
4. The model requires the included `onebit_linear.py` for the quantized layer implementation. The weights contain the **OneBit-specific a, S, and b components** needed to reconstruct each quantized matrix (see the sketch after this list).
5. Depending on the level of performance you wish to maintain, you may keep additional layers near the output unquantized.
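As a hedged illustration of step 4: the OneBit paper's sign-value-independent decomposition approximates each weight matrix as W ≈ (a bᵀ) ⊙ S, where S is a {-1, +1} sign matrix and a, b are FP16 value vectors. The sketch below assumes this is what `onebit_linear.py` implements, and that a scales outputs while b scales inputs; check the shipped file for the exact formula and parameter names.

```python
import torch

class OneBitLinearSketch(torch.nn.Module):
    """Hedged sketch of a OneBit layer: W is approximated as (a b^T) ⊙ S,
    with S a {-1,+1} sign matrix and a, b FP16 value vectors. Which of the
    checkpoint's `a`/`b` scales inputs vs. outputs is our assumption."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # A real implementation packs S to 1 bit per weight; FP16 for clarity.
        self.S = torch.nn.Parameter(
            torch.ones(out_features, in_features, dtype=torch.float16),
            requires_grad=False)
        self.a = torch.nn.Parameter(torch.ones(out_features, dtype=torch.float16))
        self.b = torch.nn.Parameter(torch.ones(in_features, dtype=torch.float16))

    def forward(self, x):
        # y_i = a_i * sum_j S_ij * (b_j * x_j)  ==  x @ ((a b^T) ⊙ S)^T
        return ((x * self.b) @ self.S.T) * self.a
```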
**Note:** Direct loading support as an extension of the `transformers` package is planned for future releases. In the meantime, a minimal manual loading sketch follows.
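The sketch below walks through steps 1-4 end to end. The repo id, the `OneBitLinear` class name, and its `nn.Linear`-style constructor are assumptions rather than confirmed APIs, and memory-efficient initialization is omitted for brevity.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from safetensors.torch import load_file
from onebit_linear import OneBitLinear  # shipped with this repo; class name assumed

def replace_with_bitlinear(model, keep_last=2):
    """Swap nn.Linear inside all but the last `keep_last` decoder layers
    (layers 0-61 in this checkpoint) for the 1-bit implementation."""
    layers = model.model.layers
    for layer in layers[: len(layers) - keep_last]:
        for name, module in list(layer.named_modules()):
            if isinstance(module, torch.nn.Linear):
                parent = layer
                *path, child = name.split(".")
                for part in path:
                    parent = getattr(parent, part)
                setattr(parent, child,
                        OneBitLinear(module.in_features, module.out_features))
    return model

config = AutoConfig.from_pretrained("qep/qep-1bit-extreme")
model = AutoModelForCausalLM.from_config(config).half()
model = replace_with_bitlinear(model)
state = load_file("model.safetensors")  # contains S plus FP16 a and b per layer
model.load_state_dict(state, strict=False)
model.to("cuda").eval()

tokenizer = AutoTokenizer.from_pretrained("qep/qep-1bit-extreme")
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=32)[0]))
```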
## Requirements
```
torch>=2.0.0
transformers>=4.35.0
safetensors>=0.4.0
```
## Performance
- **Memory Usage**: 6.7x reduction overall (207GB → 30.2GB)
- **Inference Speed**: Optimized for fast generation on a single GPU
- **Quality**: Enhanced performance through [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713) optimization
- **Compatibility**: Single GPU deployment capable
## Technical Specifications
- **Original Model**: Command-A (c4ai/command-a-03-2025)
- **Quantized Layers**: 62 layers (0-61) with 1-bit precision
- **Preserved Layers**: 2 layers (62-63) with FP16 precision
- **Compression Technique**: [OneBit](https://openreview.net/forum?id=ZwiG9KjfHV) + Fujitsu [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713)
- **Model Size**: 30.2GB (from original 207GB)
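As a hedged back-of-the-envelope check on these numbers (the hidden size below is illustrative, not taken from this card): a OneBit layer stores a packed 1-bit sign matrix plus two FP16 vectors, so a single m×n matrix shrinks by roughly 16× versus FP16; the embedding matrix, norms, and the two preserved FP16 layers pull the end-to-end ratio down toward the reported 6.7×.

```python
def fp16_bytes(m, n):
    return 2 * m * n                      # 2 bytes per FP16 weight

def onebit_bytes(m, n):
    sign_matrix = m * n / 8               # S: 1 bit per weight, bit-packed
    value_vectors = 2 * (m + n)           # a and b kept in FP16
    return sign_matrix + value_vectors

m = n = 12288                             # illustrative hidden size
print(fp16_bytes(m, n) / onebit_bytes(m, n))  # ≈ 16x for a single matrix
```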
## Future Plans
- **Global and Block-wise Fine-tuning**: Explore fine-tuning strategies, including block-wise methods, to further improve accuracy and robustness.
- **Complete Usage Examples**: Provide detailed implementation guides for efficient single-GPU deployment.
- **Optimization Updates**: Enhance performance with next-generation quantization techniques and improved reconstruction methods.
Currently, the quantization process keeps the last two layers in **non-quantized (FP16) weights** to maintain output quality, while applying aggressive **1-bit quantization** to the remaining layers. Future releases will integrate **block-wise fine-tuning** for additional performance gains.
## Ethical Considerations
This model inherits the capabilities and limitations of the base Command A model. Please refer to the original model's documentation for ethical guidelines and potential biases.
## **Model Card Contact**
For errors or additional questions about details in this model card, contact [email protected]
## **Terms of Use:**
We hope that the release of this model will make community-based research efforts more accessible by putting the weights of a highly performant model in the hands of researchers all over the world. This model is governed by a [CC-BY-NC](https://cohere.com/cohere-labs-cc-by-nc-license) license and also requires adherence to [Cohere Labs' Acceptable Use Policy](https://docs.cohere.com/docs/cohere-labs-acceptable-use-policy).
## **Citation**
If you use this model, please cite:
```bibtex
@misc{command-a-onebit-hybrid,
  title={Command-A 111B with QEP-Optimized OneBit Extreme Quantization},
  author={Yuma Ichikawa and Yusei Kawakami and Yoshiyuki Ishii and Keiji Kimura and Akira Sakai},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/qep/qep-1bit-extreme}
}
```
## License
This quantized model is released under the same license as the base Command A model (CC-BY-NC-4.0).