---
inference: false
base_model: c4ai/command-a-03-2025
pipeline_tag: text-generation
model_type: command-a
tags:
  - quantization
  - onebit
  - compression
  - command-a
  - text-generation
library_name: transformers
language:
- en
- ja
license: cc-by-nc-4.0
extra_gated_prompt: "By submitting this form, you agree to the [License Agreement](https://cohere.com/c4ai-cc-by-nc-license)  and acknowledge that the information you provide will be collected, used, and shared in accordance with Cohere’s [Privacy Policy]( https://cohere.com/privacy). You’ll receive email updates about Cohere Labs and Cohere research, events, products and services. You can unsubscribe at any time." 
extra_gated_fields:
 Name: text
 Affiliation: text
 Country: country
 I agree to use this model for non-commercial use ONLY: checkbox
---


# **Model Card for qep/qep-1bit-extreme**

🚨 **This model is a 1-bit quantized version of Cohere Labs Command A, produced with QEP.** You can find the unquantized version of Cohere Labs Command A [here](https://huggingface.co/CohereLabs/c4ai-command-a-03-2025).


## **Model Summary**
 
An optimized 1-bit quantized version of [c4ai/command-a-03-2025](https://huggingface.co/CohereLabs/c4ai-command-a-03-2025) that achieves **6.7x compression**, with quality preserved through advanced quantization optimization techniques.

## Key Features

- **Extreme Compression**: 6.7× smaller (207GB → 30.2GB, -85%); fits on a single GPU (30.2GB on an A100 80GB).
- **Enhanced Performance**: [OneBit](https://arxiv.org/abs/2402.11295) quantization, improved by Fujitsu [QEP](https://arxiv.org/abs/2504.09629) and [QQA](https://iclr.cc/virtual/2025/poster/30713).
- **Inference Speed-Up**: Faster inference via "bitlinear" computation.


## Model Details

- **Base Model**: c4ai/command-a-03-2025
- **Quantization Method**: [OneBit](https://openreview.net/forum?id=ZwiG9KjfHV) with Fujitsu [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713) optimization
- **Quantization Bits**: 1-bit for layers 0-61, FP16 for the last two layers (62-63)
- **Optimization Techniques**: Fujitsu [QEP](https://arxiv.org/abs/2504.09629), [QQA](https://iclr.cc/virtual/2025/poster/30713)
- **Compatible Hardware**: Single GPU (recommended: >= 40GB VRAM)
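
For reference, the decomposition behind the 1-bit layers can be written as follows. This is a sketch of OneBit's Sign-Value-Independent Decomposition using the checkpoint's **a, S, b** naming; mapping these to the paper's value vectors is our reading, not something spelled out in this repository:

```latex
% OneBit sketch: S is a ±1 sign matrix; a (input dim) and b (output dim)
% are FP16 value vectors that restore the scales lost by taking signs.
\[
\mathbf{W} \;\approx\; (\mathbf{b}\,\mathbf{a}^{\top}) \odot \mathbf{S},
\qquad \mathbf{S} = \operatorname{sign}(\mathbf{W}),
\]
\[
\mathbf{y} \;=\; \bigl[(\mathbf{x} \odot \mathbf{a})\,\mathbf{S}^{\top}\bigr] \odot \mathbf{b}.
\]
```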


Developed by: [Fujitsu](https://fujitsu.com/), [Cohere](https://cohere.com/), and [Cohere Labs](https://cohere.for.ai/)

* Point of Contact: [Contact form](https://contactline.jp.fujitsu.com/customform/csque04802/873532/) or [Email]([email protected])  
* License: [CC-BY-NC](https://cohere.com/cohere-labs-cc-by-nc-license); use also requires adherence to [Cohere Labs' Acceptable Use Policy](https://docs.cohere.com/docs/cohere-labs-acceptable-use-policy)



For more details on how this model was developed, check out our [Press Release (English)](https://global.fujitsu/-/media/Project/Fujitsu/Fujitsu-HQ/pr/news/2025/09/08-01-en.pdf) and [Press Release (Japanese)](https://global.fujitsu/ja-jp/pr/news/2025/09/08-01), Fujitsu's [Tech Report](https://arxiv.org/abs/2504.09629), and Cohere's [Tech Report](https://arxiv.org/abs/2504.00698).



## Usage

The base architecture of this model is **Command-A**. To load and use the model, please use the **CommandA model class**:

1. Load `model.safetensors`, which contains the quantized weights.
2. Replace all layers **except the last two** with **bitlinear implementations**.
3. Keep the **last two layers with non-quantized weights** for optimal performance.
4. The model requires the included `onebit_linear.py` for the quantized layer implementation. The checkpoint stores the **OneBit-specific a, S, and b components** needed to reconstruct each quantized layer.
5. Depending on the level of performance you wish to maintain, you may keep additional layers near the output unquantized.

**Note:** Direct loading support as an extension of the `transformers` package is planned for future releases.
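
As a concrete illustration of steps 1-4, here is a minimal sketch. The `OneBitLinear` name and its parameter layout are assumptions for illustration only; the `onebit_linear.py` shipped with the checkpoint is the authoritative implementation.

```python
# Minimal sketch (NOT the shipped implementation) of a OneBit-style
# linear layer and checkpoint loading. Class and parameter names are
# illustrative; see the included onebit_linear.py for the real API.
import torch
import torch.nn as nn
from safetensors.torch import load_file


class OneBitLinear(nn.Module):
    """Approximates W with a ±1 sign matrix S rescaled by two FP16
    value vectors: a (input dim) and b (output dim)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Dense ±1 matrix for clarity; a production "bitlinear" kernel
        # would bit-pack S and fuse the rescaling into the matmul.
        self.S = nn.Parameter(torch.ones(out_features, in_features), requires_grad=False)
        self.a = nn.Parameter(torch.ones(in_features, dtype=torch.float16))
        self.b = nn.Parameter(torch.ones(out_features, dtype=torch.float16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = [(x ⊙ a) S^T] ⊙ b approximately reconstructs the FP16 matmul.
        return (x * self.a) @ self.S.t().to(x.dtype) * self.b


# Load the quantized checkpoint, then swap the linear layers of blocks
# 0-61 for OneBitLinear modules, keeping the last two blocks in FP16.
state_dict = load_file("model.safetensors")
```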


## Requirements
```
torch>=2.0.0
transformers>=4.35.0
safetensors>=0.4.0
```
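
For example, with pip:

```
pip install "torch>=2.0.0" "transformers>=4.35.0" "safetensors>=0.4.0"
```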


## Performance
 
- **Memory Usage**: 6.7x reduction overall (207GB → 30.2GB)
- **Inference Speed**: Optimized for fast generation on a single GPU
- **Quality**: Improved over plain OneBit quantization through [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713) optimization
- **Compatibility**: Deployable on a single GPU

## Technical Specifications

- **Original Model**: Command-A (c4ai/command-a-03-2025)
- **Quantized Layers**: 62 layers (0-61) with 1-bit precision
- **Preserved Layers**: 2 layers (62-63) with FP16 precision
- **Compression Technique**: [OneBit](https://openreview.net/forum?id=ZwiG9KjfHV) + Fujitsu [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713)
- **Model Size**: 30.2GB (from original 207GB)
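
As a rough back-of-envelope (our illustration, not official accounting): FP16 spends 16 bits per weight, so a 1-bit sign matrix saves at most 16× on each quantized matrix; the FP16 value vectors, embeddings, and the two preserved FP16 layers pull the end-to-end ratio down to the reported ~6.7×.

```latex
% Storage of one OneBit linear of shape d_out x d_in, vs. plain FP16:
\[
\text{bits}_{\text{OneBit}}
  = \underbrace{d_{\text{out}} d_{\text{in}}}_{\text{sign matrix } \mathbf{S}}
  + \underbrace{16\,(d_{\text{in}} + d_{\text{out}})}_{\text{value vectors } \mathbf{a},\,\mathbf{b}}
  \;\ll\; 16\, d_{\text{out}} d_{\text{in}} = \text{bits}_{\text{FP16}}.
\]
```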

## Future Plans

- **Global and Block-wise Fine-tuning**: Explore fine-tuning strategies, including block-wise methods, to further improve accuracy and robustness.  
- **Complete Usage Examples**: Provide detailed implementation guides for efficient single-GPU deployment.
- **Optimization Updates**: Enhance performance with next-generation quantization techniques and improved reconstruction methods.

Currently, the quantization process preserves the last two layers in **FP16** to maintain output quality, while applying aggressive **1-bit quantization** to the remaining layers. Future releases will integrate **block-wise fine-tuning** for additional performance gains.



## Ethical Considerations

This model inherits the capabilities and limitations of the base Command A model. Please refer to the original model's documentation for ethical guidelines and potential biases.


## **Model Card Contact**

For errors or additional questions about details in this model card, contact [email protected]

## **Terms of Use:**

By releasing the weights of a highly performant model to researchers all over the world, we hope to make community-based research efforts more accessible. This model is governed by a [CC-BY-NC](https://cohere.com/cohere-labs-cc-by-nc-license) license and also requires adherence to [Cohere Labs' Acceptable Use Policy](https://docs.cohere.com/docs/cohere-labs-acceptable-use-policy).


## **Citation**

If you use this model, please cite:

```bibtex
@misc{command-a-onebit-hybrid,
  title={Command-A 111B with QEP-Optimized OneBit Extreme Quantization},
  author={Ichikawa, Yuma and Kawakami, Yusei and Ishii, Yoshiyuki and Kimura, Keiji and Sakai, Akira},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/qep/qep-1bit-extreme}
}
```


## License

This quantized model is released under the same license as the base Command A model (CC-BY-NC-4.0).
