---
inference: false
base_model: c4ai/command-a-03-2025
pipeline_tag: text-generation
model_type: command-a
tags:
  - quantization
  - onebit
  - compression
  - command-a
  - text-generation
library_name: transformers
language:
- en
- ja
license: cc-by-nc-4.0
extra_gated_prompt: "By submitting this form, you agree to the [License Agreement](https://cohere.com/c4ai-cc-by-nc-license)  and acknowledge that the information you provide will be collected, used, and shared in accordance with Cohere’s [Privacy Policy]( https://cohere.com/privacy). You’ll receive email updates about Cohere Labs and Cohere research, events, products and services. You can unsubscribe at any time." 
extra_gated_fields:
 Name: text
 Affiliation: text
 Country: country
 I agree to use this model for non-commercial use ONLY: checkbox
---


# **Model Card for qep/qep-1bit-extreme**

🚨 **This model is a 1-bit quantized version of Cohere Labs Command A, produced with QEP.** You can find the unquantized version of Cohere Labs Command A [here](https://huggingface.co/CohereLabs/c4ai-command-a-03-2025).


## **Model Summary**
 
An optimized 1-bit quantized version of [c4ai/command-a-03-2025](https://huggingface.co/CohereLabs/c4ai-command-a-03-2025) that achieves **6.7x compression**, with quality preserved through advanced quantization optimization techniques.

## Key Features

- **Extreme Compression**: 6.7× smaller (207GB → 30.2GB, -85%); fits on a single GPU (30.2GB on an A100 80GB).
- **Enhanced Performance**: [OneBit](https://arxiv.org/abs/2402.11295) quantization, improved by Fujitsu [QEP](https://arxiv.org/abs/2504.09629) and [QQA](https://iclr.cc/virtual/2025/poster/30713).
- **Inference Speed-Up**: Faster inference via "bitlinear" computation.


## Model Details

- **Base Model**: c4ai/command-a-03-2025
- **Quantization Method**: [OneBit](https://openreview.net/forum?id=ZwiG9KjfHV) with Fujitsu [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713) optimization
- **Quantization Bits**: 1-bit for layers 0-61, FP16 for the last two layers (62-63)
- **Optimization Techniques**: Fujitsu [QEP](https://arxiv.org/abs/2504.09629), [QQA](https://iclr.cc/virtual/2025/poster/30713)
- **Compatible Hardware**: Single GPU (recommended: >= 40GB VRAM)
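
For reference, the decomposition behind the 1-bit layers can be written as follows. This is a sketch of OneBit's Sign-Value-Independent Decomposition using the checkpoint's **a, S, b** naming; mapping these to the paper's value vectors is our reading, not something spelled out in this repository:

```latex
% OneBit sketch: S is a ±1 sign matrix; a (input dim) and b (output dim)
% are FP16 value vectors that restore the scales lost by taking signs.
\[
\mathbf{W} \;\approx\; (\mathbf{b}\,\mathbf{a}^{\top}) \odot \mathbf{S},
\qquad \mathbf{S} = \operatorname{sign}(\mathbf{W}),
\]
\[
\mathbf{y} \;=\; \bigl[(\mathbf{x} \odot \mathbf{a})\,\mathbf{S}^{\top}\bigr] \odot \mathbf{b}.
\]
```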


Developed by: [Fujitsu](https://fujitsu.com/), [Cohere](https://cohere.com/), and [Cohere Labs](https://cohere.for.ai/)

* Point of Contact: [Contact form](https://contactline.jp.fujitsu.com/customform/csque04802/873532/) or [Email]([email protected])  
* License: [CC-BY-NC](https://cohere.com/cohere-labs-cc-by-nc-license); use also requires adherence to [Cohere Labs' Acceptable Use Policy](https://docs.cohere.com/docs/cohere-labs-acceptable-use-policy)



For more details on how this model was developed, check out our [Press Release (English)](https://global.fujitsu/-/media/Project/Fujitsu/Fujitsu-HQ/pr/news/2025/09/08-01-en.pdf) and [Press Release (Japanese)](https://global.fujitsu/ja-jp/pr/news/2025/09/08-01), Fujitsu's [Tech Report](https://arxiv.org/abs/2504.09629), and Cohere's [Tech Report](https://arxiv.org/abs/2504.00698).



## Usage

The base architecture of this model is **Command-A**. To load and use the model, please use the **CommandA model class**:

1. Load `model.safetensors`, which contains the quantized weights.
2. Replace all layers **except the last two** with **bitlinear implementations**.
3. Keep the **last two layers with non-quantized weights** for optimal performance.
4. The model requires the included `onebit_linear.py` for the quantized layer implementation. The checkpoint stores the **OneBit-specific a, S, and b components** needed to reconstruct each quantized layer.
5. Depending on the level of performance you wish to maintain, you may keep additional layers near the output unquantized.

**Note:** Direct loading support as an extension of the `transformers` package is planned for future releases.
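
As a concrete illustration of steps 1-4, here is a minimal sketch. The `OneBitLinear` name and its parameter layout are assumptions for illustration only; the `onebit_linear.py` shipped with the checkpoint is the authoritative implementation.

```python
# Minimal sketch (NOT the shipped implementation) of a OneBit-style
# linear layer and checkpoint loading. Class and parameter names are
# illustrative; see the included onebit_linear.py for the real API.
import torch
import torch.nn as nn
from safetensors.torch import load_file


class OneBitLinear(nn.Module):
    """Approximates W with a ±1 sign matrix S rescaled by two FP16
    value vectors: a (input dim) and b (output dim)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Dense ±1 matrix for clarity; a production "bitlinear" kernel
        # would bit-pack S and fuse the rescaling into the matmul.
        self.S = nn.Parameter(torch.ones(out_features, in_features), requires_grad=False)
        self.a = nn.Parameter(torch.ones(in_features, dtype=torch.float16))
        self.b = nn.Parameter(torch.ones(out_features, dtype=torch.float16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = [(x ⊙ a) S^T] ⊙ b approximately reconstructs the FP16 matmul.
        return (x * self.a) @ self.S.t().to(x.dtype) * self.b


# Load the quantized checkpoint, then swap the linear layers of blocks
# 0-61 for OneBitLinear modules, keeping the last two blocks in FP16.
state_dict = load_file("model.safetensors")
```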


## Requirements
```
torch>=2.0.0
transformers>=4.35.0
safetensors>=0.4.0
```
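
For example, with pip:

```
pip install "torch>=2.0.0" "transformers>=4.35.0" "safetensors>=0.4.0"
```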


## Performance
 
- **Memory Usage**: 6.7x reduction overall (207GB → 30.2GB)
- **Inference Speed**: Optimized for fast generation on a single GPU
- **Quality**: Improved over plain OneBit quantization through [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713) optimization
- **Compatibility**: Deployable on a single GPU

## Technical Specifications

- **Original Model**: Command-A (c4ai/command-a-03-2025)
- **Quantized Layers**: 62 layers (0-61) with 1-bit precision
- **Preserved Layers**: 2 layers (62-63) with FP16 precision
- **Compression Technique**: [OneBit](https://openreview.net/forum?id=ZwiG9KjfHV) + Fujitsu [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713)
- **Model Size**: 30.2GB (from original 207GB)
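
As a rough back-of-envelope (our illustration, not official accounting): FP16 spends 16 bits per weight, so a 1-bit sign matrix saves at most 16× on each quantized matrix; the FP16 value vectors, embeddings, and the two preserved FP16 layers pull the end-to-end ratio down to the reported ~6.7×.

```latex
% Storage of one OneBit linear of shape d_out x d_in, vs. plain FP16:
\[
\text{bits}_{\text{OneBit}}
  = \underbrace{d_{\text{out}} d_{\text{in}}}_{\text{sign matrix } \mathbf{S}}
  + \underbrace{16\,(d_{\text{in}} + d_{\text{out}})}_{\text{value vectors } \mathbf{a},\,\mathbf{b}}
  \;\ll\; 16\, d_{\text{out}} d_{\text{in}} = \text{bits}_{\text{FP16}}.
\]
```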

## Future Plans

- **Global and Block-wise Fine-tuning**: Explore fine-tuning strategies, including block-wise methods, to further improve accuracy and robustness.  
- **Complete Usage Examples**: Provide detailed implementation guides for efficient single-GPU deployment.
- **Optimization Updates**: Enhance performance with next-generation quantization techniques and improved reconstruction methods.

Currently, the quantization process preserves the last two layers in **FP16** to maintain output quality, while applying aggressive **1-bit quantization** to the remaining layers. Future releases will integrate **block-wise fine-tuning** for additional performance gains.



## Ethical Considerations

This model inherits the capabilities and limitations of the base Command A model. Please refer to the original model's documentation for ethical guidelines and potential biases.


## **Model Card Contact**

For errors or additional questions about details in this model card, contact [email protected]

## **Terms of Use:**

By releasing the weights of a highly performant model to researchers all over the world, we hope to make community-based research efforts more accessible. This model is governed by a [CC-BY-NC](https://cohere.com/cohere-labs-cc-by-nc-license) license and also requires adherence to [Cohere Labs' Acceptable Use Policy](https://docs.cohere.com/docs/cohere-labs-acceptable-use-policy).


## **Citation**

If you use this model, please cite:

```bibtex
@misc{command-a-onebit-hybrid,
  title={Command-A 111B with QEP-Optimized OneBit Extreme Quantization},
  author={Ichikawa, Yuma and Kawakami, Yusei and Ishii, Yoshiyuki and Kimura, Keiji and Sakai, Akira},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/qep/qep-1bit-extreme}
}
```


## License

This quantized model is released under the same license as the base Command A model (CC-BY-NC-4.0).
