# Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic
Uniform FP8 quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled — a Claude 4.6 Opus reasoning-distilled Qwen3.5-27B model.
~29 GB on disk (~27 GiB in VRAM). Near-lossless FP8 quantization with only 1.4% perplexity degradation vs BF16. Recommended GPU: NVIDIA RTX PRO 6000 (96 GB) or other GPUs with >= 48 GB VRAM.
For 32 GB GPUs (RTX 5090): Use the NVFP4 mixed-precision variant instead (~25 GB, fits with usable context on a single 5090).
## Quantization Strategy
Uniform FP8 W8A8 dynamic quantization using llm-compressor v0.10.1, stored in the compressed-tensors format. No calibration data needed — weight scales are computed statically per-channel, activation scales are computed dynamically per-token at inference time.
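As a sketch of what per-channel static weight scaling computes (a simplified stand-in, not the llm-compressor implementation; the integer-grid rounding below only approximates true E4M3 rounding):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def fp8_per_channel_quant(w):
    # One static scale per output channel (matrix row), so each row
    # maps its largest magnitude onto the FP8 dynamic range.
    scale = np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    # Coarse stand-in for E4M3 rounding: snap to an integer grid.
    # Real E4M3 rounds onto a floating-point grid with 3 mantissa bits.
    q = np.clip(np.round(w / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = fp8_per_channel_quant(w)
w_hat = q * scale  # dequantized reconstruction
print(float(np.abs(w - w_hat).max()))
```

Activation scales are not precomputed like this; at inference time each token's activations get their own dynamic scale, which is why no calibration set is needed.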
| Precision | Layers | Rationale |
|---|---|---|
| FP8 W8A8 (per-channel weights, per-token dynamic activations) | All nn.Linear layers except those in the ignore list | Near-lossless: FP8 E4M3 preserves 3 mantissa bits with per-channel granularity |
| BF16 (unquantized) | lm_head, embed_tokens, DeltaNet small projections (in_proj_a, in_proj_b), all norms, visual encoder, MoE router gates | lm_head amplifies errors across the 248K vocabulary; embed_tokens is a lookup table; DeltaNet low-rank projections are numerically sensitive; the vision tower is retained at full precision |
## Weight Breakdown
| Component | Size | Precision |
|---|---|---|
| MLP | 14.6 GB | FP8 |
| DeltaNet attention | 6.9 GB | FP8 + BF16 |
| lm_head | 2.5 GB | BF16 |
| embed_tokens | 2.5 GB | BF16 |
| Softmax attention | 2.1 GB | FP8 |
| Visual encoder | 0.9 GB | BF16 |
| Total | ~29 GB | |
## Architecture
Qwen3.5-27B uses a hybrid DeltaNet + softmax attention architecture with full_attention_interval=4:
Layer pattern (64 layers):

```
[DeltaNet, DeltaNet, DeltaNet, Softmax] × 16
= 48 DeltaNet layers + 16 softmax attention layers
```
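The hybrid schedule can be reproduced directly from `full_attention_interval` (a sketch; constant names are illustrative, not config keys):

```python
# With full_attention_interval=4, every 4th layer is softmax attention
# and the remaining layers are DeltaNet.
NUM_LAYERS = 64
FULL_ATTENTION_INTERVAL = 4

pattern = [
    "Softmax" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "DeltaNet"
    for i in range(NUM_LAYERS)
]

print(pattern.count("DeltaNet"), pattern.count("Softmax"))  # → 48 16
```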
Key architectural parameters:
- Hidden size: 5,120
- Attention heads: 24 (query), 4 (KV, GQA)
- Head dimension: 256
- DeltaNet heads: 16 key, 48 value (dim 128 each)
- MLP intermediate: 17,408
- Vocabulary: 248,320
- Max position embeddings: 262,144
Only 16 of 64 layers require KV cache — the 48 DeltaNet layers use a fixed-size recurrent state that doesn't grow with sequence length. This gives ~4x more context capacity than a standard transformer of the same size.
## KV Cache Budget
Per-token KV cache cost (only 16 softmax layers):
- FP16: 4 KV heads x 256 dim x 2 (K+V) x 2 bytes x 16 layers = 64 KB/token
- FP8: 32 KB/token
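The per-token arithmetic above, as a quick check:

```python
# Only the 16 softmax-attention layers allocate a growing KV cache;
# the 48 DeltaNet layers keep a fixed-size recurrent state instead.
KV_HEADS = 4
HEAD_DIM = 256
SOFTMAX_LAYERS = 16

def kv_bytes_per_token(bytes_per_elem):
    # ×2 for K and V tensors per layer
    return KV_HEADS * HEAD_DIM * 2 * bytes_per_elem * SOFTMAX_LAYERS

print(kv_bytes_per_token(2) // 1024, "KB/token (FP16)")  # → 64
print(kv_bytes_per_token(1) // 1024, "KB/token (FP8)")   # → 32
```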
| GPU | Available for KV | Max Context (FP8 KV) |
|---|---|---|
| RTX 5090 (32 GB) | ~4 GiB | ~128K tokens (single request) |
| RTX PRO 6000 (96 GB) | ~68 GiB | 8 concurrent requests × 262K tokens each |
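The table's capacity figures follow from the 32 KB/token FP8 cost (a back-of-envelope check; the free-VRAM amounts are the stated estimates, not measured values):

```python
# Max context tokens that fit in a given KV-cache budget at FP8.
KV_BYTES_PER_TOKEN_FP8 = 32 * 1024

def max_tokens(free_gib):
    return int(free_gib * 2**30) // KV_BYTES_PER_TOKEN_FP8

print(max_tokens(4))              # RTX 5090, ~4 GiB free → 131072 (~128K)
print(max_tokens(68) // 262144)   # RTX PRO 6000, ~68 GiB → 8 full 262K requests
```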
## Usage
### Serving with vLLM (recommended)

```bash
pip install "vllm>=0.17.0"

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic \
    --max-model-len 131072 \
    --reasoning-parser qwen3
```

RTX PRO 6000 / high-VRAM GPUs (>= 48 GB):

```bash
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic \
    --max-model-len 262144 \
    --reasoning-parser qwen3
```
Note: On RTX 5090 (32 GB), the same Blackwell-specific vLLM issues that affect the NVFP4 variant also apply here; see the NVFP4 model card for details and tracking PRs. On GPUs with >= 48 GB VRAM, these issues do not apply.
### Transformers (direct loading)

```python
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-FP8-Dynamic",
    trust_remote_code=True,
)
```
## Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | FP8 W8A8 with Blackwell FP8 acceleration. Works on >= 48 GB GPUs; 32 GB Blackwell GPUs require upcoming vLLM fixes |
| transformers >= 5.3.0 | Yes | Direct loading with device_map="auto" |
| SGLang | Yes | FP8 compressed-tensors supported |
| llama.cpp / GGUF | No | compressed-tensors FP8 format not supported |
## Hardware Requirements
| Configuration | VRAM | Notes |
|---|---|---|
| Minimum | 32 GB | Weights only, minimal context |
| RTX PRO 6000 (recommended) | 96 GB | 8 concurrent × 262K context with FP8 KV cache. Works out of the box with vLLM 0.17.0 |
| 2x RTX 5090 | 64 GB | Tensor parallel, full context |
| RTX 5090 (single) | 32 GB | Not recommended — only ~4 GiB free for KV cache after model loading. Use the NVFP4 variant instead |
## Benchmark Results
Comparison against the BF16 source model. All benchmarks run on NVIDIA RTX PRO 6000 (96 GB) with vLLM 0.17.0, temperature=0.6 for generation tasks (Qwen recommended setting for thinking mode).
| Benchmark | BF16 (54 GB) | FP8 (29 GB) | Delta |
|---|---|---|---|
| Perplexity (FineWeb-Edu, 100 samples) | 6.6119 | 6.7026 | +0.09 (+1.4%) |
| MMLU-Pro (500 samples) | 54.0% | 56.0%* | +2.0% |
| ARC-Challenge (1,172 samples) | 97.6% | 100%* | +2.4% |
| GSM8K Platinum (200 samples) | 99.5% | — | — |
| AIME 2025 (30 problems) | 40.0% | — | — |
| Throughput (single GPU) | 17.8 tok/s | 29.1 tok/s | 1.6x |
*FP8 MMLU-Pro and ARC-Challenge were run with 50 samples (quick mode), while BF16 used the full sample sizes, so these deltas are not directly comparable. Full-sample FP8 results will be added.
Summary: FP8 quantization is near-lossless — perplexity degrades only 1.4% vs BF16, while throughput improves 1.6x from reduced memory bandwidth. For comparison, the NVFP4 variant (25 GB) shows 2.1% perplexity degradation but fits in 4 GB less VRAM.
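The headline deltas can be re-derived from the table above (illustrative arithmetic only):

```python
# Relative perplexity degradation and throughput speedup, FP8 vs BF16.
ppl_bf16, ppl_fp8 = 6.6119, 6.7026
tok_s_bf16, tok_s_fp8 = 17.8, 29.1

ppl_delta_pct = (ppl_fp8 - ppl_bf16) / ppl_bf16 * 100
speedup = tok_s_fp8 / tok_s_bf16

print(round(ppl_delta_pct, 1), round(speedup, 1))  # → 1.4 1.6
```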
## Source Model
This is a quantization of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which is an SFT fine-tune of Qwen/Qwen3.5-27B using Claude 4.6 Opus reasoning distillation data.
Training datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- TeichAI/claude-4.5-opus-high-reasoning-250x
- Jackrong/Qwen3.5-reasoning-700x
## Quantization Details
- Tool: llm-compressor v0.10.1
- Format: compressed-tensors (uniform FP8)
- Scheme: FP8 W8A8 dynamic — per-channel static weight scales, per-token dynamic activation scales
- Calibration: None required (weight scales are computed statically from the weights themselves; activation scales are computed dynamically at runtime)
- Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB)
## License
Apache 2.0, following the base model license.