Error when utilizing DUAL_CHUNK_FLASH_ATTN with vLLM.
#23 opened by lssj14
Hello,
I'm trying to run this model with vLLM using dual chunk flash attention.
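For reference, here is roughly how I'm launching it. This is only a minimal sketch; the model path, parallelism, and lengths below are placeholders rather than my exact settings.

```python
# Minimal repro sketch (model path and engine arguments are placeholders).
import os

# Select the dual chunk flash attention backend before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "DUAL_CHUNK_FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/this-model",   # placeholder path
    tensor_parallel_size=4,        # placeholder parallelism
    max_model_len=131072,          # placeholder context length
    enforce_eager=True,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```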
However, it fails with the following errors.
With vLLM v0.11.1rc2:
Value error, Invalid value 'DUAL_CHUNK_FLASH_ATTN' for VLLM_ATTENTION_BACKEND. Valid options: ['FLASH_ATTN', 'TRITON_ATTN', 'XFORMERS', 'ROCM_ATTN', 'ROCM_AITER_MLA', 'ROCM_AITER_FA', 'TORCH_SDPA', 'FLASHINFER', 'FLASHINFER_MLA', 'TRITON_MLA', 'CUTLASS_MLA', 'FLASHMLA', 'FLASHMLA_SPARSE', 'FLASH_ATTN_MLA', 'PALLAS', 'IPEX', 'NO_ATTENTION', 'FLEX_ATTENTION', 'TREE_ATTN', 'ROCM_AITER_UNIFIED_ATTN']. [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
With vLLM v0.10.2:
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 399, in <lambda>
lambda prefix: Qwen3MoeDecoderLayer(config=config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 312, in __init__
self.self_attn = Qwen3MoeAttention(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 253, in __init__
self.attn = Attention(
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 182, in __init__
self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 325, in __init__
assert dual_chunk_attention_config is not None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
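From the assertion it looks like vLLM expects a dual_chunk_attention_config entry in the model's HF config. Below is a minimal sketch of passing one via hf_overrides; the field names and values are assumptions based on the Qwen2.5-1M dual chunk attention convention, not settings confirmed for this model.

```python
# Sketch only: the dual_chunk_attention_config keys/values below follow the
# Qwen2.5-1M convention and are assumptions, not confirmed for this checkpoint.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "DUAL_CHUNK_FLASH_ATTN"

from vllm import LLM

llm = LLM(
    model="/path/to/this-model",  # placeholder path
    hf_overrides={
        "dual_chunk_attention_config": {
            "chunk_size": 262144,                        # assumed value
            "local_size": 8192,                          # assumed value
            "original_max_position_embeddings": 262144,  # assumed value
        }
    },
)
```

Alternatively, the same block could presumably be added directly to the model's config.json, but I'm not sure which values this particular checkpoint expects.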
Could you help me find the right config or vLLM version to use this model with DUAL_CHUNK_FLASH_ATTN?
vllm 0.10.0