Error when using DUAL_CHUNK_FLASH_ATTN with vLLM

#23 by lssj14 - opened

Hello,
I'm trying to run this model with vLLM using dual chunk flash attention.
However, it fails with the following errors.

With vLLM v0.11.1rc2:

Value error, Invalid value 'DUAL_CHUNK_FLASH_ATTN' for VLLM_ATTENTION_BACKEND. Valid options: ['FLASH_ATTN', 'TRITON_ATTN', 'XFORMERS', 'ROCM_ATTN', 'ROCM_AITER_MLA', 'ROCM_AITER_FA', 'TORCH_SDPA', 'FLASHINFER', 'FLASHINFER_MLA', 'TRITON_MLA', 'CUTLASS_MLA', 'FLASHMLA', 'FLASHMLA_SPARSE', 'FLASH_ATTN_MLA', 'PALLAS', 'IPEX', 'NO_ATTENTION', 'FLEX_ATTENTION', 'TREE_ATTN', 'ROCM_AITER_UNIFIED_ATTN']. [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
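
For context, the backend is selected through the VLLM_ATTENTION_BACKEND environment variable, roughly like this (the model path is a placeholder, not my exact invocation):

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "DUAL_CHUNK_FLASH_ATTN"  # set before the engine is built

from vllm import LLM

llm = LLM(model="<path-to-this-model>")  # placeholder model path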

With vLLM v0.10.2:

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 399, in <lambda>
    lambda prefix: Qwen3MoeDecoderLayer(config=config,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 312, in __init__
    self.self_attn = Qwen3MoeAttention(
                     ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 253, in __init__
    self.attn = Attention(
                ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 182, in __init__
    self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/dual_chunk_flash_attn.py", line 325, in __init__
    assert dual_chunk_attention_config is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
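
If I read the assertion correctly, dual_chunk_attention_config is missing from the model config when the attention layer is built. Is something like the sketch below the intended way to supply it? The hf_overrides keys and values are only my guess based on Qwen's long-context configs, not taken from this model's card.

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "DUAL_CHUNK_FLASH_ATTN"

from vllm import LLM

llm = LLM(
    model="<path-to-this-model>",  # placeholder model path
    hf_overrides={
        # guessed structure and values; please correct if the model expects something else
        "dual_chunk_attention_config": {
            "chunk_size": 262144,
            "local_size": 8192,
            "original_max_position_embeddings": 262144,
        },
    },
)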

Could you help me find the right config or vLLM version to run this model with DUAL_CHUNK_FLASH_ATTN?
