My vLLM launch command on my old personal 2x3090 workstation:
vllm serve \
...path.../tclf90/Qwen3.5-35B-A3B-AWQ \
--served-model-name Qwen3.5-35B-A3B-AWQ \
--swap-space 4 \
--max-num-seqs 4 \
--enable-prefix-caching \
--max-num-batched-tokens 2112 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9 \
--max-model-len 262144 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--no-enforce-eager \
--compilation-config.mode 3 \
--compilation-config.cudagraph_mode FULL_AND_PIECEWISE \
--compilation-config.cudagraph_capture_sizes '[1,2,4,8,16,24,32,40,48,56,64]' \
--compilation-config.max_cudagraph_capture_size 64 \
--compilation-config.use_inductor_graph_partition true \
--trust-remote-code \
--host localhost \
--port 8000
roughly 110 t/s for a single request (without MTP)
(MTP is slower on this machine, not sure why, so I turned it off for now)
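If you want to reproduce the t/s number, here is a minimal sketch of a single-request throughput check against the OpenAI-compatible endpoint vLLM serves. The URL and model name match the command above; the prompt and max_tokens are arbitrary, and the elapsed time includes prefill, so treat the result as a rough decode-speed estimate:

```python
import json
import time
import urllib.request

def decode_tps(completion_tokens: int, elapsed_s: float) -> float:
    """Rough single-request speed: generated tokens / wall-clock seconds.
    (Elapsed time includes prefill, so with a short prompt and a long
    generation this is close to pure decode speed.)"""
    return completion_tokens / elapsed_s

def benchmark(base_url: str = "http://localhost:8000",
              model: str = "Qwen3.5-35B-A3B-AWQ") -> None:
    body = json.dumps({
        "model": model,
        "prompt": "Write a short story about two GPUs.",
        "max_tokens": 256,
        "temperature": 0,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    elapsed = time.monotonic() - t0
    print(f"{usage['completion_tokens']} tokens in {elapsed:.1f}s "
          f"-> {decode_tps(usage['completion_tokens'], elapsed):.1f} t/s")

if __name__ == "__main__":
    try:
        benchmark()
    except OSError:
        print("server not reachable at http://localhost:8000")
```

For a more precise decode-only number you would stream the response and start the timer at the first token, but this is usually close enough to compare configs.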
Thanks for the config. For a single request I get ~55 t/s on dual 5060 Ti 16GB with MTP and ~35 t/s without:
-v /media/bas/images/.cache:/root/.cache \
--network=host \
--ipc=host \
vllm/vllm-openai:nightly \
"QuantTrio/Qwen3.5-27B-AWQ" \
--port 8000 --host 0.0.0.0 \
--trust-remote-code --tensor-parallel-size 2 --max-model-len $((1024*128)) --max-num-seqs 1 --gpu-memory-utilization 0.87 \
--kv-cache-dtype fp8_e4m3 \
--max-num-batched-tokens 2112 \
--language-model-only \
--performance-mode interactivity \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--compilation-config.mode 3 \
--compilation-config.cudagraph_mode FULL_AND_PIECEWISE \
--compilation-config.cudagraph_capture_sizes '[1,2,4,8,16,24,32]' \
--compilation-config.max_cudagraph_capture_size 32 \
--compilation-config.use_inductor_graph_partition true
Docker image: https://hub.docker.com/layers/vllm/vllm-openai/nightly/images (image ID when I pulled: sha256:364d579a2bc60dd4ad5c2cabf5d79d45b979d7867a424ce38ecd28f158c81ad4)
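As a rough way to reason about when num_speculative_tokens helps: with standard speculative decoding, a k-token draft whose per-token acceptance probability is alpha yields about (1 - alpha^(k+1)) / (1 - alpha) tokens per target-model forward pass, so MTP only wins when acceptance is high enough to offset the drafting overhead. A back-of-the-envelope sketch (the acceptance rates below are illustrative, not measured; real MTP acceptance is not i.i.d. per token):

```python
def expected_tokens_per_step(k: int, alpha: float) -> float:
    """Expected tokens emitted per target forward pass with a k-token
    draft and i.i.d. per-token acceptance probability alpha
    (geometric-series form of the standard speculative-decoding estimate)."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# With num_speculative_tokens=2, as in the --speculative-config above:
for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: ~{expected_tokens_per_step(2, alpha):.2f} tokens/step")
```

This also hints at why MTP can be a net loss on some machines: if the per-step cost of drafting and verification grows faster than the tokens-per-step gain, overall t/s drops even though acceptance looks fine.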
I have a vLLM nightly setup in Debian via Proxmox using 2x3090 with the P2P driver, and I can barely get 65 t/s. This is using your configuration above, plus no thinking.
Are your 3090s NVLink-ed?
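One quick way to answer that: `nvidia-smi topo -m` prints the interconnect matrix, where an `NV#` entry between GPU0 and GPU1 means NVLink and `PHB`/`SYS`/`PIX` mean traffic crosses PCIe or the host bridge, which limits tensor-parallel all-reduce bandwidth. A small sketch that pulls the GPU0-GPU1 entry out of that output (the parsing assumes the usual row layout where `GPU0` starts its matrix row):

```python
import subprocess

def gpu0_gpu1_link(topo_output: str) -> str:
    """Extract the GPU0<->GPU1 entry from `nvidia-smi topo -m` output."""
    for line in topo_output.splitlines():
        if line.startswith("GPU0"):
            cols = line.split()
            return cols[2]  # cols[1] is GPU0's own 'X' diagonal entry
    raise ValueError("no GPU0 row found in topo output")

if __name__ == "__main__":
    try:
        out = subprocess.run(["nvidia-smi", "topo", "-m"],
                             capture_output=True, text=True, check=True).stdout
        link = gpu0_gpu1_link(out)
        print("NVLink" if link.startswith("NV") else f"no NVLink ({link})")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not available on this machine")
```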
My other hardware is an Epyc 7J13 on an H12SSL-NT with 256GB RAM... it can't be my hardware, can it?!
