INFO 09-14 12:09:21 [__init__.py:235] Automatically detected platform cuda.
INFO 09-14 12:09:28 [config.py:1604] Using max model len 3072
WARNING 09-14 12:09:28 [cuda.py:103] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 09-14 12:09:28 [llm_engine.py:228] Initializing a V0 LLM engine (v0.10.0) with config: model='/home/work/sj/medgemma/250827_benchmarking/lingshu-7b/outputs/lm_srrg_impression_ift/merged', speculative_config=None, tokenizer='/home/work/sj/medgemma/250827_benchmarking/lingshu-7b/outputs/lm_srrg_impression_ift/merged', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=/home/work/sj/medgemma/250827_benchmarking/lingshu-7b/outputs/lm_srrg_impression_ift/merged, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":false,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}, use_cached_outputs=False,
INFO 09-14 12:09:29 [cuda.py:326] Using XFormers backend.
INFO 09-14 12:09:30 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-14 12:09:30 [model_runner.py:1083] Starting to load model /home/work/sj/medgemma/250827_benchmarking/lingshu-7b/outputs/lm_srrg_impression_ift/merged...
INFO 09-14 12:09:48 [default_loader.py:262] Loading weights took 17.51 seconds
INFO 09-14 12:09:48 [model_runner.py:1115] Model loading took 15.5675 GiB and 17.714645 seconds
WARNING 09-14 12:09:49 [profiling.py:276] The sequence length (3072) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
WARNING 09-14 12:09:49 [model_runner.py:1274] Computed max_num_seqs (min(256, 3072 // 65536)) to be less than 1. Setting it to the minimum value of 1.
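The clamping in the warning above follows directly from the numbers in the log: the worst-case multimodal budget (65536 tokens, broken down in the next warning) exceeds `max_model_len` (3072), so the integer division yields 0 and vLLM clamps the batch size to 1. A quick check of that arithmetic:

```python
# Reproducing the max_num_seqs computation reported in the warning above.
max_model_len = 3072          # from "Using max model len 3072"
worst_case_mm_tokens = 65536  # worst-case multimodal token budget from the log
max_num_seqs = min(256, max_model_len // worst_case_mm_tokens)
print(max_num_seqs)  # 0 -> vLLM clamps this to the minimum value of 1
```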
WARNING 09-14 12:09:56 [profiling.py:237] The sequence length used for profiling (max_num_batched_tokens / max_num_seqs = 3072) is too short to hold the multi-modal embeddings in the worst case (65536 tokens in total, out of which {'image': 49152, 'video': 16384} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
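Following the warning's own advice, one way to silence both profiling warnings is to raise `max_model_len` and cap the per-prompt multimodal items via `limit_mm_per_prompt` (a real vLLM `LLM`/CLI option that shrinks the worst-case `mm_counts`). A minimal sketch, assuming the path from this log and illustrative values (`8192`, one image, no video) that you would tune to your actual inputs:

```python
# Hedged sketch: engine arguments that would relax the two profiling warnings
# above. 8192 and the per-prompt caps are illustrative, not prescriptive.
engine_args = {
    "model": "/home/work/sj/medgemma/250827_benchmarking/lingshu-7b/outputs/lm_srrg_impression_ift/merged",
    "max_model_len": 8192,                            # was 3072 in this run
    "limit_mm_per_prompt": {"image": 1, "video": 0},  # cap worst-case mm tokens
    "enforce_eager": True,                            # matches this run
    "trust_remote_code": True,
}
# Usage (requires a GPU and the model weights):
#   from vllm import LLM
#   llm = LLM(**engine_args)
```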
INFO 09-14 12:10:05 [worker.py:295] Memory profiling takes 16.31 seconds
INFO 09-14 12:10:05 [worker.py:295] the current vLLM instance can use total_gpu_memory (79.19GiB) x gpu_memory_utilization (0.90) = 71.27GiB
INFO 09-14 12:10:05 [worker.py:295] model weights take 15.57GiB; non_torch_memory takes 0.16GiB; PyTorch activation peak memory takes 9.87GiB; the rest of the memory reserved for KV Cache is 45.67GiB.
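The memory accounting above is a simple subtraction, using only numbers printed in the log: the usable budget is total GPU memory times the utilization fraction, and the KV cache gets whatever remains after weights, non-torch allocations, and peak activations.

```python
# Verifying the worker's memory arithmetic (all values in GiB, from the log).
total_gpu_memory = 79.19
gpu_memory_utilization = 0.90
budget = total_gpu_memory * gpu_memory_utilization        # 71.27 GiB
weights, non_torch, activation_peak = 15.57, 0.16, 9.87
kv_cache = budget - weights - non_torch - activation_peak
print(round(budget, 2), round(kv_cache, 2))  # 71.27 45.67
```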
INFO 09-14 12:10:05 [executor_base.py:113] # cuda blocks: 53446, # CPU blocks: 4681
INFO 09-14 12:10:05 [executor_base.py:118] Maximum concurrency for 3072 tokens per request: 278.36x
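The 278.36x concurrency figure is consistent with the block count above if we assume vLLM's default KV-cache block size of 16 tokens (the log does not print the block size, so this is an assumption): total cacheable tokens divided by the per-request length gives the maximum number of full-length requests the KV cache can hold at once.

```python
# Reconstructing the "Maximum concurrency" figure from the block count.
num_gpu_blocks = 53446
block_size = 16          # ASSUMPTION: vLLM's default block size; not in the log
max_model_len = 3072
max_concurrency = num_gpu_blocks * block_size / max_model_len
print(round(max_concurrency, 2))  # 278.36, matching the log
```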
INFO 09-14 12:10:09 [llm_engine.py:424] init engine (profile, create kv cache, warmup model) took 20.83 seconds