Kimi K2.5 using ktkernel + sglang, 16 TPS, but no starting <think> tag.

#28
by gyularabai - opened

Hi, I am running Kimi K2.5 using ktransformers and sglang, with the following command, on a system with an AMD EPYC 9755 CPU, 768 GB DDR5 RAM, and an NVIDIA RTX 6000 PRO 96 GB GPU. The generation speed is 16 tokens/sec. The problem is that the model does not return an opening <think> tag. It returns the thinking content followed by a closing </think> tag and then the standard response, but my clients (Open WebUI, Cline, etc.) need the opening tag to operate properly.

Any suggestions on how to solve this?

[Unit]
Description=Kimi 2.5 Server
After=network.target

[Service]
User=user
WorkingDirectory=/home/user/kimi2.5
Environment="CUDA_HOME=/usr/local/cuda-12.9"
Environment="PATH=/usr/local/cuda-12.9/bin:$PATH"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.9/lib64:${LD_LIBRARY_PATH:-}"

ExecStart=bash -c 'source /home/user/miniconda3/bin/activate kimi25; \
python -m sglang.launch_server \
--host 0.0.0.0 \
--port 10002 \
--model /home/user/models/Kimi-K2.5 \
--kt-weight-path /home/user/models/Kimi-K2.5 \
--kt-cpuinfer 120 \
--kt-threadpool-count 1 \
--kt-num-gpu-experts 30 \
--kt-method RAWINT4 \
--kt-gpu-prefill-token-threshold 400 \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--trust-remote-code \
--mem-fraction-static 0.94 \
--served-model-name Kimi-K2.5 \
--enable-mixed-chunk \
--tensor-parallel-size 1 \
--enable-p2p-check \
--disable-shared-experts-fusion \
--context-length 131072 \
--chunked-prefill-size 131072 \
--max-total-tokens 150000 \
--attention-backend flashinfer'

Restart=on-failure
TimeoutStartSec=600

[Install]
WantedBy=multi-user.target

After running the above command, there is no opening <think> tag in the response. The reasoning is there, followed by a closing </think> tag, but the opening tag is missing.

The --reasoning-parser kimi_k2 flag also has no effect: the reasoning content is never parsed into the reasoning_content field of the response.

Any suggestions on how to get the starting tag into the response?

Here is an example response:

"data": { "id": "7bbe0883ed364588a6633cab94d20a42", "object": "chat.completion.chunk", "created": 1769694082, "model": "Kimi-K2.5", "choices": [ { "index": 0, "message": { "role": null, "content": " The user is asking a very simple question: \"How big is an apple\". This is a straightforward factual question about the typical size of an apple. I should provide a helpful, accurate answer that covers the typical dimensions while acknowledging that apples vary in size by variety.\n\nKey points to cover:\n1. Typical diameter range (2.5 to 3.5 inches or 6 to 9 cm)\n2. Typical weight range (150-250 grams or 5-9 ounces)\n3. Variation by variety (from crab apples to large cooking apples)\n4. Comparison to common objects for context (tennis ball, baseball, fist)\n\nI should keep it concise but informative, giving both metric and imperial measurements since the user didn't specify a unit system.\n\nStructure:\n- General size description\n- Specific measurements (diameter/weight)\n- Variations by type\n- Visual comparisons\n\nThis is a safe, straightforward question with no concerning content. I should provide a helpful, neutral response. An apple is typically about **2.5 to 3.5 inches (6–9 cm)** in diameter—roughly the size of a tennis ball or baseball.\n\n**Weight:** Most eating apples weigh between **5–9 ounces (150–250 grams)**.\n\n**Variations by type:**\n- **Small:** Lady apples or crab apples (1–2 inches/2.5–5 cm)\n- **Medium:** Gala, Fuji, or Golden Delicious (2.5–3 inches/6–7.5 cm)\n- **Large:** Honeycrisp, Granny Smith, or cooking apples like Bramley (3.5–4+ inches/9–10 cm)\n\nFor reference, a medium apple is approximately the size of your closed fist. The \"serving size\" used in nutrition labels is typically one medium apple (about 182 grams).", "reasoning_content": "", "tool_calls": null }, "logprobs": null, "finish_reason": "stop", "matched_stop": 163586 } ],
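As a client-side stopgap, the opening tag can be restored before the response reaches tools like Open WebUI or Cline. This is a minimal sketch of that idea, assuming (as in the response above) that the closing </think> tag is always present even when the opening one is missing; it is not the fix adopted in this thread, just a workaround for non-streaming responses:

```python
def restore_think_tag(content: str) -> str:
    """Prepend a missing <think> tag when the model emits only the closing tag."""
    if "</think>" in content and not content.lstrip().startswith("<think>"):
        return "<think>" + content
    return content

# Example: reasoning arrives with no opening tag, as in the response above.
raw = "The user is asking a simple question...</think>An apple is roughly 3 inches wide."
print(restore_think_tag(raw))
# Already-tagged content is left untouched.
print(restore_think_tag("<think>reasoning</think>answer"))
```

For streamed responses this would need to buffer until either tag is seen, which is why a proxy or gateway in front of the server is a more practical place to apply it.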


Moonshot AI org

Check if https://github.com/sgl-project/sglang/pull/17901 can fix the problem.

I got an answer in the LocalLLaMA subreddit that fixed the issue. I modified the AI requests and injected chat_template_kwargs using our AI Gateway. I have written a doc about it in case others face the same issue. The problem was that we use Open WebUI, so direct request modification was not possible. Here is the doc:
https://ozeki-ai-gateway.com/p_9177-how-to-fix-missing-think-tag-for-kimi-k2.5.html

Here is the reddit post with the fix:
https://www.reddit.com/r/LocalLLaMA/comments/1qqebfh/kimi_k25_using_ktkernel_sglang_16_tps_but_no/

we use Open WebUI, so direct request modification was not possible.

It can be done in open-webui.
In the model Advanced Controls, add a Custom value.
The name of the custom value should be chat_template_kwargs and the value should be, e.g., {"thinking": false}, or whatever you need.
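For clients that talk to the endpoint directly rather than through Open WebUI, the same kwargs can be injected per request in the JSON body. A sketch of what that body might look like, assuming the server forwards a top-level chat_template_kwargs field to the chat template (whether it does depends on your sglang version; the helper name here is hypothetical):

```python
def build_chat_request(messages: list, thinking: bool) -> dict:
    """Build an OpenAI-compatible chat request body with chat_template_kwargs injected."""
    return {
        "model": "Kimi-K2.5",
        "messages": messages,
        # Forwarded to the chat template; {"thinking": false} corresponds to
        # the Open WebUI custom value described above.
        "chat_template_kwargs": {"thinking": thinking},
    }

payload = build_chat_request(
    [{"role": "user", "content": "How big is an apple?"}],
    thinking=False,
)
print(payload["chat_template_kwargs"])
```

The resulting dict would then be POSTed to the server's /v1/chat/completions endpoint as usual.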

You are correct, just tested. Thanks for the tip.
