render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues.
#2 opened by imweijh
build-cuda/bin/llama-server --version
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
version: 8120 (07968d53e)
built with GNU 13.3.0 for Linux aarch64
build-cuda/bin/llama-server -m ~/.cache/llama.cpp/AesSedai_MiniMax-M2.5-IQ3_S-00001-of-00003.gguf -c 32000 --no-mmap
slot load_model: id 3 | task -1 | speculative decoding context not initialized
slot load_model: id 3 | task -1 | new slot, n_ctx = 32000
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues.
init: chat template, example_format: ']~!b[]~b]system
You are a helpful assistant[e~[
]~b]user
Hello[e~[
]~b]ai
Hi there[e~[
]~b]user
How are you?[e~[
]~b]ai
<think>
'
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues.
render_message_to_json: Neither string content nor typed content is supported by the template. This is unexpected and may lead to issues.
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
First, thank you for your GGUF quantization.
There’s a small warning when running the model; switching to Unsloth’s chat template seems to resolve it.
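For reference, one way to swap in a different chat template is to pass it explicitly rather than relying on the one embedded in the GGUF. This is a sketch, not a verified fix: `--jinja` and `--chat-template-file` are real llama-server flags, but the `minimax-m2.jinja` filename below is illustrative — substitute whatever template file Unsloth ships.

```shell
# Sketch: override the GGUF's embedded chat template with an external one.
# The template path is illustrative; use the file from Unsloth's upload.
build-cuda/bin/llama-server \
  -m ~/.cache/llama.cpp/AesSedai_MiniMax-M2.5-IQ3_S-00001-of-00003.gguf \
  -c 32000 --no-mmap \
  --jinja \
  --chat-template-file ./minimax-m2.jinja
```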
Try using the autoparser branch on llama.cpp from pwilkin: https://github.com/ggml-org/llama.cpp/pull/18675
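If it helps, that PR can be checked out locally with GitHub's standard pull-request refspec. A minimal sketch, assuming the PR is still open and a CUDA toolchain matching the build in the logs above:

```shell
# Fetch PR #18675 into a local branch named "autoparser" and build llama-server.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/18675/head:autoparser
git checkout autoparser
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --target llama-server -j
```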