When I query the topic of weather, the highest-scoring match is "Gaming":
$ python3 -c "
import requests
import json

queries = [
    'Search for weather information',
    'get weather forecast',
    'weather report',
    'temperature today'
]
weather_text = 'Current weather conditions and forecasts'
gaming_text = 'Get List of all Lost Ark Cards details'

for query in queries:
    payload = {'model': 'KaLM', 'input': query}
    resp = requests.post('http://localhost:8001/v1/embeddings', json=payload)
    query_emb = resp.json()['data'][0]['embedding']
    for i, text in enumerate([weather_text, gaming_text]):
        payload2 = {'model': 'KaLM', 'input': text}
        resp2 = requests.post('http://localhost:8001/v1/embeddings', json=payload2)
        emb = resp2.json()['data'][0]['embedding']
        sim = sum([q*e for q, e in zip(query_emb, emb)]) / (sum([q**2 for q in query_emb])**0.5 * sum([e**2 for e in emb])**0.5)
        print(f'{query} -> {\"Weather\" if i==0 else \"Gaming\"}: {sim:.4f}')
    print()
"
Search for weather information -> Weather: 0.7216
Search for weather information -> Gaming: 0.9537
get weather forecast -> Weather: 0.7007
get weather forecast -> Gaming: 0.9006
weather report -> Weather: 0.6919
weather report -> Gaming: 0.8948
temperature today -> Weather: 0.6992
temperature today -> Gaming: 0.8919
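(As an aside, the loop above re-embeds both document texts for every query; the /v1/embeddings endpoint also accepts a list of strings, so the documents could be embedded once up front. A standard-library-only sketch against the same local server, where the server URL and model name match the transcript above:)

```python
import json
import urllib.request

def order_embeddings(data):
    # The API returns one item per input; sort by the "index" field so the
    # embeddings line up with the order of the input texts.
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]

def embed_batch(texts, model="KaLM", url="http://localhost:8001/v1/embeddings"):
    # One POST for all inputs instead of one request per text.
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return order_embeddings(json.load(resp)["data"])
```

This also removes one source of noise when comparing runs, since each document is embedded exactly once.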
I tested it on my own deployed service using the data you provided. The results appear to be consistent with expectations:
Search for weather information -> Weather: 0.8248
get weather forecast -> Gaming: 0.5281
Search for weather information -> Weather: 0.7134
get weather forecast -> Gaming: 0.4050
Search for weather information -> Weather: 0.7887
get weather forecast -> Gaming: 0.4070
Search for weather information -> Weather: 0.7154
get weather forecast -> Gaming: 0.3772
The code I ran is as follows (where my_get_embedding_func is the function for calling my own vLLM deployed service):
queries = [
    'Search for weather information',
    'get weather forecast',
    'weather report',
    'temperature today'
]
weather_text = 'Current weather conditions and forecasts'
gaming_text = 'Get List of all Lost Ark Cards details'

for query in queries:
    query_emb = my_get_embedding_func(query)
    for i, text in enumerate([weather_text, gaming_text]):
        emb = my_get_embedding_func(text)
        sim = sum([q*e for q, e in zip(query_emb, emb)]) / (sum([q**2 for q in query_emb])**0.5 * sum([e**2 for e in emb])**0.5)
        print(f'{query} -> {"Weather" if i==0 else "Gaming"}: {sim:.4f}')
    print()
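The inline similarity expression can also be factored into a small helper, which makes the comparison easier to read and reuse (a sketch assuming numpy is installed):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the two embedding vectors divided by the product of
    # their L2 norms.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```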
Could you please provide details on your model deployment? It would be helpful to verify if the parameters are being loaded correctly.
The response you get seems as expected, but mine is bad. And when I use the same vLLM config and code to test Qwen3-Embedding-8B, nothing unexpected happens, so I don't know why. The details are as follows:
=======KaLM-Embedding-Gemma3-12B-2511========
vllm serve the/path/to/KaLM-Embedding-Gemma3-12B-2511 \
    --served-model-name "KaLM" \
    --trust-remote-code \
    --dtype auto \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --port 8001
$ python3 -c "
import requests
import json

queries = [
    'Search for weather information',
    'get weather forecast',
    'weather report',
    'temperature today'
]
weather_text = 'Current weather conditions and forecasts'
gaming_text = 'Get List of all Lost Ark Cards details'

for query in queries:
    payload = {'model': 'KaLM', 'input': query}
    resp = requests.post('http://localhost:8001/v1/embeddings', json=payload)
    query_emb = resp.json()['data'][0]['embedding']
    for i, text in enumerate([weather_text, gaming_text]):
        payload2 = {'model': 'KaLM', 'input': text}
        resp2 = requests.post('http://localhost:8001/v1/embeddings', json=payload2)
        emb = resp2.json()['data'][0]['embedding']
        sim = sum([q*e for q, e in zip(query_emb, emb)]) / (sum([q**2 for q in query_emb])**0.5 * sum([e**2 for e in emb])**0.5)
        print(f'{query} -> {\"Weather\" if i==0 else \"Gaming\"}: {sim:.4f}')
    print()
"
Search for weather information -> Weather: 0.7216
Search for weather information -> Gaming: 0.9537
get weather forecast -> Weather: 0.7007
get weather forecast -> Gaming: 0.9006
weather report -> Weather: 0.6919
weather report -> Gaming: 0.8948
temperature today -> Weather: 0.6992
temperature today -> Gaming: 0.8919
===============Qwen3-Embedding===============
python3 -c "
import requests
import json

queries = [
    'Search for weather information',
    'get weather forecast',
    'weather report',
    'temperature today'
]
weather_text = 'Current weather conditions and forecasts'
gaming_text = 'Get List of all Lost Ark Cards details'

for query in queries:
    payload = {'model': 'Qwen3-Embedding', 'input': query}
    resp = requests.post('http://localhost:8001/v1/embeddings', json=payload)
    query_emb = resp.json()['data'][0]['embedding']
    for i, text in enumerate([weather_text, gaming_text]):
        payload2 = {'model': 'Qwen3-Embedding', 'input': text}
        resp2 = requests.post('http://localhost:8001/v1/embeddings', json=payload2)
        emb = resp2.json()['data'][0]['embedding']
        sim = sum([q*e for q, e in zip(query_emb, emb)]) / (sum([q**2 for q in query_emb])**0.5 * sum([e**2 for e in emb])**0.5)
        print(f'{query} -> {\"Weather\" if i==0 else \"Gaming\"}: {sim:.4f}')
    print()
"
Search for weather information -> Weather: 0.8688
Search for weather information -> Gaming: 0.4566
get weather forecast -> Weather: 0.8422
get weather forecast -> Gaming: 0.4513
weather report -> Weather: 0.8197
weather report -> Gaming: 0.4280
temperature today -> Weather: 0.7379
temperature today -> Gaming: 0.4008
I missed the vLLM CLI for Qwen; here it is:
vllm serve the/path/to/Qwen3-Embedding-8B \
    --served-model-name "Qwen3-Embedding" \
    --trust-remote-code \
    --dtype auto \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --port 8001
Oh, it looks like you are using vLLM.
Just to confirm, did you notice the explanation here: https://huggingface.co/tencent/KaLM-Embedding-Gemma3-12B-2511#vllm-support?
When loading with vLLM, you need to use the parameters from the CausalLM branch.
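For example (a sketch; the revision name "CausalLM" is taken from the model card's vLLM note above, the local directory name is arbitrary, and huggingface-cli must be installed):

```shell
# Download the CausalLM branch explicitly, then point vLLM at that copy,
# so the embedding-branch weights are not picked up by mistake.
huggingface-cli download tencent/KaLM-Embedding-Gemma3-12B-2511 \
    --revision CausalLM \
    --local-dir ./KaLM-Embedding-Gemma3-12B-2511-CausalLM

vllm serve ./KaLM-Embedding-Gemma3-12B-2511-CausalLM \
    --served-model-name "KaLM" \
    --trust-remote-code \
    --dtype auto \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --port 8001
```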
I noticed that explanation early on, but I may need to check whether I am actually using that branch. Thank you for your inspiring work and your patience in responding to me.
Hi there, just checking back in—has the issue here been identified and fixed yet?
