---
language:
- en
- fr
- de
- es
- it
- pt
- nl
- hi
license: apache-2.0
library_name: vllm
inference: false
base_model:
- mistralai/Mistral-Small-24B-Base-2501
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
pipeline_tag: audio-text-to-text
---
# Voxtral Small 24B-2507
Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription and understanding.

Learn more about Voxtral in our blog post here.

## Key Features

Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities that go beyond transcription:
- Long-form context: With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding
- Built-in Q&A and summarization: Supports asking questions directly about the audio content or generating structured summaries, without the need to chain separate ASR and language models
- Natively multilingual: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, to name a few), helping teams serve global audiences with a single system
- Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents, turning voice interactions into actionable system commands without intermediate parsing steps
- Highly capable at text: Retains the text understanding capabilities of its language model backbone, Mistral Small 3
## Benchmark Results

### Audio

### Text
## Usage

The model can be used with the following frameworks:
- `vllm` (recommended): See here

**Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.

**Note 2**: Make sure to add a system prompt to the model to best tailor it to your needs.
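The two notes above can be combined in a single request. A minimal sketch of an OpenAI-style chat payload that applies both recommendations (the system prompt text here is only an illustrative assumption, not a prompt shipped with the model):

```python
# Sketch: a chat-completions payload with a low sampling temperature
# and an explicit system prompt, per the two notes above.
# The system prompt wording is an illustrative assumption.

def build_chat_payload(model: str, user_text: str) -> dict:
    """Build an OpenAI-style chat payload with a system prompt and temperature=0.15."""
    return {
        "model": model,
        "temperature": 0.15,  # recommended relatively low temperature
        "messages": [
            {"role": "system", "content": "You are a helpful voice assistant."},
            {"role": "user", "content": user_text},
        ],
    }

payload = build_chat_payload("mistralai/Voxtral-Small-24B-2507", "Hello!")
print(payload["temperature"])
```

The same dictionary can be passed directly to `client.chat.completions.create(**payload)` with the OpenAI Python client shown in the examples below.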
### vLLM (recommended)

We recommend using this model with vLLM.

#### Installation
Make sure to install vLLM >= 0.#.#:

```sh
pip install vllm --upgrade
```

Doing so should automatically install mistral_common >= 1.#.#.

To check:

```sh
python -c "import mistral_common; print(mistral_common.__version__)"
```
You can also make use of a ready-to-go Docker image or one from Docker Hub.
#### Serve
We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.
- Spin up a server:

```sh
vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --tensor-parallel-size 2
```

**Note**: Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
- To query the server you can use a simple Python snippet. See the following examples.
#### Audio Instruct

Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.

Python snippet: TODO
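Until the snippet above is filled in, here is a minimal sketch of how an audio-plus-text user message could be assembled by hand with only the standard library. It assumes the server accepts OpenAI-style `input_audio` content parts carrying base64-encoded audio; older servers may expect an `audio_url` part instead, so treat the content type as an assumption to check against your vLLM version.

```python
import base64

def audio_chat_message(audio_bytes: bytes, question: str, fmt: str = "mp3") -> dict:
    """Build a user message mixing one audio chunk and one text chunk.

    Assumes OpenAI-style ``input_audio`` content parts are accepted;
    this is a sketch, not the official client path (see the full
    mistral_common-based example further below).
    """
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": encoded, "format": fmt}},
            {"type": "text", "text": question},
        ],
    }

msg = audio_chat_message(b"\x00\x01", "What is said in this clip?")
print(msg["content"][0]["type"])
```

For production use, prefer the `mistral_common` chunk helpers shown in the Client - Chat example below, which handle resampling and formatting for you.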
#### Transcription

Voxtral-Small-24B-2507 has powerful transcription capabilities!

Python snippet: TODO
#### Function calling

Voxtral-Small-24B-2507 is excellent at function / tool calling tasks via vLLM. For example:

Python snippet: TODO
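A hedged sketch of what such a snippet could look like: it declares a hypothetical `get_weather` tool in the OpenAI tools schema and attaches it to a chat request payload. The tool name, description, and parameters are illustrative assumptions, not part of the model card.

```python
# Sketch: declare a hypothetical tool and attach it to a chat request payload.
# The `get_weather` tool and its parameters are illustrative assumptions.

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

def tool_call_payload(model: str, user_content: list) -> dict:
    """Chat-completions payload letting the model trigger `get_weather`."""
    return {
        "model": model,
        "temperature": 0.15,
        "messages": [{"role": "user", "content": user_content}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }

payload = tool_call_payload(
    "mistralai/Voxtral-Small-24B-2507",
    [{"type": "text", "text": "What's the weather in Paris?"}],
)
print(payload["tools"][0]["function"]["name"])
```

With the server started with `--tool-call-parser mistral --enable-auto-tool-choice` as shown above, the structured call should come back in `response.choices[0].message.tool_calls`.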
## Examples

### Client/Server

#### Server

```sh
vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --max_model_len 32768
```
#### Client - Chat

```python
#!/usr/bin/env python3
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker do you prefer between the two? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.0,
    max_tokens=32768,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# "The speaker who delivers the farewell address is more engaging and inspiring. They express gratitude and optimism, emphasizing the importance of self-government and citizenship. They also share personal experiences and observations, making the speech more relatable and heartfelt. In contrast, the second speaker provides factual information about the weather in Barcelona, which is less engaging and lacks the emotional depth of the first speaker's address."

messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai(),
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    max_tokens=32768,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
```
#### Client - Transcribe

```python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)
audio = RawAudio.from_audio(audio)

req = TranscriptionRequest(model=model, audio=audio, language="en").to_openai(exclude=("top_p", "seed"))
response = client.audio.transcriptions.create(**req)
print(response)
```



