---
language:
  - en
  - fr
  - de
  - es
  - it
  - pt
  - nl
  - hi
license: apache-2.0
library_name: vllm
inference: false
base_model:
  - mistralai/Mistral-Small-24B-Base-2501
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
pipeline_tag: audio-text-to-text
---

# Voxtral Small 24B-2507

Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription and understanding.

Learn more about Voxtral in our [blog post](https://mistral.ai/news/voxtral).

Both Voxtral models go beyond transcription, offering the capabilities outlined below.

## Key Features

Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.

- **Long-form context**: With a 32k-token context window, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding.
- **Built-in Q&A and summarization**: Supports asking questions directly about the audio content or generating structured summaries, without the need to chain separate ASR and language models.
- **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian, to name a few), helping teams serve global audiences with a single system.
- **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents, turning voice interactions into actionable system commands without intermediate parsing steps.
- **Highly capable at text**: Retains the text understanding capabilities of its language-model backbone, Mistral Small 3.

## Benchmark Results

### Audio

*(Benchmark figures omitted.)*

### Text

## Usage

The model can be used with the following frameworks:

- vLLM (recommended), as described below

**Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.

**Note 2**: Make sure to add a system prompt to the model to best tailor it to your needs.
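
As a minimal sketch combining both notes (assuming a vLLM server already running on `localhost:8000`, as in the examples below; the system-prompt wording is illustrative, not prescribed by this card):

```py
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (address is an assumption).
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[
        # Illustrative system prompt -- tailor it to your use case.
        {"role": "system", "content": "You are a helpful assistant that answers questions about audio."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.15,  # low temperature, as recommended above
)
print(response.choices[0].message.content)
```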

### vLLM (recommended)

We recommend using this model with vLLM.

#### Installation

Make sure to install vLLM >= 0.#.#:

```sh
pip install vllm --upgrade
```

Doing so should automatically install `mistral_common` >= 1.#.#.

To check:

```sh
python -c "import mistral_common; print(mistral_common.__version__)"
```

You can also make use of a ready-to-go Docker image, or one from Docker Hub.
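
For example, a sketch using vLLM's official `vllm/vllm-openai` image (the image name and flags follow vLLM's standard Docker usage and are assumptions, not taken from this card):

```sh
# Serve the model with vLLM's OpenAI-compatible server image (sketch).
docker run --gpus all -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model mistralai/Voxtral-Small-24B-2507 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --tensor-parallel-size 2
```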

#### Serve

We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.

1. Spin up a server:

```sh
vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --tensor-parallel-size 2
```

**Note**: Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
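
Once the server is up, you can sanity-check it against vLLM's OpenAI-compatible `/v1/models` endpoint (host and port assume the default local setup):

```sh
# List the models served by the vLLM server; should include Voxtral-Small-24B-2507.
curl http://localhost:8000/v1/models
```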

2. To query the server, you can use a simple Python snippet. See the following examples.

#### Audio Instruct

Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.

A full client/server chat example is provided under "Client - Chat" in the Examples section below.

#### Transcription

Voxtral-Small-24B-2507 has powerful transcription capabilities!

A full transcription example is provided under "Client - Transcribe" in the Examples section below.

#### Function calling

Voxtral-Small-24B-2507 is excellent at function / tool calling via vLLM.
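A hedged sketch of voice-driven tool calling (the `get_weather` tool, its schema, and the choice of audio sample are illustrative; the server is assumed to be running with `--tool-call-parser mistral --enable-auto-tool-choice`, as shown above):

```py
from huggingface_hub import hf_hub_download
from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage
from mistral_common.audio import Audio
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

# Illustrative tool definition -- swap in your own backend function.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "Name of the city."}
                },
                "required": ["city"],
            },
        },
    }
]

# A spoken request about the weather should trigger the tool.
audio_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
audio_chunk = AudioChunk.from_audio(Audio.from_file(audio_file, strict=False))
user_msg = UserMessage(content=[audio_chunk]).to_openai()

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.15,
    tools=tools,
)
# If the model decided to call the tool, the call (name + JSON arguments) is here.
print(response.choices[0].message.tool_calls)
```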

To install vLLM from source with audio support instead, you can use (cf. https://github.com/vllm-project/vllm/pull/20970#pullrequestreview-3019578541):

```sh
VLLM_USE_PRECOMPILED=1 pip install --editable .\[audio\]
```

#### Examples

The examples below use a client/server setup.

##### Server

```sh
vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --max_model_len 32768
```

##### Client - Chat

```py
#!/usr/bin/env python3
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Download two sample audio files: a speech excerpt and a weather report.
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

# First turn: two audio files followed by a text question about them.
text_chunk = TextChunk(text="Which speaker do you prefer between the two? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.0,
    max_tokens=32768,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# Example output:
# "The speaker who delivers the farewell address is more engaging and inspiring. They express gratitude and optimism, emphasizing the importance of self-government and citizenship. They also share personal experiences and observations, making the speech more relatable and heartfelt. In contrast, the second speaker provides factual information about the weather in Barcelona, which is less engaging and lacks the emotional depth of the first speaker's address."

# Second turn: a follow-up question referring back to the first audio.
messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    max_tokens=32768,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
```

##### Client - Transcribe

```py
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

# Build an OpenAI-compatible transcription request from the raw audio.
audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en").to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)
```