Voxtral-4B: Mistral's open-weights TTS model that speaks 9 languages in real time
- Bastien
- 30 Mar, 2026
What is Voxtral-4B?
Voxtral-4B-TTS-2603 is a text-to-speech model released by Mistral AI in March 2026. It converts text to realistic speech in 9 languages, with 20 built-in preset voices and the ability to adapt to a custom voice from a short audio reference.
The model is built on Ministral-3B-Base-2512, Mistral’s compact base model, and scaled to 4B parameters in total. It outputs 24 kHz audio in WAV, PCM, FLAC, MP3, AAC, or Opus. Most importantly for production use: it delivers a first-audio chunk in 70 milliseconds at single concurrency — making it genuinely usable in real-time voice pipelines.
It’s released under CC BY-NC 4.0 — free to use for non-commercial applications, with Mistral AI Studio available for commercial access.
Architecture and pipeline
The model follows the architecture pattern of LLM-based TTS systems: a language model backbone that learns to predict audio tokens, which are then decoded into a waveform. The key advantage over classic TTS pipelines is that the LLM backbone already understands language semantics, prosody cues, and emotional context — no separate NLP preprocessing needed.
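To make the two-stage pattern concrete, here is a toy sketch of the idea — not Voxtral's actual implementation, and every function name here is invented for illustration: a "language model" stage emits discrete audio tokens, and a codec decoder stage turns those tokens into waveform samples.

```python
import numpy as np

def lm_predict_audio_tokens(text: str) -> list[int]:
    # Stand-in for the autoregressive LLM backbone: one fake token per character.
    return [ord(c) % 256 for c in text]

def codec_decode(tokens: list[int], samples_per_token: int = 480) -> np.ndarray:
    # Stand-in for the neural codec decoder: each token becomes a short sine
    # burst. At 24 kHz, 480 samples per token corresponds to 20 ms of audio.
    out = []
    for t in tokens:
        freq = 100.0 + t  # hypothetical token-to-pitch mapping
        n = np.arange(samples_per_token)
        out.append(np.sin(2 * np.pi * freq * n / 24_000).astype(np.float32))
    return np.concatenate(out)

tokens = lm_predict_audio_tokens("Hello")
waveform = codec_decode(tokens)
print(len(tokens), waveform.shape)
```

The real model predicts learned codec tokens rather than per-character sine bursts, but the control flow — text in, token sequence out, waveform decoded from tokens — is the same shape.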
Voice adaptation works by providing a short reference clip (10 seconds). The model conditions its generation on that audio’s speaker characteristics without fine-tuning.
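A request carrying a reference clip might look like the sketch below. The field name `audio_reference` and the base64 encoding are assumptions for illustration — check the Voxtral API documentation for the actual parameter.

```python
import base64

reference_bytes = b"RIFF....WAVE"  # stand-in for ~10 s of real reference audio

payload = {
    "input": "Bonjour tout le monde !",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    # Hypothetical field: a base64-encoded reference clip instead of a preset voice.
    "audio_reference": base64.b64encode(reference_bytes).decode("ascii"),
}
print(sorted(payload))
```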
Languages and voices
9 languages are supported out of the box:
| Language | Region coverage |
|---|---|
| English | Multiple dialects |
| French | — |
| Spanish | Multiple dialects |
| German | — |
| Italian | — |
| Portuguese | — |
| Dutch | — |
| Arabic | Multiple dialects |
| Hindi | — |
20 preset voices are included, spanning different genders, ages, and registers. Custom voice adaptation is available via Mistral AI Studio or by passing an audio reference directly to the API.
Benchmark results
Tested on a single NVIDIA H200, using 500-character input and a 10-second audio reference, with vLLM v0.18.0:
| Concurrency | First-audio latency | RTF | Throughput (char/s/GPU) |
|---|---|---|---|
| 1 | 70 ms | 0.103 | 119 |
| 16 | 331 ms | 0.237 | 879 |
| 32 | 552 ms | 0.302 | 1,431 |
RTF (Real-Time Factor) measures how fast the model generates audio relative to the output duration. An RTF of 0.103 at concurrency 1 means the model generates ~10× faster than real time — it produces 1 second of audio in about 103 ms.
At 32 concurrent requests, the model processes 1,431 characters per second on a single GPU — enough for a large-scale voice agent deployment without horizontal scaling.
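A back-of-envelope capacity estimate follows from that figure. The ~15 characters/second speaking rate below is an assumption (typical English TTS pacing), not a benchmarked number:

```python
throughput_chars_per_s = 1431      # concurrency-32 row, single H200
speaking_rate_chars_per_s = 15     # assumed average TTS speaking rate
concurrent_streams = throughput_chars_per_s / speaking_rate_chars_per_s
print(f"~{concurrent_streams:.0f} simultaneous real-time streams per GPU")
```

Under that assumption, one GPU sustains on the order of 95 simultaneous real-time voice streams.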
Deployment with vLLM
Voxtral uses the vLLM-Omni serving backend, which exposes an OpenAI-compatible /audio/speech endpoint. Install and launch:
```shell
uv pip install -U vllm  # requires >= 0.18.0
uv pip install git+https://github.com/vllm-project/vllm-omni.git --upgrade
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
```
Making a request:
```python
import io

import httpx
import soundfile as sf

# Build a synthesis request for the OpenAI-compatible /v1/audio/speech endpoint.
payload = {
    "input": "Paris is a beautiful city!",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    "voice": "casual_male",
}

response = httpx.post("http://localhost:8000/v1/audio/speech", json=payload, timeout=120.0)
response.raise_for_status()

# Decode the returned WAV bytes into a float32 array plus its sample rate.
audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
print(f"Got audio: {len(audio_array)} samples at {sr} Hz")
```
The OpenAI-compatible API means existing code built against openai.audio.speech works with minimal changes — just swap the base URL and model name.
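A minimal sketch of that swap, assuming the local server accepts the same parameters as OpenAI's speech endpoint (the `api_key` value is a placeholder — a local server typically ignores it):

```python
def synthesize_with_openai_sdk(text: str) -> bytes:
    """Same request as the httpx example, routed through the openai client."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.audio.speech.create(
        model="mistralai/Voxtral-4B-TTS-2603",
        voice="casual_male",
        input=text,
        response_format="wav",
    )
    return response.content

# synthesize_with_openai_sdk("Paris is a beautiful city!")  # needs a running server
```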
Hardware requirement: ≥16 GB VRAM (BF16 weights). Runs on a single consumer GPU like an RTX 4090 or a cloud A10.
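The 16 GB figure is easy to sanity-check: BF16 stores 2 bytes per parameter, so the weights alone account for roughly half of it, with the rest left for activations, the KV cache, and audio decoding.

```python
params = 4e9            # 4B parameters
bytes_per_param = 2     # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB; 16 GB leaves ~{16 - weights_gb:.0f} GB headroom")
```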
Use cases
The combination of low latency and multilingual support targets voice agent workflows specifically — scenarios where a user is speaking in real time and expects a sub-second response. At 70 ms to first audio, Voxtral fits into conversational turn-taking without the awkward delay that degrades user experience in most TTS systems.
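To actually benefit from the 70 ms first-audio latency, a voice agent should consume the response incrementally rather than waiting for the full clip. The sketch below assumes the server streams the response body chunk by chunk (raw PCM avoids container headers arriving mid-stream); in a real agent, each chunk would feed a playback buffer instead of a file.

```python
def stream_speech(text: str, out_path: str = "reply.pcm") -> None:
    import httpx  # imported lazily; requires `pip install httpx`

    payload = {
        "input": text,
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "response_format": "pcm",
        "voice": "casual_male",
    }
    # httpx.stream yields the body as it arrives instead of buffering it all.
    with httpx.stream("POST", "http://localhost:8000/v1/audio/speech",
                      json=payload, timeout=120.0) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_bytes():
                f.write(chunk)  # hand each chunk to the playback buffer here

# stream_speech("One moment, let me check that for you.")  # needs a running server
```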
Limitations
- CC BY-NC 4.0 — non-commercial only; commercial use requires a Mistral AI agreement
- No real-time STT included — Voxtral is TTS only; you need a separate speech-to-text model for full voice conversation
- 16 GB VRAM minimum — rules out lower-end GPUs and CPU-only inference
- vLLM-Omni dependency — slightly more complex than standard vLLM; requires the separate vllm-omni package for now
- Voice training datasets (EARS, CML-TTS, IndicVoices-R, Arabic Natural Audio) are also CC BY-NC 4.0
Conclusion
Voxtral-4B fills a specific gap: a small, open-weights TTS model with production-grade latency and genuine multilingual breadth. Most open TTS models either lack multi-language support, have slow inference, or require large GPU setups. Voxtral hits all three constraints at once — 9 languages, 70 ms latency, single GPU.
For developers building voice agents, accessibility tools, or real-time translation applications, this is the most capable open TTS option available at the 4B weight class.
Model: mistralai/Voxtral-4B-TTS-2603 — CC BY-NC 4.0
Tags:
- AI
- Mistral
- TTS
- Voice
- Audio
- Open Source