Voxtral-4B: Mistral's open-weights TTS model that speaks 9 languages in real time
- Bastien
- 30 Mar, 2026
What is Voxtral-4B?
Voxtral-4B-TTS-2603 is a text-to-speech model released by Mistral AI in March 2026. It converts text to realistic speech in 9 languages, with 20 built-in preset voices and the ability to adapt to a custom voice from a short audio reference.
The model is built on Ministral-3B-Base-2512, Mistral’s compact base model, and scaled to 4B parameters in total. It outputs 24 kHz audio in WAV, PCM, FLAC, MP3, AAC, or Opus. Most importantly for production use: it delivers a first-audio chunk in 70 milliseconds at single concurrency — making it genuinely usable in real-time voice pipelines.
It’s released under CC BY-NC 4.0 — free to use for non-commercial applications, with Mistral AI Studio available for commercial access.
Architecture and pipeline
The model follows the architecture pattern of LLM-based TTS systems: a language model backbone that learns to predict audio tokens, which are then decoded into a waveform. The key advantage over classic TTS pipelines is that the LLM backbone already understands language semantics, prosody cues, and emotional context — no separate NLP preprocessing needed.
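To make the two-stage pattern concrete, here is a toy sketch of the idea — not Voxtral's actual implementation, and every function name here is invented for illustration: a "language model" stage emits discrete audio tokens, and a codec decoder stage turns those tokens into waveform samples.

```python
import numpy as np

def lm_predict_audio_tokens(text: str) -> list[int]:
    # Stand-in for the autoregressive LLM backbone: one fake token per character.
    return [ord(c) % 256 for c in text]

def codec_decode(tokens: list[int], samples_per_token: int = 480) -> np.ndarray:
    # Stand-in for the neural codec decoder: each token becomes a short sine
    # burst. At 24 kHz, 480 samples per token corresponds to 20 ms of audio.
    out = []
    for t in tokens:
        freq = 100.0 + t  # hypothetical token-to-pitch mapping
        n = np.arange(samples_per_token)
        out.append(np.sin(2 * np.pi * freq * n / 24_000).astype(np.float32))
    return np.concatenate(out)

tokens = lm_predict_audio_tokens("Hello")
waveform = codec_decode(tokens)
print(len(tokens), waveform.shape)
```

The real model predicts learned codec tokens rather than per-character sine bursts, but the control flow — text in, token sequence out, waveform decoded from tokens — is the same shape.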
Voice adaptation works by providing a short reference clip (10 seconds). The model conditions its generation on that audio’s speaker characteristics without fine-tuning.
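A request carrying a reference clip might look like the sketch below. The field name `audio_reference` and the base64 encoding are assumptions for illustration — check the Voxtral API documentation for the actual parameter.

```python
import base64

reference_bytes = b"RIFF....WAVE"  # stand-in for ~10 s of real reference audio

payload = {
    "input": "Bonjour tout le monde !",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    # Hypothetical field: a base64-encoded reference clip instead of a preset voice.
    "audio_reference": base64.b64encode(reference_bytes).decode("ascii"),
}
print(sorted(payload))
```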
Languages and voices
9 languages are supported out of the box:
| Language | Region coverage |
|---|---|
| English | Multiple dialects |
| French | — |
| Spanish | Multiple dialects |
| German | — |
| Italian | — |
| Portuguese | — |
| Dutch | — |
| Arabic | Multiple dialects |
| Hindi | — |
20 preset voices are included, spanning different genders, ages, and registers. Custom voice adaptation is available via Mistral AI Studio or by passing an audio reference directly to the API.
Benchmark results
Tested on a single NVIDIA H200, using 500-character input and a 10-second audio reference, with vLLM v0.18.0:
| Concurrency | First-audio latency | RTF | Throughput (char/s/GPU) |
|---|---|---|---|
| 1 | 70 ms | 0.103 | 119 |
| 16 | 331 ms | 0.237 | 879 |
| 32 | 552 ms | 0.302 | 1,431 |
RTF (Real-Time Factor) measures how fast the model generates audio relative to the output duration. An RTF of 0.103 at concurrency 1 means the model generates ~10× faster than real time — it produces 1 second of audio in about 103 ms.
At 32 concurrent requests, the model processes 1,431 characters per second on a single GPU — enough for a large-scale voice agent deployment without horizontal scaling.
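A back-of-envelope capacity estimate follows from that figure. The ~15 characters/second speaking rate below is an assumption (typical English TTS pacing), not a benchmarked number:

```python
throughput_chars_per_s = 1431      # concurrency-32 row, single H200
speaking_rate_chars_per_s = 15     # assumed average TTS speaking rate
concurrent_streams = throughput_chars_per_s / speaking_rate_chars_per_s
print(f"~{concurrent_streams:.0f} simultaneous real-time streams per GPU")
```

Under that assumption, one GPU sustains on the order of 95 simultaneous real-time voice streams.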
Deployment with vLLM
Voxtral uses the vLLM-Omni serving backend, which exposes an OpenAI-compatible /audio/speech endpoint. Install and launch:
```shell
uv pip install -U vllm  # requires >= 0.18.0
uv pip install git+https://github.com/vllm-project/vllm-omni.git --upgrade
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni
```
Making a request:
```python
import io

import httpx
import soundfile as sf

# Build a synthesis request for the OpenAI-compatible /v1/audio/speech endpoint.
payload = {
    "input": "Paris is a beautiful city!",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    "voice": "casual_male",
}

response = httpx.post("http://localhost:8000/v1/audio/speech", json=payload, timeout=120.0)
response.raise_for_status()

# Decode the returned WAV bytes into a float32 array plus its sample rate.
audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
print(f"Got audio: {len(audio_array)} samples at {sr} Hz")
```
The OpenAI-compatible API means existing code built against openai.audio.speech works with minimal changes — just swap the base URL and model name.
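A minimal sketch of that swap, assuming the local server accepts the same parameters as OpenAI's speech endpoint (the `api_key` value is a placeholder — a local server typically ignores it):

```python
def synthesize_with_openai_sdk(text: str) -> bytes:
    """Same request as the httpx example, routed through the openai client."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    response = client.audio.speech.create(
        model="mistralai/Voxtral-4B-TTS-2603",
        voice="casual_male",
        input=text,
        response_format="wav",
    )
    return response.content

# synthesize_with_openai_sdk("Paris is a beautiful city!")  # needs a running server
```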
Hardware requirement: ≥16 GB VRAM (BF16 weights). Runs on a single consumer GPU like an RTX 4090 or a cloud A10.
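The 16 GB figure is easy to sanity-check: BF16 stores 2 bytes per parameter, so the weights alone account for roughly half of it, with the rest left for activations, the KV cache, and audio decoding.

```python
params = 4e9            # 4B parameters
bytes_per_param = 2     # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB; 16 GB leaves ~{16 - weights_gb:.0f} GB headroom")
```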
Use cases
The combination of low latency and multilingual support targets voice agent workflows specifically — scenarios where a user is speaking in real time and expects a sub-second response. At 70 ms to first audio, Voxtral fits into conversational turn-taking without the awkward delay that degrades user experience in most TTS systems.
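To actually benefit from the 70 ms first-audio latency, a voice agent should consume the response incrementally rather than waiting for the full clip. The sketch below assumes the server streams the response body chunk by chunk (raw PCM avoids container headers arriving mid-stream); in a real agent, each chunk would feed a playback buffer instead of a file.

```python
def stream_speech(text: str, out_path: str = "reply.pcm") -> None:
    import httpx  # imported lazily; requires `pip install httpx`

    payload = {
        "input": text,
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "response_format": "pcm",
        "voice": "casual_male",
    }
    # httpx.stream yields the body as it arrives instead of buffering it all.
    with httpx.stream("POST", "http://localhost:8000/v1/audio/speech",
                      json=payload, timeout=120.0) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_bytes():
                f.write(chunk)  # hand each chunk to the playback buffer here

# stream_speech("One moment, let me check that for you.")  # needs a running server
```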
Limitations
- CC BY-NC 4.0 — non-commercial only; commercial use requires a Mistral AI agreement
- No real-time STT included — Voxtral is TTS only; you need a separate speech-to-text model for full voice conversation
- 16 GB VRAM minimum — rules out lower-end GPUs and CPU-only inference
- vLLM-Omni dependency — slightly more complex than standard vLLM; requires the separate vllm-omni package for now
- Voice training datasets (EARS, CML-TTS, IndicVoices-R, Arabic Natural Audio) are also CC BY-NC 4.0
Conclusion
Voxtral-4B fills a specific gap: a small, open-weights TTS model with production-grade latency and genuine multilingual breadth. Most open TTS models either lack multi-language support, have slow inference, or require large GPU setups. Voxtral hits all three constraints at once — 9 languages, 70 ms latency, single GPU.
For developers building voice agents, accessibility tools, or real-time translation applications, this is the most capable open TTS option available at the 4B weight class.
Model: mistralai/Voxtral-4B-TTS-2603 — CC BY-NC 4.0
Tags:
- AI
- Mistral
- TTS
- Voice
- Audio
- Open Source