Voxtral-4B: Mistral's open-weights TTS model that speaks 9 languages in real time


What is Voxtral-4B?

Voxtral-4B-TTS-2603 is a text-to-speech model released by Mistral AI in March 2026. It converts text to realistic speech in 9 languages, with 20 built-in preset voices and the ability to adapt to a custom voice from a short audio reference.

The model is built on Ministral-3B-Base-2512, Mistral's compact base model, extended to 4B total parameters. It outputs 24 kHz audio in WAV, PCM, FLAC, MP3, AAC, or Opus format. Most importantly for production use: it delivers the first audio chunk in 70 milliseconds at single concurrency, making it genuinely usable in real-time voice pipelines.

It’s released under CC BY-NC 4.0 — free to use for non-commercial applications, with Mistral AI Studio available for commercial access.


Architecture and pipeline

The model follows the architecture pattern of LLM-based TTS systems: a language model backbone that learns to predict audio tokens, which are then decoded into a waveform. The key advantage over classic TTS pipelines is that the LLM backbone already understands language semantics, prosody cues, and emotional context — no separate NLP preprocessing needed.
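The predict-then-decode loop described above can be sketched abstractly. This is a conceptual illustration of the LLM-based TTS pattern, not Voxtral's actual implementation; `lm_step` and `decode` are stand-ins for the backbone and the audio codec decoder.

```python
# Conceptual sketch of the LLM-based TTS loop: the backbone predicts
# discrete audio tokens autoregressively, then a codec decoder turns
# the token sequence into waveform samples. Not Voxtral's real code.
def generate_speech(text_tokens, lm_step, decode, max_tokens=1000, eos=-1):
    context = list(text_tokens)   # text conditioning
    audio_tokens = []
    for _ in range(max_tokens):
        nxt = lm_step(context)    # backbone predicts the next audio token
        if nxt == eos:
            break
        audio_tokens.append(nxt)
        context.append(nxt)       # feed the token back in, LLM-style
    return decode(audio_tokens)   # codec decoder: tokens -> samples

# Toy stand-ins, just to show the control flow:
lm_step = lambda ctx: len(ctx) if len(ctx) < 8 else -1
decode = lambda toks: toks
print(generate_speech([1, 2, 3], lm_step, decode))  # [3, 4, 5, 6, 7]
```

The point of the pattern: because the same autoregressive context carries both text and audio tokens, the backbone's language understanding directly conditions the prosody of what it emits.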

Voice adaptation works by providing a short reference clip (10 seconds). The model conditions its generation on that audio’s speaker characteristics without fine-tuning.
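A request carrying a reference clip might be assembled as below. This is a hedged sketch: the field name `audio_reference` and the base64 encoding are assumptions for illustration, not the confirmed API schema.

```python
import base64

# Hypothetical payload for voice adaptation: the "audio_reference"
# field name and base64 transport are assumptions, not documented API.
def build_adaptation_payload(text: str, reference_wav: bytes) -> dict:
    return {
        "input": text,
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "response_format": "wav",
        "audio_reference": base64.b64encode(reference_wav).decode("ascii"),
    }

payload = build_adaptation_payload("Bonjour !", b"RIFF....WAVE")
```

The reference bytes here are a placeholder; in practice this would be a ~10-second WAV of the target speaker.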


Languages and voices

Nine languages are supported out of the box:

Language     Region coverage
English      Multiple dialects
French
Spanish      Multiple dialects
German
Italian
Portuguese
Dutch
Arabic       Multiple dialects
Hindi

20 preset voices are included, ranging in gender, age, and register. Custom voice adaptation is available via Mistral AI Studio or by passing an audio reference directly to the API.


Benchmark results

Tested on a single NVIDIA H200, using 500-character input and a 10-second audio reference, with vLLM v0.18.0:

Concurrency   First-audio latency   RTF     Throughput (char/s/GPU)
1             70 ms                 0.103   119
16            331 ms                0.237   879
32            552 ms                0.302   1,431

RTF (Real-Time Factor) measures how fast the model generates audio relative to the output duration. An RTF of 0.103 at concurrency 1 means the model generates ~10× faster than real time — it produces 1 second of audio in about 103 ms.
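The RTF arithmetic in concrete numbers, as a quick sanity check on the table's concurrency-1 row:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: compute time divided by audio duration produced."""
    return generation_seconds / audio_seconds

# At concurrency 1: ~0.103 s of compute per 1 s of audio output.
assert rtf(0.103, 1.0) < 1.0        # faster than real time
speedup = 1 / rtf(0.103, 1.0)       # ~9.7x real time
```

Any RTF below 1.0 means generation keeps ahead of playback, which is the hard requirement for streaming speech.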

At 32 concurrent requests, the model processes 1,431 characters per second on a single GPU — enough for a large-scale voice agent deployment without horizontal scaling.
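To translate 1,431 char/s into concurrent real-time streams, you need an average speaking rate; roughly 15 characters of text per second of English speech is a common ballpark and is an assumption here, not a figure from the source.

```python
throughput_chars_per_s = 1431   # from the concurrency-32 benchmark row
chars_per_audio_second = 15     # assumed average speaking rate (English)

# Rough ceiling on always-on, real-time streams one GPU could sustain:
max_streams = throughput_chars_per_s / chars_per_audio_second
print(round(max_streams))       # ~95
```

Real deployments would leave headroom below this ceiling, but it illustrates why horizontal scaling is often unnecessary.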


Deployment with vLLM

Voxtral uses the vLLM-Omni serving backend, which exposes an OpenAI-compatible /audio/speech endpoint. Install and launch:

uv pip install -U vllm  # >= 0.18.0
uv pip install git+https://github.com/vllm-project/vllm-omni.git --upgrade
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni

Making a request:

import io
import httpx
import soundfile as sf

payload = {
    "input": "Paris is a beautiful city!",
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "response_format": "wav",
    "voice": "casual_male",
}

# POST to the OpenAI-compatible speech endpoint exposed by vLLM-Omni
response = httpx.post("http://localhost:8000/v1/audio/speech", json=payload, timeout=120.0)
response.raise_for_status()

# Decode the returned WAV bytes into a float32 sample array
audio_array, sr = sf.read(io.BytesIO(response.content), dtype="float32")
print(f"Got audio: {len(audio_array)} samples at {sr} Hz")

The OpenAI-compatible API means existing code built against openai.audio.speech works with minimal changes — just swap the base URL and model name.
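The swap amounts to changing two or three fields. A sketch of the same request body pointed at OpenAI's hosted endpoint versus the local vLLM server (the hosted-side model and voice names are OpenAI's, shown only for contrast):

```python
# Same request shape, two targets: only URL, model, and voice differ.
body = {"input": "Paris is a beautiful city!", "response_format": "wav"}

hosted = {**body, "model": "tts-1", "voice": "alloy"}        # OpenAI-hosted
local = {**body, "model": "mistralai/Voxtral-4B-TTS-2603",   # local vLLM
         "voice": "casual_male"}

hosted_url = "https://api.openai.com/v1/audio/speech"
local_url = "http://localhost:8000/v1/audio/speech"
```

Because the request schema is identical, client code, retry logic, and audio decoding stay unchanged when migrating.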

Hardware requirement: ≥16 GB VRAM (BF16 weights). Runs on a single consumer GPU like an RTX 4090 or a cloud A10.


Use cases

The combination of low latency and multilingual support targets voice agent workflows specifically — scenarios where a user is speaking in real time and expects a sub-second response. At 70 ms to first audio, Voxtral fits into conversational turn-taking without the awkward delay that degrades user experience in most TTS systems.


Limitations

  • CC BY-NC 4.0 — non-commercial only; commercial use requires a Mistral AI agreement
  • No real-time STT included — Voxtral is TTS only; you need a separate speech-to-text model for full voice conversation
  • 16 GB VRAM minimum — rules out lower-end GPUs and CPU-only inference
  • vLLM-Omni dependency — slightly more complex than standard vLLM; requires the separate vllm-omni package for now
  • Voice training datasets (EARS, CML-TTS, IndicVoices-R, Arabic Natural Audio) are also CC BY-NC 4.0

Conclusion

Voxtral-4B fills a specific gap: a small, open-weights TTS model with production-grade latency and genuine multilingual breadth. Most open TTS models either lack multi-language support, have slow inference, or require large GPU setups. Voxtral hits all three constraints at once — 9 languages, 70 ms latency, single GPU.

For developers building voice agents, accessibility tools, or real-time translation applications, this is the most capable open TTS option available at the 4B weight class.

Model: mistralai/Voxtral-4B-TTS-2603 — CC BY-NC 4.0
