Cohere Transcribe: a 2B ASR model that tops the English leaderboard
- Bastien
- 03 Apr, 2026
What is Cohere Transcribe?
Cohere Transcribe 03-2026 is an automatic speech recognition (ASR) model released by Cohere Labs. With 2B parameters, it ranks #1 on the English ASR leaderboard as of March 2026, achieving an average Word Error Rate (WER) of 5.42 across 8 benchmarks while running at 524x real-time speed (RTFx), roughly 3x faster than comparable models.
It supports 14 languages, handles long-form audio through automatic chunking, and is available under the Apache 2.0 license.
Architecture
- Conformer encoder — combines convolutional and self-attention layers, making it effective for capturing both local acoustic features and long-range temporal dependencies.
- Transformer decoder — lightweight design keeps the model fast while maintaining text quality.
- Training — trained from scratch using supervised cross-entropy; no Whisper-based distillation.
- Total size: 2B parameters.
Language support
14 languages:
| Region | Languages |
|---|---|
| Europe | English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish |
| Asia-Pacific | Chinese (Mandarin), Japanese, Korean, Vietnamese |
| MENA | Arabic |
Note: language must be specified explicitly — there is no automatic language detection.
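Because there is no auto-detection, it helps to fail fast on unsupported codes before invoking the model. This guard is hypothetical: the exact codes the processor accepts (assumed here to be ISO 639-1) should be checked against the model card.

```python
# Hypothetical guard; the accepted code set is an assumption (ISO 639-1),
# not taken from the model card.
SUPPORTED_LANGUAGES = {
    "en", "fr", "de", "it", "es", "pt", "el", "nl", "pl",  # Europe
    "zh", "ja", "ko", "vi",                                # Asia-Pacific
    "ar",                                                  # MENA
}

def check_language(code: str) -> str:
    """Raise early instead of sending an unsupported code to the model."""
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; choose one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return code
```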
Benchmark results
English ASR leaderboard — #1 overall (March 2026)
| Model | Avg WER ↓ | AMI | Earnings22 | Gigaspeech | LS Clean | LS Other | SPGISpeech | TedLium | VoxPopuli |
|---|---|---|---|---|---|---|---|---|---|
| Cohere Transcribe | 5.42 | 8.15 | 10.84 | 9.33 | 1.25 | 2.37 | 3.08 | 2.49 | 5.87 |
Lower WER is better. The model leads on 3 of 8 benchmarks and takes the #1 overall average.
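For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch of the metric; leaderboard scoring typically also applies text normalization (lowercasing, punctuation stripping) before comparison, which this omits.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            diag, d[j] = d[j], min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                diag + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
            )
    return d[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words
```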
Throughput:
| Metric | Value |
|---|---|
| RTFx (Real-Time Factor) | 524.88 |
| Speed vs comparable models | ~3x faster |
RTFx of 524.88 means 1 second of audio is transcribed in ~1.9 milliseconds.
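The arithmetic generalizes: expected wall-clock time is audio duration divided by RTFx. A quick sanity-check helper:

```python
RTFX = 524.88  # from the throughput table above

def transcription_time_s(audio_duration_s: float, rtfx: float = RTFX) -> float:
    """Expected wall-clock seconds to transcribe audio at a given real-time factor."""
    return audio_duration_s / rtfx

print(round(transcription_time_s(1.0) * 1000, 2))  # ~1.91 ms for 1 s of audio
print(round(transcription_time_s(3600), 1))        # ~6.9 s for a 1-hour file
```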
Usage
Install:
```shell
pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf
```
Basic transcription:
```python
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained(
    "CohereLabs/cohere-transcribe-03-2026",
    device_map="auto"
)

audio = load_audio("audio.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
```
Long-form audio (automatic chunking):
```python
from datasets import load_dataset
import time

ds = load_dataset("distil-whisper/earnings22", "full", split="test", streaming=True)
sample = next(iter(ds))
audio_array = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]
duration_s = len(audio_array) / sr

inputs = processor(audio=audio_array, sampling_rate=sr, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index)[0]
elapsed = time.time() - start

print(f"Transcribed {duration_s:.0f}s of audio in {elapsed:.1f}s (RTFx: {duration_s/elapsed:.0f}x)")
print(text)
```
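The processor handles chunking internally, but conceptually it amounts to slicing the waveform into overlapping fixed-length windows. A sketch with assumed window and overlap values; the model's actual window length, overlap, and chunk-merging logic may differ.

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sr: int,
                window_s: float = 30.0, overlap_s: float = 1.0) -> list:
    """Slice a waveform into overlapping fixed-length windows.

    Illustrative only: window_s and overlap_s are assumed values, not the
    processor's real internals. The final chunk may be shorter than window_s.
    """
    window = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    return [samples[i:i + window] for i in range(0, len(samples), step)]
```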
Batched inference:
```python
from transformers.audio_utils import load_audio

audio_short = load_audio("short.mp3", sampling_rate=16000)
audio_long = load_audio("long.mp3", sampling_rate=16000)

inputs = processor(
    [audio_short, audio_long],
    sampling_rate=16000,
    return_tensors="pt",
    language="en"
)
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
texts = processor.decode(outputs, skip_special_tokens=True,
                         audio_chunk_index=audio_chunk_index, language="en")
print(texts)
```
Punctuation control:
```python
# With punctuation (default)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt",
                   language="en", punctuation=True)

# Without punctuation (lowercase, no punctuation marks — useful for downstream NLP)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt",
                   language="en", punctuation=False)
```
Optimized throughput with compilation:
```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "CohereLabs/cohere-transcribe-03-2026"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, trust_remote_code=True).cuda().eval()

texts = model.transcribe(
    processor=processor,
    audio_arrays=[audio_array],
    sample_rates=[sr],
    language="en",
    compile=True,  # torch.compile for higher throughput
    pipeline_detokenization=True,
    batch_size=16
)
print(texts[0])
```
Production deployment with vLLM
```shell
# Install
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
uv pip install "vllm[audio]" librosa

# Start server
vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code

# Send a request
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -F "file=@audio.wav" \
  -F "model=CohereLabs/cohere-transcribe-03-2026"
```
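The same request can be issued from Python. This sketch uses `requests` to build (without sending) the multipart call so its shape is visible; the localhost base URL and the `EMPTY` API-key default are local-deployment assumptions.

```python
import requests

def build_transcription_request(audio_bytes: bytes, filename: str = "audio.wav",
                                base_url: str = "http://localhost:8000",
                                api_key: str = "EMPTY") -> requests.PreparedRequest:
    """Build the multipart POST matching the curl call above (not sent here)."""
    return requests.Request(
        "POST",
        f"{base_url}/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"file": (filename, audio_bytes, "audio/wav")},
        data={"model": "CohereLabs/cohere-transcribe-03-2026"},
    ).prepare()

req = build_transcription_request(b"\x00\x01")
# To actually send it: resp = requests.Session().send(req); print(resp.json()["text"])
```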
Ecosystem
| Platform | Status |
|---|---|
| Hugging Face Transformers | Native support |
| vLLM | Production serving |
| mlx-audio | Apple Silicon |
| Rust | cohere_transcribe_rs |
| Browser | transformers.js + WebGPU |
| Chrome extension | cohere_transcribe_extension |
| iOS | Whisper Memos |
18 quantized variants are also available on the Hub.
Limitations
- No automatic language detection — you must specify the language code upfront; the model will not switch languages mid-audio.
- No timestamps or speaker diarization — if you need word-level timestamps or who-said-what, you’ll need a separate pipeline.
- Silence handling — the model may attempt to transcribe non-speech sounds; a VAD (voice activity detection) preprocessing step is recommended for noisy environments.
- Code-switching — inconsistent on audio that switches between languages within the same utterance.
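The VAD preprocessing suggested above can start as simple as an energy gate. A rough sketch: the frame size and threshold are arbitrary assumptions, and trained VADs (e.g. Silero) are the usual production choice.

```python
import numpy as np

def energy_vad(samples: np.ndarray, sr: int,
               frame_ms: int = 30, threshold: float = 0.01) -> np.ndarray:
    """Naive per-frame speech gate: True where frame RMS exceeds a threshold."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold
```

Frames flagged False can be dropped or zeroed before the audio reaches the processor, reducing hallucinated transcripts on silence and noise.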
Conclusion
Cohere Transcribe 03-2026 makes a clear case on benchmarks: #1 WER on the English ASR leaderboard, 3x faster than comparable models, under Apache 2.0. For teams building transcription pipelines — meeting notes, call center analytics, subtitle generation — this is now the strongest open-weights option at any size.
The automatic chunking for long-form audio, punctuation control, and broad ecosystem support (vLLM, Apple Silicon, browser, mobile) make it practical across a wide range of deployment scenarios.
Tags:
- AI
- Cohere
- ASR
- Speech Recognition
- Audio
- Open Source