Cohere Transcribe: a 2B ASR model that tops the English leaderboard
- Bastien
- 03 Apr, 2026
What is Cohere Transcribe?
Cohere Transcribe 03-2026 is an automatic speech recognition (ASR) model released by Cohere Labs. With 2B parameters, it ranks #1 on the English ASR leaderboard as of March 2026, achieving an average Word Error Rate (WER) of 5.42 across 8 benchmarks while running at 524x real-time speed (RTFx), roughly 3x faster than comparable models.
It supports 14 languages, handles long-form audio through automatic chunking, and is available under the Apache 2.0 license.
Architecture
- Conformer encoder — combines convolutional and self-attention layers, making it effective for capturing both local acoustic features and long-range temporal dependencies.
- Transformer decoder — lightweight design keeps the model fast while maintaining text quality.
- Training — trained from scratch using supervised cross-entropy; no Whisper-based distillation.
- Total size: 2B parameters.
Language support
14 languages:
| Region | Languages |
|---|---|
| Europe | English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish |
| Asia-Pacific | Chinese (Mandarin), Japanese, Korean, Vietnamese |
| MENA | Arabic |
Note: language must be specified explicitly — there is no automatic language detection.
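Because there is no auto-detection, it helps to fail fast on unsupported codes before invoking the model. This guard is hypothetical: the exact codes the processor accepts (assumed here to be ISO 639-1) should be checked against the model card.

```python
# Hypothetical guard; the accepted code set is an assumption (ISO 639-1),
# not taken from the model card.
SUPPORTED_LANGUAGES = {
    "en", "fr", "de", "it", "es", "pt", "el", "nl", "pl",  # Europe
    "zh", "ja", "ko", "vi",                                # Asia-Pacific
    "ar",                                                  # MENA
}

def check_language(code: str) -> str:
    """Raise early instead of sending an unsupported code to the model."""
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {code!r}; choose one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return code
```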
Benchmark results
English ASR leaderboard — #1 overall (March 2026)
| Model | Avg WER ↓ | AMI | Earnings22 | Gigaspeech | LS Clean | LS Other | SPGISpeech | TedLium | VoxPopuli |
|---|---|---|---|---|---|---|---|---|---|
| Cohere Transcribe | 5.42 | 8.15 | 10.84 | 9.33 | 1.25 | 2.37 | 3.08 | 2.49 | 5.87 |
Lower WER is better. The model leads on 3 of 8 benchmarks and takes the #1 overall average.
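For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch of the metric; leaderboard scoring typically also applies text normalization (lowercasing, punctuation stripping) before comparison, which this omits.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            diag, d[j] = d[j], min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                diag + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
            )
    return d[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words
```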
Throughput:
| Metric | Value |
|---|---|
| RTFx (Real-Time Factor) | 524.88 |
| Speed vs comparable models | ~3x faster |
RTFx of 524.88 means 1 second of audio is transcribed in ~1.9 milliseconds.
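The arithmetic generalizes: expected wall-clock time is audio duration divided by RTFx. A quick sanity-check helper:

```python
RTFX = 524.88  # from the throughput table above

def transcription_time_s(audio_duration_s: float, rtfx: float = RTFX) -> float:
    """Expected wall-clock seconds to transcribe audio at a given real-time factor."""
    return audio_duration_s / rtfx

print(round(transcription_time_s(1.0) * 1000, 2))  # ~1.91 ms for 1 s of audio
print(round(transcription_time_s(3600), 1))        # ~6.9 s for a 1-hour file
```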
Usage
Install:
```shell
pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf
```
Basic transcription:
```python
from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained(
    "CohereLabs/cohere-transcribe-03-2026",
    device_map="auto"
)

audio = load_audio("audio.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)
```
Long-form audio (automatic chunking):
```python
from datasets import load_dataset
import time

ds = load_dataset("distil-whisper/earnings22", "full", split="test", streaming=True)
sample = next(iter(ds))
audio_array = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]
duration_s = len(audio_array) / sr

inputs = processor(audio=audio_array, sampling_rate=sr, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index)[0]
elapsed = time.time() - start

print(f"Transcribed {duration_s:.0f}s of audio in {elapsed:.1f}s (RTFx: {duration_s/elapsed:.0f}x)")
print(text)
```
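The processor handles chunking internally, but conceptually it amounts to slicing the waveform into overlapping fixed-length windows. A sketch with assumed window and overlap values; the model's actual window length, overlap, and chunk-merging logic may differ.

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sr: int,
                window_s: float = 30.0, overlap_s: float = 1.0) -> list:
    """Slice a waveform into overlapping fixed-length windows.

    Illustrative only: window_s and overlap_s are assumed values, not the
    processor's real internals. The final chunk may be shorter than window_s.
    """
    window = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    return [samples[i:i + window] for i in range(0, len(samples), step)]
```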
Batched inference:
```python
from transformers.audio_utils import load_audio

audio_short = load_audio("short.mp3", sampling_rate=16000)
audio_long = load_audio("long.mp3", sampling_rate=16000)

inputs = processor(
    [audio_short, audio_long],
    sampling_rate=16000,
    return_tensors="pt",
    language="en"
)
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
texts = processor.decode(outputs, skip_special_tokens=True,
                         audio_chunk_index=audio_chunk_index, language="en")
print(texts)
```
Punctuation control:
```python
# With punctuation (default)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt",
                   language="en", punctuation=True)

# Without punctuation (lowercase, no punctuation marks — useful for downstream NLP)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt",
                   language="en", punctuation=False)
```
Optimized throughput with compilation:
```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "CohereLabs/cohere-transcribe-03-2026"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, trust_remote_code=True).cuda().eval()

texts = model.transcribe(
    processor=processor,
    audio_arrays=[audio_array],
    sample_rates=[sr],
    language="en",
    compile=True,  # torch.compile for higher throughput
    pipeline_detokenization=True,
    batch_size=16
)
print(texts[0])
```
Production deployment with vLLM
```shell
# Install
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
uv pip install "vllm[audio]" librosa

# Start server
vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code

# Send a request
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -F "file=@audio.wav" \
  -F "model=CohereLabs/cohere-transcribe-03-2026"
```
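The same request can be issued from Python. This sketch uses `requests` to build (without sending) the multipart call so its shape is visible; the localhost base URL and the `EMPTY` API-key default are local-deployment assumptions.

```python
import requests

def build_transcription_request(audio_bytes: bytes, filename: str = "audio.wav",
                                base_url: str = "http://localhost:8000",
                                api_key: str = "EMPTY") -> requests.PreparedRequest:
    """Build the multipart POST matching the curl call above (not sent here)."""
    return requests.Request(
        "POST",
        f"{base_url}/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"file": (filename, audio_bytes, "audio/wav")},
        data={"model": "CohereLabs/cohere-transcribe-03-2026"},
    ).prepare()

req = build_transcription_request(b"\x00\x01")
# To actually send it: resp = requests.Session().send(req); print(resp.json()["text"])
```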
Ecosystem
| Platform | Status |
|---|---|
| Hugging Face Transformers | Native support |
| vLLM | Production serving |
| mlx-audio | Apple Silicon |
| Rust | cohere_transcribe_rs |
| Browser | transformers.js + WebGPU |
| Chrome extension | cohere_transcribe_extension |
| iOS | Whisper Memos |
18 quantized variants are also available on the Hub.
Limitations
- No automatic language detection — you must specify the language code upfront; the model will not switch languages mid-audio.
- No timestamps or speaker diarization — if you need word-level timestamps or who-said-what, you’ll need a separate pipeline.
- Silence handling — the model may attempt to transcribe non-speech sounds; a VAD (voice activity detection) preprocessing step is recommended for noisy environments.
- Code-switching — inconsistent on audio that switches between languages within the same utterance.
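The VAD preprocessing suggested above can start as simple as an energy gate. A rough sketch: the frame size and threshold are arbitrary assumptions, and trained VADs (e.g. Silero) are the usual production choice.

```python
import numpy as np

def energy_vad(samples: np.ndarray, sr: int,
               frame_ms: int = 30, threshold: float = 0.01) -> np.ndarray:
    """Naive per-frame speech gate: True where frame RMS exceeds a threshold."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return rms > threshold
```

Frames flagged False can be dropped or zeroed before the audio reaches the processor, reducing hallucinated transcripts on silence and noise.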
Conclusion
Cohere Transcribe 03-2026 makes a clear case on benchmarks: #1 WER on the English ASR leaderboard, 3x faster than comparable models, under Apache 2.0. For teams building transcription pipelines — meeting notes, call center analytics, subtitle generation — this is now the strongest open-weights option at any size.
The automatic chunking for long-form audio, punctuation control, and broad ecosystem support (vLLM, Apple Silicon, browser, mobile) make it practical across a wide range of deployment scenarios.
Tags:
- AI
- Cohere
- ASR
- Speech Recognition
- Audio
- Open Source