Cohere Transcribe: a 2B ASR model that tops the English leaderboard

Cohere Labs' Transcribe 03-2026 is a 2B Conformer-based ASR model ranked #1 on the English ASR leaderboard with a 5.42 average WER, supporting 14 languages at 524x real-time speed.

What is Cohere Transcribe?

Cohere Transcribe 03-2026 is an automatic speech recognition (ASR) model released by Cohere Labs. With 2B parameters, it ranks #1 on the English ASR leaderboard as of March 2026, achieving a 5.42 average Word Error Rate (WER) across 8 benchmarks while running at 524x real-time speed (RTFx), roughly 3x faster than comparable models.

It supports 14 languages, handles long-form audio through automatic chunking, and is available under the Apache 2.0 license.


Architecture

  • Conformer encoder: combines convolutional and self-attention layers, making it effective for capturing both local acoustic features and long-range temporal dependencies.
  • Transformer decoder: a lightweight design keeps the model fast while maintaining text quality.
  • Training: trained from scratch using supervised cross-entropy; no Whisper-based distillation.
  • Total size: 2B parameters.

Language support

14 languages:

Region        Languages
Europe        English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish
Asia-Pacific  Chinese (Mandarin), Japanese, Korean, Vietnamese
MENA          Arabic

Note: the language must be specified explicitly; there is no automatic language detection.
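
Because the language has to be supplied by the caller, it can help to validate it before building the processor inputs. The following is a minimal sketch, assuming the model accepts ISO 639-1 codes. Only "en" appears in the examples in this article; the remaining codes and the helper name resolve_language are illustrative assumptions, not part of the model's API.

```python
# Hypothetical helper: maps a language name to an ASSUMED ISO 639-1 code.
# Only "en" is confirmed by the examples in this article; check the model
# card for the exact codes the processor accepts.
SUPPORTED_LANGUAGES = {
    "english": "en", "french": "fr", "german": "de", "italian": "it",
    "spanish": "es", "portuguese": "pt", "greek": "el", "dutch": "nl",
    "polish": "pl", "chinese": "zh", "japanese": "ja", "korean": "ko",
    "vietnamese": "vi", "arabic": "ar",
}


def resolve_language(language: str) -> str:
    """Return the assumed language code, or raise if the language is unsupported."""
    key = language.strip().lower()
    if key in SUPPORTED_LANGUAGES:          # full name, e.g. "English"
        return SUPPORTED_LANGUAGES[key]
    if key in SUPPORTED_LANGUAGES.values():  # already a code, e.g. "fr"
        return key
    raise ValueError(f"Unsupported language: {language!r}")
```

Failing early with a ValueError is cheaper than discovering a wrong language setting after a long transcription run.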


Benchmark results

English ASR leaderboard: #1 overall (March 2026)

Model              Avg WER ↓  AMI   Earnings22  Gigaspeech  LS Clean  LS Other  SPGISpeech  TedLium  VoxPopuli
Cohere Transcribe  5.42       8.15  10.84       9.33        1.25      2.37      3.08        2.49     5.87

Lower WER is better. The model leads on 3 of 8 benchmarks and takes the #1 overall average.
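
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain implementation of that definition, plus a check that the eight per-benchmark scores above average to the reported figure. Leaderboards usually normalize casing and punctuation before scoring; this sketch skips that step, so it is an illustration rather than the leaderboard's scoring code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Per-benchmark WERs copied from the table above
scores = [8.15, 10.84, 9.33, 1.25, 2.37, 3.08, 2.49, 5.87]
average_wer = sum(scores) / len(scores)
print(f"{average_wer:.2f}")  # 5.42
```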

Throughput:

Metric                           Value
RTFx (inverse real-time factor)  524.88
Speed vs comparable models       ~3x faster

RTFx of 524.88 means 1 second of audio is transcribed in ~1.9 milliseconds.
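
The same arithmetic extends to longer recordings: processing time is audio duration divided by RTFx. A small sketch follows, with the caveat that the 524.88 figure is a benchmark number and real throughput depends on hardware and batch size.

```python
RTFX = 524.88  # throughput figure from the table above


def processing_seconds(audio_seconds: float, rtfx: float = RTFX) -> float:
    """Estimated wall-clock time to transcribe audio at a given RTFx."""
    return audio_seconds / rtfx


print(f"{processing_seconds(1) * 1000:.2f} ms per second of audio")  # 1.91 ms
print(f"{processing_seconds(3600):.2f} s per hour of audio")         # 6.86 s
```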


Usage

Install:

pip install "transformers>=5.4.0" torch huggingface_hub soundfile librosa sentencepiece protobuf

Basic transcription:

from transformers import AutoProcessor, CohereAsrForConditionalGeneration
from transformers.audio_utils import load_audio

# Load the processor and the model weights
processor = AutoProcessor.from_pretrained("CohereLabs/cohere-transcribe-03-2026")
model = CohereAsrForConditionalGeneration.from_pretrained(
    "CohereLabs/cohere-transcribe-03-2026",
    device_map="auto"
)

# Load the audio at 16 kHz and pass the language explicitly (no auto-detection)
audio = load_audio("audio.wav", sampling_rate=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", language="en")
inputs.to(model.device, dtype=model.dtype)

# Generate token IDs, then decode them to text
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True)
print(text)

Long-form audio (automatic chunking):

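# Reuses `processor` and `model` from the basic transcription example.
# Requires the `datasets` package (pip install datasets).
from datasets import load_dataset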
import time

ds = load_dataset("distil-whisper/earnings22", "full", split="test", streaming=True)
sample = next(iter(ds))

audio_array = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]
duration_s = len(audio_array) / sr

inputs = processor(audio=audio_array, sampling_rate=sr, return_tensors="pt", language="en")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(outputs, skip_special_tokens=True, audio_chunk_index=audio_chunk_index)[0]
elapsed = time.time() - start

print(f"Transcribed {duration_s:.0f}s of audio in {elapsed:.1f}s (RTFx: {duration_s/elapsed:.0f}x)")
print(text)
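
The processor does the chunking internally, so no manual splitting is needed. To make the idea concrete, the sketch below shows one common way to split a long recording into fixed-length, slightly overlapping windows. It is not the processor's actual algorithm: the function name chunk_bounds, the 30-second window, and the 1-second overlap are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def chunk_bounds(num_samples: int, sample_rate: int,
                 chunk_s: float = 30.0, overlap_s: float = 1.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of fixed-length, overlapping windows."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # last window reached the end of the audio
            break
        start += step
    return bounds


# 70 s of 16 kHz audio is covered by three windows
print(chunk_bounds(70 * 16000, 16000))
```

Each (start, end) pair can be used to slice the audio array; the overlap gives a decoder some shared context at the seams, at the cost of having to merge duplicated words afterwards.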

Batched inference:

# Reuses `processor` and `model` from the basic transcription example
from transformers.audio_utils import load_audio

audio_short = load_audio("short.mp3", sampling_rate=16000)
audio_long = load_audio("long.mp3", sampling_rate=16000)

inputs = processor(
    [audio_short, audio_long],
    sampling_rate=16000,
    return_tensors="pt",
    language="en"
)
audio_chunk_index = inputs.get("audio_chunk_index")
inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, max_new_tokens=256)
texts = processor.decode(outputs, skip_special_tokens=True,
                         audio_chunk_index=audio_chunk_index, language="en")
print(texts)

Punctuation control:

# With punctuation (default)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt",
                   language="en", punctuation=True)

# Without punctuation (lowercase, no punctuation marks; useful for downstream NLP)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt",
                   language="en", punctuation=False)

Optimized throughput with compilation:

import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "CohereLabs/cohere-transcribe-03-2026"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, trust_remote_code=True).cuda().eval()

# `audio_array` and `sr` come from the long-form example above
texts = model.transcribe(
    processor=processor,
    audio_arrays=[audio_array],
    sample_rates=[sr],
    language="en",
    compile=True,              # torch.compile for higher throughput
    pipeline_detokenization=True,
    batch_size=16
)
print(texts[0])

Production deployment with vLLM

# Install
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
uv pip install "vllm[audio]" librosa

# Start server
vllm serve CohereLabs/cohere-transcribe-03-2026 --trust-remote-code

# Send a request
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -F "file=@audio.wav" \
  -F "model=CohereLabs/cohere-transcribe-03-2026"
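
The same request can be sent from Python. The sketch below mirrors the curl command using only the standard library. It assumes the server returns an OpenAI-style JSON body with a "text" field; the helper names build_multipart and transcribe are hypothetical, introduced here for illustration.

```python
import json
import os
import urllib.request
import uuid

MODEL_ID = "CohereLabs/cohere-transcribe-03-2026"


def build_multipart(fields: dict, filename: str, file_bytes: bytes) -> tuple:
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f'--{boundary}\r\n'
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f'{value}\r\n').encode()
        )
    parts.append(
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost:8000") -> str:
    """POST an audio file to the running server and return the transcript."""
    with open(path, "rb") as f:
        body, content_type = build_multipart(
            {"model": MODEL_ID}, os.path.basename(path), f.read()
        )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={
            "Content-Type": content_type,
            "Authorization": f"Bearer {os.environ.get('VLLM_API_KEY', '')}",
        },
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the response JSON carries the transcript under "text"
        return json.load(response)["text"]
```

With the server from the previous step running, transcribe("audio.wav") should return the transcript as a string.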

Ecosystem

Platform                   Status
Hugging Face Transformers  Native support
vLLM                       Production serving
mlx-audio                  Apple Silicon
Rust                       cohere_transcribe_rs
Browser                    transformers.js + WebGPU
Chrome extension           cohere_transcribe_extension
iOS                        Whisper Memos

18 quantized variants are also available on the Hub.


Limitations

  • No automatic language detection: you must specify the language code upfront; the model will not switch languages mid-audio.
  • No timestamps or speaker diarization: if you need word-level timestamps or who-said-what, you'll need a separate pipeline.
  • Silence handling: the model may attempt to transcribe non-speech sounds; a VAD (voice activity detection) preprocessing step is recommended for noisy environments.
  • Code-switching: results are inconsistent on audio that switches between languages within the same utterance.
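
To illustrate the VAD preprocessing step mentioned above, here is a minimal energy-threshold sketch. It is a deliberately simple stand-in, not a recommended detector: the function name speech_segments, the 30 ms frame, and the -40 dBFS threshold are assumptions, and a trained VAD model will usually cope far better with genuinely noisy recordings. Samples are expected as floats in the range -1.0 to 1.0.

```python
import math
from typing import List, Sequence, Tuple


def speech_segments(samples: Sequence[float], sample_rate: int,
                    frame_ms: int = 30,
                    threshold_db: float = -40.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose frame RMS level exceeds threshold_db."""
    frame = max(1, sample_rate * frame_ms // 1000)
    segments = []
    seg_start = None
    for start in range(0, len(samples), frame):
        window = samples[start:start + frame]
        rms = math.sqrt(sum(x * x for x in window) / len(window))
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db > threshold_db:
            if seg_start is None:       # a loud frame opens a new segment
                seg_start = start
        elif seg_start is not None:     # a quiet frame closes the open segment
            segments.append((seg_start, start))
            seg_start = None
    if seg_start is not None:           # audio ended while still in a segment
        segments.append((seg_start, len(samples)))
    return segments
```

Only the returned ranges would then be passed to the model, so long stretches of silence or low-level background noise never reach the decoder.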

Conclusion

Cohere Transcribe 03-2026 makes a clear case on benchmarks: #1 average WER on the English ASR leaderboard, roughly 3x faster than comparable models, and released under Apache 2.0. For teams building transcription pipelines (meeting notes, call center analytics, subtitle generation), the leaderboard results make it the strongest open-weights option at any size as of March 2026.

The automatic chunking for long-form audio, punctuation control, and broad ecosystem support (vLLM, Apple Silicon, browser, mobile) make it practical across a wide range of deployment scenarios.

Model: CohereLabs/cohere-transcribe-03-2026