LFM2.5-VL-450M: Liquid AI's 450M vision model that runs in a browser
- Bastien
- 17 Apr, 2026
What is LFM2.5-VL-450M?
Most vision-language models compete on scale: billions of parameters and datacenter GPUs just to serve them. Liquid AI takes the opposite approach. LFM2.5-VL-450M is a 450M-parameter multimodal model that understands images and text across 9 languages, predicts bounding boxes, supports function calling, and runs real-time video captioning directly in a browser via WebGPU.
At 0.45B parameters, it is roughly 500x smaller than frontier models like MiniMax-M2.7 or GPT-5. Yet it outperforms SmolVLM2-500M on nearly every benchmark and introduces capabilities, such as visual grounding and tool use, that models 10x its size often lack.
Architecture
LFM2.5-VL-450M combines two components:
- LFM2.5-350M — a 350M dense Transformer serving as the language backbone (32K context window, 65K vocabulary)
- SigLIP2 NaFlex — an 86M shape-optimized vision encoder that processes images at native resolution up to 512×512 without upscaling or aspect ratio distortion
For larger images, the encoder uses an adaptive tiling strategy: non-overlapping 512×512 patches with a thumbnail encoding for global context. Users can tune the quality/speed tradeoff at inference time by adjusting max_image_tokens (32–256) and tile count — no retraining required.
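To make the arithmetic concrete, here is a tiny helper that estimates how many encoder passes an image triggers under this scheme. It only illustrates the tiling math described above; the processor's actual splitting policy (padding, resizing, thumbnail threshold) may differ.

```python
import math

def encoder_passes(width: int, height: int, tile: int = 512) -> int:
    """Estimate encoder passes for one image: non-overlapping tile x tile
    patches plus one thumbnail pass for global context when splitting occurs.
    Illustrative only; the real processor may pad or resize before tiling."""
    if width <= tile and height <= tile:
        return 1  # fits at native resolution, no tiling needed
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return cols * rows + 1  # tiles + global thumbnail

print(encoder_passes(512, 512))    # 1
print(encoder_passes(1024, 1024))  # 5: four tiles plus a thumbnail
print(encoder_passes(1600, 900))   # 9: a 4x2 grid plus a thumbnail
```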
The model ships in BF16 (safetensors), alongside GGUF and ONNX conversions and multiple MLX quantizations (4-bit through BF16) for Apple Silicon.
Benchmark results
Vision understanding
| Benchmark | LFM2.5-VL-450M | LFM2-VL-450M | SmolVLM2-500M |
|---|---|---|---|
| MMStar | 43.0 | 40.9 | 38.2 |
| RealWorldQA | 58.4 | 52.0 | 49.9 |
| MMBench (dev en) | 60.9 | 56.3 | 52.3 |
| POPE | 86.9 | 83.8 | 82.7 |
| MMVet | 41.1 | 33.9 | 29.9 |
| OCRBench | 684 | 657 | 609 |
| MM-IFEval | 45.0 | 33.1 | 11.3 |
| CountBench | 73.3 | 47.6 | 61.8 |
| RefCOCO-M | 81.3 | — | — |
LFM2.5-VL leads on every vision benchmark above; the one exception (not shown in the table) is MMMU, a knowledge-intensive benchmark where larger models have a structural advantage, at 32.7 vs SmolVLM2's 34.1. The MM-IFEval jump from 33.1 to 45.0 reflects significantly better instruction following on visual tasks.
RefCOCO-M at 81.3 is a new capability: bounding box prediction for visual grounding, not available in the previous LFM2 generation.
Multilingual vision (MMMB)
| LFM2.5-VL-450M | LFM2-VL-450M | SmolVLM2-500M |
|---|---|---|
| 68.1 | 54.3 | 46.8 |
MMMB averages vision understanding across 8 languages (Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Spanish). The +13.8 point improvement over the previous version is the largest single-benchmark gain.
Language and tool use
| Benchmark | LFM2.5-VL-450M | LFM2-VL-450M | SmolVLM2-500M |
|---|---|---|---|
| MMLU Pro | 19.3 | 17.2 | 13.6 |
| IFEval | 61.2 | 51.8 | 30.1 |
| Multi-IF | 34.6 | 26.2 | 6.8 |
| BFCLv4 | 21.1 | — | — |
IFEval at 61.2 — double SmolVLM2’s score — shows that instruction following scales with training quality, not just parameter count. BFCLv4 is a new function calling benchmark; LFM2-VL did not support tool use at all.
Key capabilities
Visual grounding — LFM2.5-VL can predict bounding boxes in normalized [0,1] coordinates, returned as JSON arrays. This enables object detection workflows without a separate detection model.
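Since the boxes come back as JSON in normalized coordinates, mapping them onto the original image takes only a few lines. The sketch below assumes each box is an [x_min, y_min, x_max, y_max] quadruple; check the model card for the exact output schema.

```python
import json

def boxes_to_pixels(response_text: str, width: int, height: int):
    """Convert normalized [0, 1] bounding boxes from a grounding response
    into pixel coordinates. Assumes a JSON array of
    [x_min, y_min, x_max, y_max] boxes; adapt if the schema differs."""
    pixel_boxes = []
    for x_min, y_min, x_max, y_max in json.loads(response_text):
        pixel_boxes.append((
            round(x_min * width),
            round(y_min * height),
            round(x_max * width),
            round(y_max * height),
        ))
    return pixel_boxes

# Hypothetical grounding answer for a 1024x768 image:
print(boxes_to_pixels("[[0.12, 0.30, 0.58, 0.91]]", 1024, 768))
# -> [(123, 230, 594, 699)]
```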
Function calling — text-only tool use in ChatML format with <|tool_call_start|> / <|tool_call_end|> tokens. The model can decide when to call functions and format arguments correctly.
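Parsing a tool call out of a completion is then a matter of finding the delimiter tokens. The sketch below assumes the payload between them is JSON with name and arguments fields; the model card should be checked for the exact payload format (some ChatML-style models emit a Pythonic call string instead).

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL)

def extract_tool_calls(generated_text: str) -> list[dict]:
    """Pull tool calls out of a completion delimited by the
    <|tool_call_start|> / <|tool_call_end|> tokens. Assumes each call
    body is JSON; adjust the parsing if the model emits another format."""
    return [json.loads(m.group(1)) for m in TOOL_CALL_RE.finditer(generated_text)]

# Hypothetical completion:
completion = (
    "Let me look that up."
    '<|tool_call_start|>{"name": "get_weather", "arguments": {"city": "Paris"}}<|tool_call_end|>'
)
print(extract_tool_calls(completion))
# -> [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```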
Multilingual vision — 9-language support is not bolted on: the model was trained with multilingual vision understanding as a first-class objective, scoring 68.1 on MMMB (vs 46.8 for SmolVLM2).
Inference-time flexibility — min_image_tokens and max_image_tokens let users trade quality for speed without retraining. A mobile deployment can use 32 tokens per image; a desktop pipeline can use 256.
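As an illustration, the two profiles might look like this with the Transformers processor. The min_image_tokens / max_image_tokens names follow the article; whether they are set at processor construction or per call should be verified against the model card.

```python
from transformers import AutoProcessor

MODEL_ID = "LiquidAI/LFM2.5-VL-450M"

# Latency-sensitive (mobile) profile: few image tokens, fast prefill.
mobile_processor = AutoProcessor.from_pretrained(
    MODEL_ID, min_image_tokens=32, max_image_tokens=64
)

# Quality-first (desktop) profile: up to 256 tokens per image.
desktop_processor = AutoProcessor.from_pretrained(
    MODEL_ID, min_image_tokens=64, max_image_tokens=256
)
```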
Deployment
LFM2.5-VL-450M is designed to run everywhere:
| Framework | Use case |
|---|---|
| Transformers | Simple inference, fine-tuning |
| vLLM | High-throughput GPU production |
| SGLang | High-throughput GPU production |
| llama.cpp | CPU inference, local deployment |
| ONNX Runtime | Cross-platform, hardware-accelerated |
| MLX | Apple Silicon (4-bit through BF16) |
| WebGPU | Browser-based, real-time video captioning |
The WebGPU demo runs real-time video stream captioning entirely in the browser: no server, no API calls, no dedicated GPU required.
Default generation parameters: temperature=0.1, min_p=0.15, repetition_penalty=1.05.
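A minimal end-to-end sketch with Transformers using those defaults. It assumes the checkpoint exposes the standard image-text-to-text interface (AutoModelForImageTextToText plus AutoProcessor), as the previous LFM2-VL release does; follow the model card if the API differs.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "LiquidAI/LFM2.5-VL-450M"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# One image plus one question, routed through the chat template.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("photo.jpg")},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Defaults recommended above.
output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```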
Fine-tuning is supported via LoRA with both Unsloth and TRL.
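A hedged sketch of the PEFT side of such a setup. The target module names are illustrative and must be matched against the actual layer names in the checkpoint (model.named_modules() lists them).

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Illustrative LoRA setup; target_modules must match the real projection
# layer names in the checkpoint (inspect model.named_modules() to confirm).
model = AutoModelForImageTextToText.from_pretrained(
    "LiquidAI/LFM2.5-VL-450M", torch_dtype=torch.bfloat16
)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# The wrapped model can then be handed to TRL's SFTTrainer with a
# multimodal chat dataset, or trained through Unsloth's equivalent API.
```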
Limitations
At 450M parameters, LFM2.5-VL is not suited for knowledge-intensive tasks — MMMU at 32.7 and MMLU Pro at 19.3 confirm this. It is a perception and instruction-following model, not a reasoning one.
Fine-grained OCR is acknowledged as a limitation despite the OCRBench score of 684. Function calling is text-only — tool use does not support vision input. Image processing is capped at 512×512 per tile, with larger images split into patches.
Conclusion
LFM2.5-VL-450M proves that useful vision-language capabilities do not require billions of parameters. Visual grounding, function calling, 9-language support, and real-time browser inference, all in a model that fits in 900MB of VRAM, make it the most deployment-flexible VLM available today.
For edge applications, mobile deployments, browser-based tools, or any scenario where a 70B model is impractical, LFM2.5-VL fills a gap that larger models cannot reach by design.
Model: LiquidAI/LFM2.5-VL-450M · Paper: arxiv.org/abs/2511.23404
Tags:
- AI
- Liquid AI
- VLM
- Vision
- Edge AI
- Open Source