Qianfan-OCR: #1 on OmniDocBench, 192 languages, Layout-as-Thought

What is Qianfan-OCR?

Qianfan-OCR is a document understanding model released by Baidu. It converts images of documents — PDFs, scans, photos, screenshots — directly into structured Markdown, JSON, or HTML, handling everything from simple text to complex tables, mathematical formulas, and charts.

It ranks #1 on OmniDocBench v1.5, the main benchmark for end-to-end document parsing, with an overall score of 93.12 — ahead of DeepSeek-OCR-v2 (91.09) and Gemini 3 Pro (90.33).

The key architectural decision is doing this in a single model rather than chaining separate OCR, layout, and comprehension modules. That means no information is lost between pipeline stages — which matters most for complex documents like charts, exam papers, or multi-column layouts.

Architecture

The model has three components:

Qianfan-ViT — a vision encoder with AnyResolution design. It tiles the input image into 448×448 patches and processes up to 4,096 visual tokens, supporting images up to 4K resolution without losing detail on dense documents.
Qwen3-4B — the language backbone from Alibaba's Qwen3 series, with Grouped Query Attention (32 query heads / 8 KV heads) and a 32K context window extendable to 131K.
Cross-Modal Adapter — a 2-layer MLP that bridges the vision encoder's 1024-dim output to the language model's 2560-dim input.

Total: ~4B parameters (5B including embeddings).

Layout-as-Thought

The most novel feature of Qianfan-OCR is Layout-as-Thought — a thinking mode that recovers structural analysis within an end-to-end model.

When enabled (by appending <think> to the prompt), the model first generates a structured representation of the page — bounding boxes, element types across 25 categories, and reading order — before producing the final output. This is the kind of information that traditional multi-stage pipelines compute separately at each step, but here it happens inside a single forward pass.

When to use thinking mode:

Document type	Recommendation
Exam papers, technical reports, newspapers	Enable (`<think>`)
Academic papers with equations and tables	Enable (`<think>`)
Single-column text, simple forms	Disable — better results without
Receipts and invoices	Disable

Supported tasks

25 element types across 9 task categories:

Category	Details
Document parsing	Image → Markdown, multi-page, JSON/HTML output
Layout analysis	Bounding boxes, 25 element types, reading order
Table recognition	Merged cells, rotated tables, HTML output
Formula recognition	Inline and display math, LaTeX output
Chart understanding	QA, trend analysis, data extraction
Key information extraction	Invoices, receipts, ID cards, medical records
Handwriting	Chinese and English
Scene text	Street signs, labels, natural images
Multilingual OCR	192 languages (Latin, Cyrillic, Arabic, CJK, and more)

Benchmark results

OmniDocBench v1.5 — #1 overall

Model	Overall	Text edit dist.	Formula	Table TEDs	Table TEDss	Read order
Qianfan-OCR	93.12	0.041	92.43	91.02	93.85	0.049
DeepSeek-OCR-v2	91.09	0.048	90.31	87.75	92.06	0.057
Gemini 3 Pro	90.33	0.065	89.18	88.28	90.29	0.071

Key information extraction — #1 (mean over 5 benchmarks)

Model	KIE mean score
Qianfan-OCR	87.9
Gemini 3.1 Pro	lower
Qwen3-VL-235B-A22B	lower

Document and chart understanding

Benchmark	Score
DocVQA	92.8
CharXiv DQ	94.0
CharXiv RQ	85.2
ChartQA	88.1
ChartQAPro	42.9
ChartBench	85.9

General OCR

Benchmark	Score
OCRBench	880
OCRBench v2 (EN)	56.0
OCRBench v2 (ZH)	60.77
CCOCR multilingual	76.7

Inference performance

Measured on a single NVIDIA A100:

Precision	Throughput
W16A16 (full BF16)	0.503 pages/second
W8A8 (quantized)	1.024 pages/second

W8A8 quantization roughly doubles throughput — useful for production pipelines processing large document volumes.

Deploy with vLLM for high-throughput serving:

vllm serve baidu/Qianfan-OCR --trust-remote-code

Or use the Transformers API directly:

import torch
from transformers import AutoModel, AutoTokenizer
from PIL import Image

model = AutoModel.from_pretrained(
    "baidu/Qianfan-OCR",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained("baidu/Qianfan-OCR", trust_remote_code=True)

# Parse a document to Markdown
pixel_values = load_image("./document.png").to(torch.bfloat16).to(model.device)
response = model.chat(tokenizer, pixel_values=pixel_values,
                      question="Parse this document to Markdown.",
                      generation_config={"max_new_tokens": 16384})

# With Layout-as-Thought for complex layouts
response = model.chat(tokenizer, pixel_values=pixel_values,
                      question="Parse this document to Markdown.<think>",
                      generation_config={"max_new_tokens": 16384})

# Key information extraction as JSON
response = model.chat(tokenizer, pixel_values=pixel_values,
                      question="Extract name, date, and total amount as JSON.",
                      generation_config={"max_new_tokens": 16384})

Model ecosystem

Limitations

Trust remote code required — trust_remote_code=True needed for both model and tokenizer; audit the code before deploying in sensitive environments
Chart parsing quality — ChartQAPro score (42.9) indicates complex chart understanding still has room to grow
No built-in PDF support — you need to convert PDF pages to images first
Context window — default 32K, extendable to 131K; very long documents need splitting
License — not specified in the model card; check the repository for terms before commercial use

Conclusion

Qianfan-OCR's strongest argument is its benchmark position: it beats both a frontier closed model (Gemini 3 Pro) and the best open alternative (DeepSeek-OCR-v2) on the main document parsing benchmark, with a single 4B model. The Layout-as-Thought mechanism is a pragmatic solution to the classic tension between end-to-end simplicity and structured layout analysis.

For developers building document pipelines — invoice processing, exam grading, report extraction, or RAG over scanned documents — this is the strongest open-weights option available at this size.

Model: baidu/Qianfan-OCR

Qianfan-OCR: Baidu's 4B model that beats Gemini on document parsing