>_Reeboot
Qianfan-OCR: Baidu's 4B model that beats Gemini on document parsing
AI

Qianfan-OCR: Baidu's 4B model that beats Gemini on document parsing

Baidu's Qianfan-OCR is a 4B end-to-end document understanding model that ranks #1 on OmniDocBench v1.5 — beating Gemini 3 Pro and DeepSeek-OCR-v2 on tables, formulas, layout, and key information extra

What is Qianfan-OCR?

Qianfan-OCR is a document understanding model released by Baidu. It converts images of documents — PDFs, scans, photos, screenshots — directly into structured Markdown, JSON, or HTML, handling everything from simple text to complex tables, mathematical formulas, and charts.

It ranks #1 on OmniDocBench v1.5, the main benchmark for end-to-end document parsing, with an overall score of 93.12 — ahead of DeepSeek-OCR-v2 (91.09) and Gemini 3 Pro (90.33).

The key architectural decision is doing this in a single model rather than chaining separate OCR, layout, and comprehension modules. That means no information is lost between pipeline stages — which matters most for complex documents like charts, exam papers, or multi-column layouts.


Architecture

The model has three components:

  • Qianfan-ViT — a vision encoder with AnyResolution design. It tiles the input image into 448×448 patches and processes up to 4,096 visual tokens, supporting images up to 4K resolution without losing detail on dense documents.
  • Qwen3-4B — the language backbone from Alibaba's Qwen3 series, with Grouped Query Attention (32 query heads / 8 KV heads) and a 32K context window extendable to 131K.
  • Cross-Modal Adapter — a 2-layer MLP that bridges the vision encoder's 1024-dim output to the language model's 2560-dim input.

Total: ~4B parameters (5B including embeddings).


Layout-as-Thought

The most novel feature of Qianfan-OCR is Layout-as-Thought — a thinking mode that recovers structural analysis within an end-to-end model.

When enabled (by appending <think> to the prompt), the model first generates a structured representation of the page — bounding boxes, element types across 25 categories, and reading order — before producing the final output. This is the kind of information that traditional multi-stage pipelines compute separately at each step, but here it happens inside a single forward pass.

When to use thinking mode:

Document type Recommendation
Exam papers, technical reports, newspapers Enable (<think>)
Academic papers with equations and tables Enable (<think>)
Single-column text, simple forms Disable — better results without
Receipts and invoices Disable

Supported tasks

25 element types across 9 task categories:

Category Details
Document parsing Image → Markdown, multi-page, JSON/HTML output
Layout analysis Bounding boxes, 25 element types, reading order
Table recognition Merged cells, rotated tables, HTML output
Formula recognition Inline and display math, LaTeX output
Chart understanding QA, trend analysis, data extraction
Key information extraction Invoices, receipts, ID cards, medical records
Handwriting Chinese and English
Scene text Street signs, labels, natural images
Multilingual OCR 192 languages (Latin, Cyrillic, Arabic, CJK, and more)

Benchmark results

OmniDocBench v1.5 — #1 overall

Model Overall Text edit dist. Formula Table TEDs Table TEDss Read order
Qianfan-OCR 93.12 0.041 92.43 91.02 93.85 0.049
DeepSeek-OCR-v2 91.09 0.048 90.31 87.75 92.06 0.057
Gemini 3 Pro 90.33 0.065 89.18 88.28 90.29 0.071

Key information extraction — #1 (mean over 5 benchmarks)

Model KIE mean score
Qianfan-OCR 87.9
Gemini 3.1 Pro lower
Qwen3-VL-235B-A22B lower

Document and chart understanding

Benchmark Score
DocVQA 92.8
CharXiv DQ 94.0
CharXiv RQ 85.2
ChartQA 88.1
ChartQAPro 42.9
ChartBench 85.9

General OCR

Benchmark Score
OCRBench 880
OCRBench v2 (EN) 56.0
OCRBench v2 (ZH) 60.77
CCOCR multilingual 76.7

Inference performance

Measured on a single NVIDIA A100:

Precision Throughput
W16A16 (full BF16) 0.503 pages/second
W8A8 (quantized) 1.024 pages/second

W8A8 quantization roughly doubles throughput — useful for production pipelines processing large document volumes.

Deploy with vLLM for high-throughput serving:

vllm serve baidu/Qianfan-OCR --trust-remote-code

Or use the Transformers API directly:

import torch
from transformers import AutoModel, AutoTokenizer
from PIL import Image

model = AutoModel.from_pretrained(
    "baidu/Qianfan-OCR",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained("baidu/Qianfan-OCR", trust_remote_code=True)

# Parse a document to Markdown
pixel_values = load_image("./document.png").to(torch.bfloat16).to(model.device)
response = model.chat(tokenizer, pixel_values=pixel_values,
                      question="Parse this document to Markdown.",
                      generation_config={"max_new_tokens": 16384})

# With Layout-as-Thought for complex layouts
response = model.chat(tokenizer, pixel_values=pixel_values,
                      question="Parse this document to Markdown.<think>",
                      generation_config={"max_new_tokens": 16384})

# Key information extraction as JSON
response = model.chat(tokenizer, pixel_values=pixel_values,
                      question="Extract name, date, and total amount as JSON.",
                      generation_config={"max_new_tokens": 16384})

Model ecosystem


Limitations

  • Trust remote code requiredtrust_remote_code=True needed for both model and tokenizer; audit the code before deploying in sensitive environments
  • Chart parsing quality — ChartQAPro score (42.9) indicates complex chart understanding still has room to grow
  • No built-in PDF support — you need to convert PDF pages to images first
  • Context window — default 32K, extendable to 131K; very long documents need splitting
  • License — not specified in the model card; check the repository for terms before commercial use

Conclusion

Qianfan-OCR's strongest argument is its benchmark position: it beats both a frontier closed model (Gemini 3 Pro) and the best open alternative (DeepSeek-OCR-v2) on the main document parsing benchmark, with a single 4B model. The Layout-as-Thought mechanism is a pragmatic solution to the classic tension between end-to-end simplicity and structured layout analysis.

For developers building document pipelines — invoice processing, exam grading, report extraction, or RAG over scanned documents — this is the strongest open-weights option available at this size.

Model: baidu/Qianfan-OCR