Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents
Summary
Granite 4.0 3B Vision is a compact multimodal model designed for high-precision extraction of structured data from enterprise documents. Released as a LoRA adapter for the Granite 4.0 Micro language model, it specializes in parsing complex tables, interpreting charts, and extracting semantic key-value pairs.
Key Points
- Operates as a modular LoRA adapter on top of the dense Granite 4.0 Micro language model, enabling seamless fallback to text-only inference.
- Achieves an 86.4% score on the Chart2Summary benchmark and 62.1% on Chart2CSV.
- Demonstrates leading performance in table extraction as measured by TEDS (Tree Edit Distance-based Similarity), scoring 92.1 on cropped tables and 79.3 on full-page documents in the PubTablesV2 benchmark.
- Delivers 85.5% Exact Match (EM) accuracy for semantic Key-Value Pair (KVP) extraction on the VAREX benchmark.
- Utilizes ChartNet, a 1.7-million-sample multimodal training dataset covering 24 chart types rendered with 6 plotting libraries.
- Supports integration with Docling for automated document segmentation, cropping, and end-to-end processing of multi-page PDFs.
- Released under the Apache 2.0 license and available on HuggingFace.
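The LoRA design in the first point means the vision capability is expressed as a low-rank update to frozen base weights, W' = W + (alpha / r) * B * A, rather than as a separate model. A minimal pure-Python sketch of that idea (toy dimensions and values, not the model's real shapes):

```python
# Toy illustration of a LoRA update: the adapted weight is the frozen
# base matrix plus a scaled product of two low-rank factors.
# Dimensions and values are illustrative, not the model's real shapes.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(cols)] for i in range(len(a))]

def lora_adapt(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)                      # rank = number of rows in A
    delta = matmul(b, a)            # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# 2x2 frozen base weight, rank-1 adapter (r = 1)
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]        # r x d_in  = 1 x 2
B = [[0.5], [0.25]]     # d_out x r = 2 x 1
W_adapted = lora_adapt(W, A, B, alpha=1.0)
print(W_adapted)        # base weights shifted by the rank-1 update
```

Because the base matrix W is never modified, unloading the adapter restores exact text-only behavior, which is what makes the fallback in the first bullet possible.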
Technical Details
The model employs a specialized architecture known as DeepStack Injection to handle the dual requirements of semantic understanding and spatial precision. Unlike standard vision-language models (VLMs) that inject visual information at a single point, DeepStack routes abstract visual features into earlier layers for semantic processing while feeding high-resolution spatial features into later layers. This architecture is specifically optimized for tasks where layout and fine-grained detail—such as cell boundaries in tables or precise values in line charts—are critical.
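The routing idea can be illustrated with a toy stack in which different visual feature levels are added into different layers, rather than all being concatenated at the input. This is a conceptual sketch only, not the Granite implementation: the layer count, feature shapes, and additive injection rule are invented for illustration.

```python
# Conceptual sketch of DeepStack-style injection: coarse (semantic) visual
# features enter an early layer, fine-grained (spatial) features enter a
# late layer. The "layer" here is a toy stand-in for a transformer block.

def layer(hidden, weight):
    """Stand-in transformer layer: elementwise scaling of the hidden state."""
    return [h * weight for h in hidden]

def deepstack_forward(hidden, layer_weights, injections):
    """Run a stack of toy layers, adding the visual features assigned to
    each layer index into the hidden state before that layer runs.

    injections maps layer index -> visual feature vector to add there.
    """
    for i, w in enumerate(layer_weights):
        if i in injections:
            hidden = [h + f for h, f in zip(hidden, injections[i])]
        hidden = layer(hidden, w)
    return hidden

text_hidden = [1.0, 1.0]
semantic_feats = [0.5, 0.5]     # coarse features -> early layer 0
spatial_feats = [0.1, -0.1]     # high-resolution features -> late layer 2
out = deepstack_forward(
    text_hidden,
    layer_weights=[1.0, 1.0, 1.0],
    injections={0: semantic_feats, 2: spatial_feats},
)
print(out)
```

The point of the sketch is the `injections` mapping: semantic features influence all subsequent computation, while spatial features arrive late enough to preserve fine-grained detail for the output layers.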
The model's capability in chart interpretation is driven by the ChartNet dataset, which uses a code-guided synthesis pipeline. Each of the 1.7 million samples consists of five aligned components: plotting code, the rendered image, a data table, a natural language summary, and QA pairs. This cross-modal alignment enables the model to reason across visual patterns, numerical data, and natural language.

For deployment, the model can function as a standalone engine for targeted image extraction or as part of a larger pipeline where Docling handles OCR and layout parsing, redirecting cropped visual elements to the Granite Vision model for fine-grained extraction.
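The five-way alignment of a code-guided sample can be pictured as a single record in which the data table is the ground truth and the code, summary, and QA pairs are all derived from it. A hypothetical sketch (the field names and the tiny bar-chart template are invented for illustration; the real pipeline renders the code into an actual image):

```python
# Hypothetical sketch of a code-guided synthetic sample in the style the
# text describes: one data table drives the plotting code, the summary,
# and the QA pairs, so all five modalities stay aligned by construction.

def make_chart_sample(title, rows):
    """Build a toy five-part sample from a list of (label, value) rows."""
    labels = [label for label, _ in rows]
    values = [value for _, value in rows]
    top_label, top_value = max(rows, key=lambda r: r[1])
    code = (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({labels!r}, {values!r})\n"
        f"plt.title({title!r})\n"
        "plt.savefig('chart.png')\n"
    )
    return {
        "plotting_code": code,           # 1. code that draws the chart
        "image_path": "chart.png",       # 2. rendered image (stubbed here)
        "data_table": rows,              # 3. underlying data table
        "summary": f"{title}: {top_label} is highest at {top_value}.",  # 4. NL summary
        "qa_pairs": [                    # 5. QA pairs grounded in the table
            (f"Which category is largest in '{title}'?", top_label),
            (f"What is the value of {top_label}?", str(top_value)),
        ],
    }

sample = make_chart_sample("Quarterly revenue", [("Q1", 10), ("Q2", 14), ("Q3", 9)])
print(sample["summary"])
```

Because every component is generated from the same table, the supervision is consistent across modalities by construction, which is the property the cross-modal alignment claim rests on.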
Impact / Why It Matters
Developers can deploy a highly accurate, resource-efficient extraction engine that integrates into existing text-only pipelines without requiring separate model architectures for multimodal workloads. The modular LoRA design and compatibility with Docling enable scalable, automated document processing with significantly reduced computational overhead for enterprise-scale PDF analysis.
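The document workflow described above amounts to a routing loop: a layout parser segments each page, crops of visual elements go to the vision adapter, and everything else stays in the text path. A hypothetical sketch with stubbed components (all function names and the region schema are invented; a real pipeline would call Docling and the Granite Vision model at these points):

```python
# Hypothetical sketch of the modular pipeline described above: a layout
# parser yields typed page regions, and only chart/table crops are routed
# to the vision adapter while plain text stays in the text-only path.
# Both "models" below are stubs standing in for Docling and Granite Vision.

def parse_layout(page):
    """Stub layout parser: pretend each page is already segmented."""
    return page["regions"]

def vision_extract(region):
    """Stub for the vision adapter: extract structure from a cropped region."""
    return {"type": region["type"], "extracted": f"structured:{region['id']}"}

def text_extract(region):
    """Stub for the text-only path (e.g. parsed text passed through)."""
    return {"type": "text", "extracted": region["content"]}

def process_document(pages):
    """Route each region to the vision or text path based on its type."""
    results = []
    for page in pages:
        for region in parse_layout(page):
            if region["type"] in ("table", "chart"):
                results.append(vision_extract(region))
            else:
                results.append(text_extract(region))
    return results

doc = [{"regions": [
    {"type": "text", "content": "Introduction paragraph."},
    {"type": "table", "id": "t1"},
    {"type": "chart", "id": "c1"},
]}]
print(process_document(doc))
```

The design choice this illustrates is that the expensive vision path runs only on the cropped regions that need it, which is where the reduced computational overhead for multi-page PDFs comes from.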