Optical Character Recognition (OCR) acts as a bridge between the physical and digital worlds, capturing the essence of printed text and giving it life in the digital realm. It’s the technology that lets your phone instantly translate foreign menu items, helps historians preserve ancient literature, and saves countless hours of manual data entry.
At its core, OCR converts images of text into machine-readable formats, facilitating use-cases like document conversion and understanding. Document conversion involves transforming scanned text into formats like Markdown, Word, or PDF, while document understanding goes beyond simple text recognition to extract meaningful insights and structure from the content.
Enter SmolDocling, a multimodal Image-Text-to-Text model designed for efficient document processing. At only 256 million parameters, it provides full-page conversion while preserving document layout, structure, and spatial positioning. Its small size offers a cost-effective option for OCR by requiring less computational power and memory, ideal for rapid prototyping and edge devices.
For optimal performance, SmolDocling requires NVIDIA GPUs with CUDA support. DigitalOcean GPU Droplets with H100 GPUs offer the computational power needed for efficient processing at production scale.
SmolDocling processes documents using DocTags, a markup format that captures context and layout information. It preserves formatting with bounding box detection and supports specialized recognition for code, formulas, charts, tables, and figures. The model maintains document structure by handling list grouping and caption linking.
The DocTags format identifies the type, location, and content of elements like text, tables, pictures, and code. DocTags uses nested tags to preserve relationships between elements - captions within pictures, items within lists, and specialized Optimized Table-Structure Language (OTSL) notation for table structures. This approach maintains both the visual organization and semantic structure of complex documents, making SmolDocling effective for end-to-end document conversion tasks.
Tag Type | Description |
---|---|
XML-like Syntax | Uses XML-style notation with opening/closing tags for text blocks and standalone tags for instructions (e.g., <text>hello world</text>, <page_break>) |
Document Structure | Complete DocTags snippets enclosed in <doctag>…</doctag> can represent single or multiple pages separated by <page_break> tags |
Block Type Tags | <text>, <caption>, <footnote>, <formula>, <title>, <page_footer>, <page_header>, <picture>, <section_header>, <document_index>, <code>, <otsl>, <list_item>, <ordered_list>, <unordered_list> |
Location Encoding | Elements can include nested location tags specifying bounding box coordinates: <loc_x1><loc_y1><loc_x2><loc_y2> (0-500 grid system) |
Table Structure | Incorporates OTSL vocabulary for tables with extensions: <fcel> (full cell), <ecel> (empty cell), <ched> (column headers), <rhed> (row headers), <srow> (table sections) |
List Handling | <list_item> elements within <ordered_list> or <unordered_list> define list type |
Captions | <picture> and <otsl> elements can encapsulate a <caption> tag for descriptive information |
Code Handling | <code> elements preserve formatting and include <_programming-language_> classification tag (57 languages supported) |
Image Classification | <picture> elements include <image_class> tags for 20+ image types including charts, diagrams, codes, etc. |
Uniform Representation | Cropped page elements maintain the same DocTags representation as full-page counterparts |
Additional features of SmolDocling are summarized below:
Feature | Description |
---|---|
OCR + Layout Preservation | Extracts text while maintaining spatial structure |
Specialized Recognition | Supports code blocks, formulas, tables, and charts |
Full-Page Conversion | Processes all elements simultaneously |
Fast Inference | 0.35 seconds per page on A100 GPUs |
DocTags Markup | Captures document content and layout in a structured format |
SmolDocling integrates with Docling for import and export flexibility. The roadmap includes one-shot multi-page inference, enhanced chart recognition, and chemical structure detection.
Let’s go over the process of how SmolDocling converts images of document pages to DocTags sequences. First, the input images are processed through a vision encoder and then reshaped using projection and pooling techniques. Next, these processed image embeddings are combined with the text embeddings derived from the user’s prompt, interleaving them. Finally, this combined sequence is fed into an LLM to autoregressively generate the DocTags sequence.
Datasets used to train to emphasize document understanding and image captioning include The Cauldron, Docmatix, and MathWriting.
SmolDocling competes with models up to 27 times larger while reducing computational requirements. It performs well on business documents, academic papers, technical reports, patents, and forms. Unlike many OCR models that focus on scientific papers, SmolDocling is designed for a wide range of document types.
The model is available now, and additional datasets for charts, tables, equations, and code recognition will be released soon.
To run SmolDocling, set up a DigitalOcean GPU Droplet, select AI/ML and choose the NVIDA H100 option.
Launch the model in the web console using vLLM:
This script processes a PDF file and outputs structured document data in JSON format.
An alternative is to use the Transformers library from HuggingFace.
In this tutorial, we’ve explored SmolDocling, a compact yet powerful vision language model designed specifically for document conversion tasks. By leveraging its unified DocTags output format, we’ve seen how SmolDocling efficiently handles various document types—from standard text to complex forms and even code listings—while requiring significantly fewer computational resources than larger alternatives. The model’s balanced approach to efficiency and accuracy makes it an excellent choice for developers and organizations seeking to implement document understanding capabilities.
https://ritvik19.medium.com/papers-explained-333-smoldocling-a788ac739b92
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
This textbox defaults to using Markdown to format your answer.
You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!