Chandra: Advanced OCR for Structured Document Extraction

Chandra, developed by Datalab, is an 8-billion-parameter OCR model that converts images and PDFs into structured formats such as Markdown, HTML, and JSON. It emphasizes accuracy and preserves detailed layout information, which makes it suitable for complex document processing tasks.

Key Capabilities

Multi-format Output: Generates Markdown, HTML, or JSON with comprehensive layout details.
Robust Document Understanding: Accurately reconstructs forms, including checkboxes, and handles complex layouts, tables, and mathematical expressions.
Handwriting Support: Extracts text from handwritten content with good accuracy.
Image and Diagram Extraction: Identifies and extracts images and diagrams, complete with captions and structured metadata.
Multilingual Support: Supports over 40 languages for diverse document processing needs.

Performance Highlights

Chandra v0.1.0 scores 83.1 ± 0.9 overall on the olmOCR benchmark. Reported category scores include "Old Scans Math" (80.3), "Tables" (88.0), "Old Scans" (50.4), and "Long tiny text" (92.3). Its overall score is higher than those of several other OCR solutions, including Datalab Marker, Mistral OCR API, Deepseek OCR, GPT-4o, and Gemini Flash 2.
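
To make the figures above easier to compare, the following minimal Python sketch computes the range implied by the overall score and ranks the quoted categories against it. It uses only the numbers stated in this section. It assumes the ± value is a symmetric uncertainty margin around the overall score, which the text does not define. The helper names `score_interval` and `rank_categories` are illustrative and not part of any Chandra or benchmark tooling.

```python
# Scores quoted in this section (Chandra v0.1.0, overall and per category).
OVERALL = 83.1
MARGIN = 0.9  # assumption: a symmetric uncertainty margin around the overall score

CATEGORY_SCORES = {
    "Old Scans Math": 80.3,
    "Tables": 88.0,
    "Old Scans": 50.4,
    "Long tiny text": 92.3,
}


def score_interval(score, margin):
    """Return the (low, high) range implied by score ± margin, rounded to 1 decimal."""
    return round(score - margin, 1), round(score + margin, 1)


def rank_categories(scores):
    """Return (name, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


low, high = score_interval(OVERALL, MARGIN)
print(f"Overall: {OVERALL} (range {low} to {high})")

# Show each category with its signed difference from the overall score.
for name, score in rank_categories(CATEGORY_SCORES):
    print(f"{name}: {score} ({score - OVERALL:+.1f} vs overall)")
```

Sorting the categories this way shows which quoted scores sit above the overall figure and which sit below it, which is useful context when judging fit for a particular document type.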

Ideal Use Cases

Chandra is particularly well-suited for applications requiring precise extraction of structured data from a wide range of documents, including academic papers, legal documents, financial reports, and historical archives. Its ability to handle complex layouts and handwriting makes it valuable for digitizing and analyzing challenging content.
