datalab-to/chandra

VISION | Concurrency Cost: 1 | Model Size: 8B | Quant: FP8 | Ctx Length: 32k | Published: Oct 21, 2025 | License: openrail | Architecture: Transformer | 0.5K | Open Weights | Cold

Chandra by Datalab is an 8-billion-parameter OCR model designed for accurate text extraction from images and PDFs while preserving detailed layout information. It converts documents into Markdown, HTML, and JSON, and handles complex structures such as tables, math, and forms, as well as handwriting. The model is built for comprehensive document understanding and supports over 40 languages.


Chandra: Advanced OCR for Structured Document Extraction

Chandra, developed by Datalab, is an 8 billion parameter OCR model specializing in converting images and PDFs into structured data formats like Markdown, HTML, and JSON. It focuses on high accuracy and preserving detailed layout information, making it suitable for complex document processing tasks.

Key Capabilities

  • Multi-format Output: Generates markdown, HTML, or JSON with comprehensive layout details.
  • Robust Document Understanding: Accurately reconstructs forms, including checkboxes, and handles complex layouts, tables, and mathematical expressions.
  • Handwriting Support: Extracts text from handwritten content as well as printed text.
  • Image and Diagram Extraction: Identifies and extracts images and diagrams, complete with captions and structured metadata.
  • Multilingual Support: Supports over 40 languages for diverse document processing needs.
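
As an illustration of working with the HTML output mentioned above, the following is a minimal sketch that pulls tables out of an HTML string using only the Python standard library. The sample HTML is written by hand for this example and is not real Chandra output; the names TableExtractor and extract_tables are hypothetical helpers, not part of any Chandra package. The sketch assumes tables are not nested.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in an HTML string as a list of rows of cell text.

    Hypothetical helper for illustration; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, each a list of rows
        self._rows = None   # rows of the table being read, or None
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None  # discard a cell left open by malformed markup
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(html):
    """Return all tables found in `html` as nested lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hand-written sample for illustration only (not real model output).
SAMPLE_HTML = """
<h1>Invoice</h1>
<table>
  <tr><th>Item</th><th>Qty</th></tr>
  <tr><td>Widget</td><td>2</td></tr>
</table>
"""

tables = extract_tables(SAMPLE_HTML)
for row in tables[0]:
    print(row)
```

Each table comes back as a list of rows, so it can be passed on to csv.writer or a dataframe library. Real output may carry extra attributes or markup inside cells, which this sketch ignores apart from the cell text.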

Performance Highlights

Chandra v0.1.0 scores 83.1 ± 0.9 overall on the olmocr benchmark. It does particularly well in categories such as "Old Scans Math" (80.3), "Tables" (88.0), "Old Scans" (50.4), and "Long tiny text" (92.3). In overall score, it outperforms several other OCR solutions, including Datalab Marker, Mistral OCR API, Deepseek OCR, GPT-4o, and Gemini Flash 2.

Ideal Use Cases

Chandra is particularly well-suited for applications requiring precise extraction of structured data from a wide range of documents, including academic papers, legal documents, financial reports, and historical archives. Its ability to handle complex layouts and handwriting makes it valuable for digitizing and analyzing challenging content.