ByteDance/Dolphin-v2
ByteDance's Dolphin-v2 is a 3-billion-parameter universal document parsing model built on the Qwen2.5-VL-3B backbone, designed to extract information from both digital-born and photographed documents. It uses a two-stage architecture with scalable anchor prompting to classify document types and perform layout analysis, and supports 21 element categories, including code blocks and formulas. The model localizes elements with absolute pixel coordinates and achieves an overall score of 89.45 on OmniDocBench v1.5.
Dolphin-v2: Universal Document Parsing
Dolphin-v2, developed by ByteDance, is a 3-billion-parameter document parsing model built on the Qwen2.5-VL-3B backbone. It processes both digital-born and photographed documents, including those with realistic distortions, using a document-type-aware two-stage architecture with scalable anchor prompting.
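The two-stage, type-aware flow can be pictured with a small control-flow sketch. This is not the model's actual API: the `analyze`, `parse_element`, and `parse_page` callables, the `"digital"`/`"photographed"` labels, and the `Element`/`PageAnalysis` types are all hypothetical placeholders for the real model calls. It only illustrates the idea that stage 1 yields a document type plus a layout, and stage 2 either parses elements in parallel or parses the page as a whole.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Element:
    category: str                    # e.g. "table", "formula", "code" (assumed labels)
    bbox: Tuple[int, int, int, int]  # absolute pixels: (x1, y1, x2, y2)


@dataclass
class PageAnalysis:
    doc_type: str            # hypothetical labels: "digital" or "photographed"
    elements: List[Element]  # layout elements in reading order


def parse_document(
    page: object,
    analyze: Callable[[object], PageAnalysis],
    parse_element: Callable[[object, Element], str],
    parse_page: Callable[[object], str],
    max_workers: int = 4,
) -> List[str]:
    # Stage 1: classify the document type and obtain the layout.
    analysis = analyze(page)

    # Stage 2a: photographed pages are parsed holistically, in one pass.
    if analysis.doc_type == "photographed":
        return [parse_page(page)]

    # Stage 2b: digital pages are parsed element by element, in parallel.
    # pool.map returns results in input order, so reading order is preserved.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda el: parse_element(page, el), analysis.elements))
```

In a real integration, the three callables would wrap prompts sent to the model; here they are injected so the routing logic can be exercised with stubs.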
Key Capabilities
- Universal Document Support: Handles a wide array of document types, including scanned and distorted images.
- Expanded Element Coverage: Supports 21 distinct element categories, such as hierarchical headings, paragraphs, mathematical formulas (LaTeX), tables (HTML), and code blocks with indentation preservation.
- Enhanced Precision: Achieves accurate spatial localization through the use of absolute pixel coordinates.
- Hybrid Parsing Strategy: Employs element-wise parallel parsing for digital documents and holistic page-level parsing for photographed documents.
- Specialized Modules: Includes dedicated parsing for complex elements like formulas (P_formula), code (P_code), and tables (P_table).
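As a rough illustration of how absolute pixel coordinates and the specialized modules might be used together downstream, the sketch below clamps a predicted box to the image bounds, crops the region directly by pixel index, and picks a module name by element category. The module names P_formula, P_code, and P_table come from the list above; the category keys, the `P_generic` fallback, and the list-of-rows image representation are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in absolute pixels

# Module names are from the model description; the category keys are assumed.
SPECIALIZED: Dict[str, str] = {
    "formula": "P_formula",
    "code": "P_code",
    "table": "P_table",
}


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds and order its corners."""
    x1, y1, x2, y2 = box
    x1, x2 = sorted((max(0, min(x1, width)), max(0, min(x2, width))))
    y1, y2 = sorted((max(0, min(y1, height)), max(0, min(y2, height))))
    return (x1, y1, x2, y2)


def crop(pixels: List[List[int]], box: Box) -> List[List[int]]:
    """Crop an image stored as rows of pixel values.

    Absolute coordinates index the image directly, with no rescaling step.
    """
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    x1, y1, x2, y2 = clamp_box(box, width, height)
    return [row[x1:x2] for row in pixels[y1:y2]]


def route(category: str, default: str = "P_generic") -> str:
    """Pick a parsing module; "P_generic" is a hypothetical fallback name."""
    return SPECIALIZED.get(category, default)
```

With normalized coordinates, the crop would first need to be scaled by the image width and height; absolute coordinates avoid that step and the rounding it introduces.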
Performance Highlights
Dolphin-v2 achieves an overall score of 89.45 on OmniDocBench v1.5, a 14.78-point improvement over its predecessor. Notable scores include 86.72 CDM for formula parsing and 87.02 TEDS for table structure.
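The predecessor's overall score is not stated here, but it follows from the two figures above. The short check below derives it with exact decimal arithmetic; the result is an implied value, not a number quoted from the benchmark.

```python
from decimal import Decimal

overall_v2 = Decimal("89.45")  # Dolphin-v2 overall score on OmniDocBench v1.5
gain = Decimal("14.78")        # reported improvement over the predecessor

# Implied predecessor score, derived rather than quoted.
implied_predecessor = overall_v2 - gain
print(implied_predecessor)  # 74.67
```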
Good For
- Automated data extraction from diverse document formats.
- Converting complex documents (e.g., research papers, technical manuals) into structured data.
- Applications requiring high-accuracy parsing of tables, formulas, and code from images or PDFs.