ReaderLM-v2: HTML-to-Markdown/JSON Conversion
ReaderLM-v2, developed by Jina AI, is a 1.54-billion-parameter language model designed for advanced HTML processing. It excels at transforming raw HTML into well-structured Markdown or JSON, offering higher accuracy than its predecessor and an extended context window of up to 512K tokens.
Key Capabilities
- Enhanced Markdown Generation: Produces high-quality Markdown, including complex elements like code fences, nested lists, tables, and LaTeX equations, due to a new training paradigm and improved data.
- Direct JSON Output: Generates JSON directly from HTML using predefined schemas, streamlining data extraction workflows.
- Longer Context Handling: Effectively processes long-form content with a combined input and output length of up to 512K tokens.
- Multilingual Support: Comprehensive support for 29 languages, broadening its applicability across diverse content.
- Improved Stability: Mitigates degeneration issues during long sequence generation through contrastive loss training.
Performance Highlights
ReaderLM-v2 demonstrates strong performance, outperforming larger models in HTML-to-Markdown tasks with a ROUGE-L score of 0.84 and a Levenshtein Distance of 0.22. For HTML-to-JSON tasks, it achieves an F1 Score of 0.81 and a Pass-Rate of 0.98, indicating high accuracy and reliability in structured data extraction.
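For context on the metrics: ROUGE-L is an F-measure over the longest common subsequence of the output and reference tokens (higher is better), while the Levenshtein score is an edit distance normalized by string length (lower is better). An illustrative implementation of both, not Jina's exact evaluation harness:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Wagner-Fischer edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def lcs_len(xs: list, ys: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        cur = [0]
        for j, y in enumerate(ys, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate: str, reference: str) -> float:
    """LCS-based F1 over whitespace tokens (a simplified ROUGE-L)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)
```

By these definitions a ROUGE-L of 0.84 means the generated Markdown shares a long common subsequence with the reference, and a normalized Levenshtein of 0.22 means roughly a fifth of the characters differ.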
Training Details
Built on Qwen2.5-1.5B-Instruct, ReaderLM-v2 was trained through a multi-stage pipeline: creation of the html-markdown-1m dataset of one million HTML documents; synthetic data generation with Qwen2.5-32B-Instruct for drafting, refinement, and critique; long-context pretraining; supervised fine-tuning; direct preference optimization; and self-play reinforcement tuning.
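One stage in the pipeline above, direct preference optimization, trains on (chosen, rejected) output pairs. A toy computation of the standard DPO objective with hypothetical sequence log-probabilities, shown only to illustrate the loss shape, not Jina's training code:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    logp_* are the policy's sequence log-probabilities, ref_* those of a
    frozen reference model; beta scales the implicit reward. Illustrative
    values only -- not ReaderLM-v2's actual training code.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): small when the policy prefers "chosen"
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no margin over the reference, the loss sits at -log(0.5):
no_margin = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# When the policy assigns the chosen answer more probability mass than
# the reference does (and less to the rejected one), the loss drops:
with_margin = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss pushes the policy to widen its log-probability gap between preferred and dispreferred outputs relative to the reference model, which is how the preference stage steers generation without an explicit reward model.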