jinaai/ReaderLM-v2
Text generation · Concurrency cost: 1 · Model size: 1.5B · Quant: BF16 · Context length: 32K · Published: Jan 13, 2025 · License: CC BY-NC 4.0 · Architecture: Transformer · Open weights · Status: Warm

ReaderLM-v2 by Jina AI is a 1.54 billion parameter autoregressive, decoder-only transformer that supports a combined input and output length of up to 512K tokens. It specializes in converting raw HTML into well-formatted Markdown or JSON with high accuracy, supporting 29 languages. The model excels at HTML parsing, transformation, and text extraction, particularly at generating complex Markdown elements and structured JSON output.


ReaderLM-v2: HTML-to-Markdown/JSON Conversion

ReaderLM-v2, developed by Jina AI, is a 1.54 billion parameter language model designed for advanced HTML processing. It excels at transforming raw HTML into well-structured Markdown or JSON, with accuracy that surpasses much larger models on these tasks and a combined input and output length of up to 512K tokens.
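Since the model ships as standard open weights, it can be driven like any Hugging Face causal LM. A minimal sketch follows; the instruction string mirrors the pattern described on Jina AI's model card but should be treated as illustrative, not the exact official prompt template.

```python
# Minimal HTML-to-Markdown sketch with Hugging Face transformers.
# Assumes torch and accelerate are installed; the instruction wording
# is illustrative, not necessarily Jina AI's exact prompt template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

html = "<html><body><h1>Hello</h1><p>ReaderLM-v2 test page.</p></body></html>"
prompt = (
    "Extract the main content from the given HTML and convert it to "
    "Markdown format.\n\n" + html
)

# Qwen2.5-based models ship with a chat template, so apply it directly.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(
    input_ids, max_new_tokens=1024, do_sample=True, temperature=0.1
)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```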

Key Capabilities

  • Enhanced Markdown Generation: Produces high-quality Markdown, including complex elements like code fences, nested lists, tables, and LaTeX equations, due to a new training paradigm and improved data.
  • Direct JSON Output: Generates JSON directly from HTML using predefined schemas, streamlining data extraction workflows (see the sketch after this list).
  • Longer Context Handling: Effectively processes long-form content with a combined input and output length of up to 512K tokens.
  • Multilingual Support: Comprehensive support for 29 languages, broadening its applicability across diverse content.
  • Improved Stability: Mitigates degeneration issues during long sequence generation through contrastive loss training.
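
To illustrate the schema-driven workflow from the second bullet, the sketch below embeds a target schema directly in the prompt. The schema fields and instruction wording are assumptions made for illustration; the model card only states that predefined JSON schemas are supplied alongside the HTML.

```python
# Hypothetical schema-guided extraction prompt. The schema fields and
# instruction wording are illustrative assumptions, not Jina AI's
# exact prompt format.
import json

ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "date": {"type": "string"},
        "body": {"type": "string"},
    },
}

def build_json_prompt(html: str, schema: dict = ARTICLE_SCHEMA) -> str:
    """Combine instruction, target schema, and raw HTML into one prompt."""
    return (
        "Extract the specified information from the HTML and return it "
        "as JSON matching this schema.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\nHTML:\n{html}"
    )
```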

Performance Highlights

ReaderLM-v2 demonstrates strong performance, outperforming substantially larger models on HTML-to-Markdown conversion with a ROUGE-L score of 0.84 and a normalized Levenshtein distance of 0.22 (lower is better). On HTML-to-JSON extraction, it achieves an F1 score of 0.81 and a pass rate of 0.98, indicating high accuracy and reliability in structured data extraction.
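
For context on these metrics, here is one plausible way to compute them. The exact evaluation setup is an assumption on our part: ROUGE-L is taken from the rouge-score package and the Levenshtein distance is normalized by the longer string's length.

```python
# One plausible way to compute the reported metrics; the exact
# evaluation setup is an assumption, not documented on this page.
import Levenshtein                     # pip install python-Levenshtein
from rouge_score import rouge_scorer   # pip install rouge-score

def evaluate_markdown(prediction: str, reference: str) -> dict:
    # ROUGE-L F-measure: higher is better (1.0 = perfect overlap).
    scorer = rouge_scorer.RougeScorer(["rougeL"])
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    # Length-normalized Levenshtein distance: lower is better (0.0 = identical).
    lev = Levenshtein.distance(prediction, reference) / max(
        len(prediction), len(reference), 1
    )
    return {"rouge_l": rouge_l, "levenshtein": lev}

print(evaluate_markdown("# Title\nSome text.", "# Title\n\nSome text."))
```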

Training Details

Built on Qwen2.5-1.5B-Instruct, ReaderLM-v2 was trained through a multi-stage pipeline. Jina AI first assembled html-markdown-1m, a dataset of one million HTML documents, and generated synthetic training data with Qwen2.5-32B-Instruct through successive drafting, refinement, and critique passes. The model then underwent long-context pretraining, supervised fine-tuning, direct preference optimization, and self-play reinforcement tuning.

Popular Sampler Settings

The three most popular parameter combinations used by Featherless users for this model cover the following sampler settings: temperature, top_p, top_k, frequency_penalty, presence_penalty, repetition_penalty, and min_p.
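
As an illustration of how such settings are typically supplied, the sketch below posts a request to an OpenAI-compatible chat completions endpoint. The URL path, API key placeholder, and parameter values are assumptions for illustration, not a recorded user configuration.

```python
# Sketch of supplying sampler settings through an OpenAI-compatible
# chat completions endpoint. The URL path, API key placeholder, and
# parameter values are assumptions, not a recorded user configuration.
import requests

response = requests.post(
    "https://api.featherless.ai/v1/chat/completions",  # assumed endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "jinaai/ReaderLM-v2",
        "messages": [{
            "role": "user",
            "content": "Convert this HTML to Markdown: <p>Hello world</p>",
        }],
        # Low temperature suits deterministic extraction tasks like this one.
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.05,  # extension parameter; host support varies
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```

Low-temperature, high-top_p settings are a natural fit for this model, since HTML conversion rewards faithfulness to the source over creative variation.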