opendatalab/MinerU-HTML

Hugging Face
Text Generation | Concurrency Cost: 1 | Model Size: 0.8B | Quant: BF16 | Ctx Length: 32k | Published: Nov 26, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

opendatalab/MinerU-HTML is an HTML main-content extraction model developed by OpenDataLab (Shanghai AI Lab) and fine-tuned from Qwen3. It provides a complete pipeline that identifies and extracts the primary content of an HTML page using LLM-based classification and state machine-guided generation. The model is optimized for structured JSON output, and the pipeline includes a fallback mechanism and an evaluation framework, which makes it suitable for web content processing and data extraction tasks.


Overview

opendatalab/MinerU-HTML, also known as Dripper, extracts the main content from HTML pages. Developed by OpenDataLab (Shanghai AI Lab) and fine-tuned from Qwen3, it uses a large language model to decide which parts of a page are primary content. A state machine guides generation so that the output is structured JSON, and a fallback mechanism takes over when LLM-based extraction runs into an error.
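The fallback idea can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not Dripper's actual API: the `extract_main` and `fallback_extract` functions, the `llm_call` callable, and the label schema (a JSON object mapping item ids to "main" or "other"). The sketch validates the model's response and switches to a simple rule-based extractor when validation fails.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping tags that usually hold boilerplate."""

    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fallback_extract(html):
    """Hypothetical rule-based fallback: keep text outside boilerplate tags."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def parse_labels(raw):
    """Validate an assumed response schema: {"<item id>": "main" | "other"}."""
    labels = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(labels, dict) or not labels:
        raise ValueError("expected a non-empty JSON object")
    for item_id, label in labels.items():
        if label not in ("main", "other"):
            raise ValueError(f"unexpected label for {item_id!r}: {label!r}")
    return labels


def extract_main(html, items, llm_call):
    """Try LLM-based extraction first; fall back if the response is unusable.

    items:    dict mapping item id -> text of that HTML block (assumed input)
    llm_call: callable taking the HTML and returning a raw JSON string
    Returns (extracted_text, method) where method is "llm" or "fallback".
    """
    try:
        labels = parse_labels(llm_call(html))
        kept = [text for item_id, text in items.items()
                if labels.get(item_id) == "main"]
        if not kept:
            raise ValueError("no item labelled 'main'")
        return "\n".join(kept), "llm"
    except (ValueError, RuntimeError):
        return fallback_extract(html), "fallback"
```

Returning the method alongside the text lets a caller log how often the fallback path is taken, which is one simple way to monitor extraction quality.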

Key Capabilities

  • LLM-Powered Extraction: Identifies and extracts primary content from HTML using a language model fine-tuned from Qwen3.
  • Structured Output: Uses a state machine that constrains generation through logits processing, so the model emits structured JSON.
  • Robustness: Automatically falls back to alternative extraction methods when an error occurs.
  • Evaluation: Includes a built-in evaluation framework with ROUGE and item-level metrics.
  • Flexible Deployment: Provides a FastAPI-based REST API server for integration and supports Ray-based parallel processing for large-scale evaluation.
  • Comparison Support: Integrates multiple baseline extractors for comparative analysis.
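To make the "state machine guidance with logits processing" idea concrete, here is a toy sketch in plain Python. It is not the project's implementation: the token vocabulary, the `toy_model` scoring function, and the output schema (item ids mapped to "main" or "other") are all assumptions. At each step the state machine names the legal tokens, every other logit is set to negative infinity, and the highest remaining logit is chosen, so the result is always valid JSON.

```python
import json
import math


def allowed_tokens(step, num_items):
    """State machine: the set of legal tokens at a given generation step.

    Expected sequence: {  "1" : LABEL ,  "2" : LABEL ,  ...  "N" : LABEL }
    """
    if step == 0:
        return {"{"}
    index, phase = divmod(step - 1, 4)
    item_id = index + 1
    if phase == 0:
        return {f'"{item_id}"'}           # the key is fully determined
    if phase == 1:
        return {":"}
    if phase == 2:
        return {'"main"', '"other"'}      # the only real choice the model makes
    return {","} if item_id < num_items else {"}"}


def build_vocab(num_items):
    """Toy vocabulary: JSON punctuation, labels, ids, and one free-text token."""
    ids = [f'"{i}"' for i in range(1, num_items + 1)]
    return ["{", "}", ",", ":", '"main"', '"other"', "Sure!"] + ids


def toy_model(step, vocab):
    """Stand-in for an LLM forward pass: returns one logit per token."""
    logits = {tok: 0.0 for tok in vocab}
    logits["Sure!"] = 5.0  # an unconstrained model might start chatting
    index, phase = divmod(step - 1, 4)
    if step > 0 and phase == 2:
        # Pretend the model believes only item 2 is main content.
        logits['"main"'] = 2.0 if index + 1 == 2 else 0.5
        logits['"other"'] = 1.0
    return logits


def constrained_decode(num_items):
    """Greedy decoding with illegal tokens masked to -inf at every step."""
    vocab = build_vocab(num_items)
    out = []
    for step in range(1 + 4 * num_items):
        logits = toy_model(step, vocab)
        legal = allowed_tokens(step, num_items)
        masked = {tok: (score if tok in legal else -math.inf)
                  for tok, score in logits.items()}
        out.append(max(masked, key=masked.get))
    return "".join(out)


if __name__ == "__main__":
    print(json.loads(constrained_decode(3)))
```

Although the toy model scores the free-text token highest at every step, masking keeps it out of the output. Only the label positions are actually decided by the model, which is why this style of constraint suits a classification task expressed as JSON.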

Good for

  • Web Content Extraction: Ideal for developers needing to programmatically extract main articles, blog posts, or other primary content from diverse HTML structures.
  • Data Processing Pipelines: Suitable for integrating into data pipelines that require clean, main content from web pages for further analysis or storage.
  • Research and Evaluation: Useful for researchers and developers comparing the performance of different HTML content extraction methods, thanks to its comprehensive evaluation framework and support for various baseline extractors.
  • LLM-based Applications: Provides a specialized tool for applications that benefit from LLM-driven content understanding and extraction from web sources.
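For readers comparing extractors, the two metric families mentioned above (ROUGE and item-level metrics) can be sketched as follows. This is an illustrative approximation and not the project's evaluation code: it assumes whitespace tokenization, uses ROUGE-L (longest common subsequence) as the ROUGE variant, and treats the item-level metric as an F1 score over items labelled "main". The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between reference and extracted text (whitespace tokens)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def item_f1(gold, predicted):
    """F1 over the set of item ids labelled 'main' (assumed label schema)."""
    gold_main = {k for k, v in gold.items() if v == "main"}
    pred_main = {k for k, v in predicted.items() if v == "main"}
    true_pos = len(gold_main & pred_main)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_main)
    recall = true_pos / len(gold_main)
    return 2 * precision * recall / (precision + recall)
```

A text-level score such as ROUGE-L rewards partially correct output, while the item-level score shows whether individual blocks were classified correctly, so reporting both gives a fuller picture when comparing against baseline extractors.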