opendatalab/MinerU-HTML
opendatalab/MinerU-HTML is an HTML main-content extraction tool developed by OpenDataLab (Shanghai AI Lab). It is built on a language model fine-tuned from Qwen3 and provides a complete pipeline for identifying and extracting the primary content of an HTML page, combining LLM-based classification with state machine-guided generation. The model emits structured JSON, and the tool includes a fallback mechanism and an evaluation framework, which makes it suitable for web content processing and data extraction tasks.
Overview
opendatalab/MinerU-HTML, also known as Dripper, extracts the main content from HTML pages. Developed by OpenDataLab (Shanghai AI Lab), it uses a language model fine-tuned from Qwen3 to identify which parts of a page are main content. A state machine constrains generation so that the model's output is structured JSON, and a fallback mechanism takes over when extraction fails.
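The idea of state machine-guided generation can be illustrated with a toy example. The sketch below is not the project's implementation: the vocabulary, the states, the label names ("main" / "other"), and the output shape (a JSON object mapping item ids to labels) are all assumptions made for illustration. At each step the state machine decides which tokens are legal, and every other logit is masked to negative infinity before the next token is chosen.

```python
import json
import math

# Hypothetical toy vocabulary; a real tokenizer is far larger.
VOCAB = ['{', '}', ',', ':', '"1"', '"2"', '"main"', '"other"']

# Transitions for states with a single successor ("sep" is handled separately).
NEXT_STATE = {"start": "key", "key": "colon", "colon": "value", "value": "sep"}


def allowed_tokens(state, remaining_ids):
    """Return the set of tokens the state machine permits in `state`."""
    if state == "start":
        return {'{'}
    if state == "key":
        return {remaining_ids[0]}        # the next item id, in order
    if state == "colon":
        return {':'}
    if state == "value":
        return {'"main"', '"other"'}     # the only free choice for the model
    if state == "sep":
        return {','} if remaining_ids else {'}'}
    raise ValueError(f"unknown state: {state}")


def mask_logits(logits, allowed):
    """Keep logits of allowed tokens; set all others to -inf."""
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_decode(logits_fn, item_ids):
    """Greedy decoding where each step is restricted by the state machine."""
    if not item_ids:
        raise ValueError("item_ids must not be empty")
    remaining = list(item_ids)
    state, out = "start", []
    while state != "done":
        masked = mask_logits(logits_fn(out), allowed_tokens(state, remaining))
        token = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        out.append(token)
        if state == "key":
            remaining.pop(0)
        if state == "sep":
            state = "key" if token == ',' else "done"
        else:
            state = NEXT_STATE[state]
    return "".join(out)


def stub_logits(_prefix):
    """Stand-in for a model: unconstrained, it would always emit '}'."""
    return [0.0, 5.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0]


result = constrained_decode(stub_logits, ['"1"', '"2"'])
parsed = json.loads(result)  # parses because the mask enforces the grammar
```

Because the mask removes every illegal token before selection, the output is valid JSON regardless of what the underlying model prefers; the model only influences the label chosen for each item.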
Key Capabilities
- LLM-Powered Extraction: Uses a fine-tuned language model to identify and extract the primary content of an HTML page.
- Structured Output: Employs state machine guidance with logits processing to generate structured JSON output.
- Robustness: Falls back automatically to alternative extraction methods when an error occurs.
- Comprehensive Evaluation: Includes a built-in evaluation framework with ROUGE and item-level metrics for performance assessment.
- Flexible Deployment: Offers a FastAPI-based REST API server for easy integration and supports Ray-based parallel processing for large-scale evaluation.
- Comparison Support: Integrates multiple baseline extractors for comparative analysis.
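The fallback behaviour listed above can be sketched as a small wrapper. This is an assumed shape, not the project's API: `extract_with_fallback`, `baseline_extract`, and the rule that an empty result counts as a failure are hypothetical names and choices introduced for illustration. The baseline here simply collects the text of closed `<p>` elements using Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Collects the text of each closed <p> element."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buf = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join inline fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def baseline_extract(html):
    """Hypothetical rule-based fallback: text of all <p> elements."""
    parser = _ParagraphText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


def extract_with_fallback(html, primary, fallback=baseline_extract):
    """Try `primary`; on an exception or empty result, use `fallback`.

    Returns (text, source) so callers can log which path produced the text.
    """
    try:
        result = primary(html)
        if result and result.strip():
            return result, "primary"
    except Exception:
        # Assumption: any failure of the primary extractor triggers the fallback.
        pass
    return fallback(html), "fallback"


def broken_primary(html):
    """Stand-in for an LLM extractor that is unavailable."""
    raise RuntimeError("model unavailable")


page = "<nav>Home</nav><p>Hello <b>world</b>.</p>"
text, source = extract_with_fallback(page, broken_primary)
```

Returning the source label alongside the text makes it easy to measure how often the fallback path is taken in a large batch.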
Good for
- Web Content Extraction: Ideal for developers needing to programmatically extract main articles, blog posts, or other primary content from diverse HTML structures.
- Data Processing Pipelines: Suitable for integrating into data pipelines that require clean, main content from web pages for further analysis or storage.
- Research and Evaluation: Useful for researchers and developers comparing the performance of different HTML content extraction methods, thanks to its comprehensive evaluation framework and support for various baseline extractors.
- LLM-based Applications: Provides a specialized tool for applications that benefit from LLM-driven content understanding and extraction from web sources.
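For readers comparing extractors, a ROUGE-style score between a reference text and an extractor's output can be computed in a few lines. The sketch below implements ROUGE-L (longest common subsequence) over whitespace-separated tokens. Which ROUGE variant and tokenization the built-in evaluation framework uses is not stated here, so treat this as an illustration of the metric rather than a reproduction of the project's scoring.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L precision, recall, and F1 over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# The candidate recovers the first half of the reference and nothing else.
scores = rouge_l("the cat sat on the mat", "the cat sat")
```

In this example every candidate token appears in the reference in order, so precision is perfect while recall reflects the missing half of the text.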