Name: opendatalab/MinerU-HTML API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: opendatalab

Overview

opendatalab/MinerU-HTML, also known as Dripper, extracts the main content from HTML pages. It was developed by OpenDatalab Shanghai AILab, is fine-tuned on Qwen3, and uses a Large Language Model (LLM) to decide which parts of a page are primary content. A state machine guides generation so that the output is structured JSON, and a fallback mechanism switches to alternative extraction methods when the primary path fails.
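
The state-machine guidance described above can be illustrated with a toy example. This is a minimal sketch, not Dripper's actual implementation: the vocabulary, the states, and the function names `mask_logits` and `constrained_decode` are all hypothetical. It shows the general idea of masking the logits of tokens the current state does not allow, so that even a model that prefers free text can only emit well-formed JSON.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has far more entries.
VOCAB = ['{', '}', '"1"', '"2"', ':', '"main"', '"other"', ',', 'hello']

# Hypothetical state machine: state -> {allowed token: next state}.
TRANSITIONS = {
    "start": {'{': "key"},
    "key": {'"1"': "colon", '"2"': "colon"},
    "colon": {':': "value"},
    "value": {'"main"': "sep", '"other"': "sep"},
    "sep": {',': "key", '}': "done"},
}


def mask_logits(logits, state):
    """Replace the logit of every token the state disallows with -inf."""
    allowed = TRANSITIONS[state]
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]


def constrained_decode(logit_steps):
    """Greedy decoding in which each step is filtered by the state machine."""
    state, out = "start", []
    for logits in logit_steps:
        if state == "done":
            break
        masked = mask_logits(logits, state)
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        token = VOCAB[best]
        out.append(token)
        state = TRANSITIONS[state][token]
    return "".join(out), state
```

In this sketch the token `hello` can carry the highest raw score at every step and still never be emitted, because no state lists it as an allowed transition.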

Key Capabilities

LLM-Powered Extraction: Uses a large language model to identify and extract the primary content of an HTML page.
Structured Output: Employs state machine guidance with logits processing to generate structured JSON output.
Robustness: Features an automatic fallback mechanism to alternative extraction methods in case of errors.
Comprehensive Evaluation: Includes a built-in evaluation framework with ROUGE and item-level metrics for performance assessment.
Flexible Deployment: Offers a FastAPI-based REST API server for easy integration and supports Ray-based parallel processing for large-scale evaluation.
Comparison Support: Integrates multiple baseline extractors for comparative analysis.
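
The fallback behaviour listed under Robustness can be sketched as a simple wrapper. This is an illustration under stated assumptions, not the project's code: `extract_with_fallback`, `heuristic_extract`, and the tag list in `SKIP` are hypothetical, and the fallback here is a plain text collector built on the standard library rather than any of the baseline extractors the project integrates.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping script, style, and page-chrome tags."""

    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def heuristic_extract(html):
    """Hypothetical fallback: join all text found outside skipped tags."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extract_with_fallback(html, primary, fallback=heuristic_extract):
    """Try the primary extractor; on an error or empty result, use the fallback."""
    try:
        result = primary(html)
        if result:
            return result, "primary"
    except Exception:
        # Broad catch is deliberate in this sketch: any failure triggers fallback.
        pass
    return fallback(html), "fallback"
```

Returning the name of the path that produced the result makes it easy to log how often the fallback is used.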

Good for

Web Content Extraction: Ideal for developers needing to programmatically extract main articles, blog posts, or other primary content from diverse HTML structures.
Data Processing Pipelines: Suitable for integrating into data pipelines that require clean, main content from web pages for further analysis or storage.
Research and Evaluation: Useful for researchers and developers comparing the performance of different HTML content extraction methods, thanks to its comprehensive evaluation framework and support for various baseline extractors.
LLM-based Applications: Provides a specialized tool for applications that benefit from LLM-driven content understanding and extraction from web sources.
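
For readers comparing extractors, the ROUGE scoring mentioned above can be approximated as follows. This is a minimal sketch of the standard ROUGE-L F1 measure over whitespace tokens; the function names are hypothetical, and the source does not state which ROUGE variant or tokenization the built-in evaluation framework uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference text and an extracted candidate text."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # share of extracted tokens that match
    recall = lcs / len(ref)       # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)
```

A score computed this way rewards an extractor for recovering the reference content in order and penalizes it for including extra page text.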
