opendatalab/MinerU-HTML
Text Generation | Concurrency Cost: 1 | Model Size: 0.8B | Quant: BF16 | Ctx Length: 32k | Published: Nov 26, 2025 | License: apache-2.0 | Architecture: Transformer | Open Weights

opendatalab/MinerU-HTML is an HTML main content extraction tool developed by OpenDataLab. It uses a Large Language Model (LLM) to identify which parts of a web page are primary content, and state machine-guided generation to keep the model's output in a structured JSON format. The project provides a complete extraction pipeline, including a fallback mechanism and evaluation tooling.
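
The overall shape of such a pipeline (label page blocks, validate the structured JSON, fall back when validation fails) can be sketched in a few lines of Python. This is an illustrative sketch only, not MinerU-HTML's actual API: every name here (`BlockCollector`, `parse_labels`, `heuristic_labels`, `extract_main`) is hypothetical, the LLM is replaced by a plain callable that returns a JSON string, and the length-based fallback is an assumed stand-in for whatever fallback the project really uses.

```python
import json
from html.parser import HTMLParser

# Assumption: a small, fixed set of block-level tags is enough for the sketch.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "nav", "footer", "aside"}
BOILERPLATE_TAGS = {"nav", "footer", "aside"}


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs for top-level occurrences of BLOCK_TAGS."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._open = None  # [tag, accumulated text] of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS and self._open is None:
            self._open = [tag, ""]

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            text = " ".join(self._open[1].split())  # normalize whitespace
            self.blocks.append((self._open[0], text))
            self._open = None


def parse_labels(raw, n_blocks):
    """Return {index: label} if raw is valid structured output, else None."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    if set(data) != {str(i) for i in range(n_blocks)}:
        return None  # every block must be labeled exactly once
    if any(v not in ("main", "other") for v in data.values()):
        return None
    return {int(k): v for k, v in data.items()}


def heuristic_labels(blocks):
    """Assumed fallback: long text outside boilerplate tags counts as main."""
    return {
        i: "main" if tag not in BOILERPLATE_TAGS and len(text) >= 40 else "other"
        for i, (tag, text) in enumerate(blocks)
    }


def extract_main(html, label_fn):
    """Run label_fn (the LLM stand-in); fall back if its output is invalid."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    blocks = collector.blocks
    labels = parse_labels(label_fn(blocks), len(blocks))
    used_fallback = labels is None
    if used_fallback:
        labels = heuristic_labels(blocks)
    main = [text for i, (_, text) in enumerate(blocks) if labels[i] == "main"]
    return {"main_content": main, "used_fallback": used_fallback}


PAGE = (
    "<nav>Home | About | Contact</nav>"
    "<h1>Title</h1>"
    "<p>This paragraph is the main body of the article and is fairly long.</p>"
    "<footer>Copyright notice</footer>"
)


def good_model(blocks):
    # Stub that returns well-formed labels for the four blocks in PAGE.
    return json.dumps({"0": "other", "1": "main", "2": "main", "3": "other"})


def bad_model(blocks):
    # Stub that returns malformed output, which triggers the fallback.
    return "not json"


if __name__ == "__main__":
    print(json.dumps(extract_main(PAGE, good_model), indent=2))
    print(json.dumps(extract_main(PAGE, bad_model), indent=2))
```

In this sketch the validation step is what makes the fallback possible: any output that is not a complete, well-formed label map is rejected as a whole rather than partially trusted. Constraining generation with a state machine, as the description mentions, would aim to make such rejections rare by preventing malformed JSON at decoding time.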
