EssentialAI/eai-distill-0.5b

0.5B parameters · BF16 · 131,072 context length · License: apache-2.0
Overview

EAI-Distill-0.5b: Specialized Document Classification Model

EAI-Distill-0.5b is a compact yet powerful 0.5 billion parameter model developed by Essential AI, fine-tuned from Qwen2.5-0.5B-Instruct. Its primary function is the high-throughput classification of web documents into 12 distinct taxonomic categories, generating structured metadata for large-scale dataset curation.
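To make the classification workflow concrete, here is a minimal sketch of how a prompt for the model might be assembled. The template and dimension names below are illustrative assumptions drawn from the capabilities listed on this card, not the model's documented prompt format; consult the official usage instructions for the exact template.

```python
# Hypothetical prompt construction for EAI-Distill-0.5b.
# ASSUMPTION: this template is illustrative only; the real model expects
# its own documented prompt format.

EXAMPLE_DIMENSIONS = [
    "Free Decimal Correspondence",
    "Bloom's Taxonomy (cognitive process)",
    "Bloom's Taxonomy (knowledge domain)",
    "Document Type",
    "Extraction Artifacts",
    "Missing Content",
    "Reasoning Depth",
    "Technical Correctness",
    "Educational Level",
]

def build_classification_prompt(document: str) -> str:
    """Build a single classification prompt covering all dimensions."""
    dims = "\n".join(f"- {d}" for d in EXAMPLE_DIMENSIONS)
    return (
        "Classify the following web document along these dimensions:\n"
        f"{dims}\n\n"
        f"Document:\n{document}\n\n"
        "Respond with one label per dimension."
    )
```

The model's structured-metadata output would then be parsed per dimension downstream.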

Key Capabilities

  • Comprehensive Classification: Categorizes documents across dimensions such as Free Decimal Correspondence (FDC), Bloom's Taxonomy (cognitive process and knowledge domain), Document Type, Content Quality (extraction artifacts, missing content), and Educational Metadata (reasoning depth, technical correctness, educational level).
  • Teacher-Student Performance: Achieves an average Cohen's κ agreement of 0.71-0.74 with golden annotators (GPT-4o and Claude 3.5 Sonnet), performing within 3% of its 64x-larger teacher model, Qwen2.5-32B-Instruct.
  • Efficient Training: Trained on 82 billion synthetic tokens generated by Qwen2.5-32B-Instruct from 104 million Common Crawl documents, utilizing a sequence length of 16,384 tokens.
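The Cohen's κ agreement metric cited above corrects raw annotator agreement for agreement expected by chance. A minimal self-contained implementation (standard definition; not code from the model's evaluation pipeline):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the chance agreement implied by each annotator's
    marginal label frequencies.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # observed agreement: fraction of items both annotators label the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under independence of the two annotators
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A κ of 0.71-0.74 is conventionally read as substantial agreement.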

Good For

  • Large-scale Web Document Classification: Ideal for processing vast amounts of web data to generate metadata.
  • Dataset Curation: Facilitates taxonomic filtering and organization of datasets.
  • Content Quality Assessment: Useful for identifying extraction artifacts, missing content, and overall quality for training data preparation.
  • Educational Content Analysis: Aids in organizing and analyzing educational materials based on cognitive processes and knowledge domains.

Limitations

The model is optimized for English web documents. Performance may degrade on content that differs significantly from Common Crawl data, and documents exceeding 30,000 characters are automatically chunked, which can affect classification quality.
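For documents over the 30k-character threshold, a simple whitespace-aware chunker is one way to pre-split inputs before classification. This is a sketch of the general technique, not the model's internal chunking logic, and the 30,000-character default is taken from the limitation stated above:

```python
def chunk_document(text: str, max_chars: int = 30_000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring to break
    on whitespace so words stay intact (falls back to a hard split when
    a single run exceeds max_chars)."""
    chunks = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n:
            # back off to the last space inside the window, if any
            ws = text.rfind(" ", start, end)
            if ws > start:
                end = ws
        chunks.append(text[start:end])
        start = end
        while start < n and text[start] == " ":
            start += 1  # skip the delimiter space between chunks
    return chunks
```

Each chunk can then be classified independently and the per-chunk labels aggregated.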