ellamind/propella-1-4b

Parameters: 4B
Precision: BF16
Context length: 32,768 tokens
Date: Jan 10, 2026
License: apache-2.0

Overview

propella-1-4b: Multilingual Document Annotation for LLM Data Curation

ellamind/propella-1-4b is a 4 billion parameter model within the propella-1 family, specifically engineered for efficient and accurate text document annotation. It's designed to propel data curation by categorizing text across 18 distinct properties, which are organized into six categories: Core Content, Classification, Quality & Value, Audience & Purpose, Safety & Compliance, and Geographic Relevance. These annotations are crucial for filtering, selecting, and curating large-scale LLM training datasets.
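As a concrete illustration of the kind of annotation record described above, here is a minimal sketch of handling one. The four property names correspond to properties mentioned on this card, but the snake_case keys and the enumerated values shown are hypothetical placeholders, not the model's actual schema or label set.

```python
import json

# Hypothetical (partial) annotation record: property names are taken from
# this card, but the keys and enumerated values below are illustrative
# placeholders, not the model's actual output schema.
annotation_json = """
{
  "content_quality": "high",
  "educational_value": "medium",
  "reasoning_indicators": "present",
  "time_sensitivity": "evergreen"
}
"""

annotation = json.loads(annotation_json)
print(sorted(annotation.keys()))
# ['content_quality', 'educational_value', 'reasoning_indicators', 'time_sensitivity']
```

In a real pipeline, one such record would be produced per document and stored alongside it for downstream filtering.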

Key Capabilities

  • Comprehensive Annotation: Annotates documents across 18 properties, including content quality, educational value, reasoning indicators, and time-sensitivity.
  • Multilingual Support: Highly proficient in 57 languages, enabling broad application across diverse linguistic datasets.
  • High Throughput: Optimized for fast inference, with the 4B model achieving 27.0 documents/second on an H100 GPU in fp8 precision, making it suitable for large-scale operations.
  • Flexible Input/Output: Handles various text formats (web pages, PDFs, code, math) and outputs strictly conforming JSON objects with enumerated values.
  • Context Length: Supports a 64k context length, with truncation of inputs at 50k characters recommended for optimal performance.
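The truncation recommendation and strict-JSON output behavior above can be sketched as two small helpers. The 50k-character limit comes from this card; the function names and the exact validation strategy are assumptions for illustration.

```python
import json

MAX_CHARS = 50_000  # truncation point recommended on this card

def truncate_document(text: str, limit: int = MAX_CHARS) -> str:
    """Clip overly long documents before sending them for annotation."""
    return text if len(text) <= limit else text[:limit]

def parse_annotation(raw: str) -> dict:
    """The card states the model emits strictly conforming JSON objects;
    fail loudly on anything else rather than silently accepting it."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("annotation must be a JSON object")
    return obj

doc = "x" * 60_000
print(len(truncate_document(doc)))  # 50000
```

Truncating by characters rather than tokens keeps the preprocessing step tokenizer-independent, at the cost of some slack relative to the model's actual context window.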

Good For

  • LLM Training Data Curation: Ideal for preprocessing and filtering vast amounts of text data to create high-quality datasets for LLM training.
  • Content Evaluation: Assessing various aspects of text documents, from integrity and quality to safety and relevance.
  • Multilingual Data Processing: Projects requiring consistent annotation across a wide array of languages.
  • High-Volume Annotation Tasks: Scenarios demanding fast and efficient annotation of numerous documents.
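A typical curation use, per the list above, is filtering a corpus by its annotations. The sketch below assumes an "educational value" property with an ordered low/medium/high label set; that ordering and the snake_case key are hypothetical, so the thresholds would need adapting to the model's real enumerated values.

```python
# Hedged sketch of annotation-driven filtering for LLM training data.
# The label ordering below is an assumption for illustration.
VALUE_ORDER = {"low": 0, "medium": 1, "high": 2}

def keep(annotation: dict, min_level: str = "medium") -> bool:
    """Keep a document whose educational-value label meets the threshold."""
    level = annotation.get("educational_value", "low")
    return VALUE_ORDER.get(level, 0) >= VALUE_ORDER[min_level]

corpus = [
    {"text": "...", "annotation": {"educational_value": "high"}},
    {"text": "...", "annotation": {"educational_value": "low"}},
]
curated = [doc for doc in corpus if keep(doc["annotation"])]
print(len(curated))  # 1
```

The same pattern extends to any of the 18 properties: combine per-property predicates with `and`/`or` to express a curation policy.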