Overview
propella-1-4b: A Specialized Multilingual Annotation Model
ellamind's propella-1-4b is a 4-billion-parameter model from the propella-1 family, engineered for efficient and accurate annotation of text documents. It is part of a series of small, fast, and highly multilingual LLMs (including 0.6B and 1.7B variants) optimized for data curation tasks.
Key Capabilities
- Comprehensive Annotation: Annotates documents across 18 distinct properties, categorized into Core Content, Classification, Quality & Value, Audience & Purpose, Safety & Compliance, and Geographic Relevance. This includes metrics like content integrity, information density, educational value, reasoning indicators, and PII presence.
- Multilingual Support: Capable of processing text in 57 languages, making it suitable for diverse global datasets.
- High Throughput: Designed for high-throughput inference, with the 4B model achieving 27.0 docs/s on an H100 GPU (fp8), significantly reducing the time needed to annotate large datasets.
- Structured Output: Generates annotations as strict JSON objects with enumerated values, ensuring consistency and ease of integration.
- Context Length: Supports a 64k context length, with a recommendation to truncate documents at 50k characters for optimal performance.
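The truncation recommendation and strict-JSON contract above can be sketched in a small pre/post-processing helper. This is a hypothetical sketch, not the model's official API: the property names (`educational_value`, `pii_presence`) and their enumerated values are illustrative placeholders for two of the 18 annotated properties, whose actual schema is defined by the model's output format.

```python
import json

# Hypothetical enumerated values for two of the 18 properties;
# the real schema comes from propella-1-4b's output specification.
ALLOWED = {
    "educational_value": {"none", "low", "medium", "high"},
    "pii_presence": {"no_pii", "possible_pii", "pii_present"},
}

MAX_CHARS = 50_000  # recommended truncation point for the 64k context


def truncate_document(text: str, max_chars: int = MAX_CHARS) -> str:
    """Truncate overly long documents before annotation, per the model card."""
    return text[:max_chars]


def parse_annotation(raw: str) -> dict:
    """Parse a strict-JSON annotation and validate enumerated fields."""
    annotation = json.loads(raw)
    for prop, allowed in ALLOWED.items():
        value = annotation.get(prop)
        if value is not None and value not in allowed:
            raise ValueError(f"{prop}: unexpected value {value!r}")
    return annotation


doc = "x" * 120_000
print(len(truncate_document(doc)))  # 50000

raw = '{"educational_value": "high", "pii_presence": "no_pii"}'
print(parse_annotation(raw)["educational_value"])  # high
```

Because outputs are strict JSON with enumerated values, a validation pass like this can run cheaply over millions of annotations before they feed into downstream filtering.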
Good For
- LLM Training Data Curation: Ideal for filtering, selecting, and curating large-scale datasets for training other language models.
- Automated Content Evaluation: Automatically assessing various aspects of text documents, from quality and safety to audience relevance.
- Multilingual Data Processing: Handling and annotating text content across a wide array of languages efficiently.
- High-Volume Annotation Tasks: Leveraging its speed and optimized inference (including fp8 support) for processing millions of documents.
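For capacity planning on high-volume tasks, the reported 27.0 docs/s figure translates directly into wall-clock estimates. The helper below is a back-of-the-envelope sketch (the function name and multi-GPU linear-scaling assumption are ours, not from the model card):

```python
def annotation_hours(num_docs: int,
                     docs_per_sec: float = 27.0,
                     num_gpus: int = 1) -> float:
    """Estimate wall-clock hours to annotate a corpus at the reported
    single-H100 fp8 throughput, assuming linear scaling across GPUs."""
    return num_docs / (docs_per_sec * num_gpus) / 3600


# 10 million documents on a single H100 at 27 docs/s (fp8):
print(round(annotation_hours(10_000_000), 1))  # 102.9

# The same corpus spread across 8 GPUs, assuming linear scaling:
print(round(annotation_hours(10_000_000, num_gpus=8), 1))  # 12.9
```

Real throughput depends on document length, batching, and serving stack, so treat these numbers as upper-level estimates rather than guarantees.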