Overview
propella-1-4b: Multilingual Document Annotation for LLM Data Curation
ellamind/propella-1-4b is a 4-billion-parameter model in the propella-1 family, engineered for efficient and accurate text document annotation. It is designed to propel data curation by annotating text along 18 distinct properties organized into six categories: Core Content, Classification, Quality & Value, Audience & Purpose, Safety & Compliance, and Geographic Relevance. These annotations are used to filter, select, and curate large-scale LLM training datasets.
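To make the output shape concrete, here is an illustrative annotation for a single document, sketched as a Python dict and serialized to JSON. The property names and enumerated values below are placeholders chosen for illustration; the actual 18-property schema and its allowed values are defined by the model itself.

```python
import json

# Illustrative only: field names, groupings, and enumerated values are
# placeholders, not the model's actual 18-property schema.
example_annotation = {
    # Core Content
    "language": "de",
    "document_type": "news_article",
    # Classification
    "domain": "politics",
    # Quality & Value
    "content_quality": "high",
    "educational_value": "medium",
    "reasoning": "present",
    # Audience & Purpose
    "target_audience": "general",
    # Safety & Compliance
    "safety": "safe",
    # Geographic Relevance
    "geographic_scope": "national",
    "time_sensitivity": "time_sensitive",
}

print(json.dumps(example_annotation, indent=2))
```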
Key Capabilities
- Comprehensive Annotation: Annotates documents across 18 properties, including content quality, educational value, reasoning indicators, and time-sensitivity.
- Multilingual Support: Proficient in 57 languages, enabling consistent annotation across linguistically diverse datasets.
- High Throughput: Optimized for fast inference, with the 4B model achieving 27.0 documents/second on an H100 GPU in fp8 precision, making it suitable for large-scale operations.
- Flexible Input/Output: Handles various text formats (web pages, PDFs, code, math) and outputs strictly conforming JSON objects with enumerated values.
- Context Length: Supports a 64k context length; truncating input documents at 50k characters is recommended for optimal performance (see the usage sketch after this list).
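A minimal batch-annotation sketch follows. It assumes the model can be served with vLLM using dynamic fp8 quantization (matching the throughput setting quoted above) and that each document is passed as a single user turn through the model's chat template; the actual prompt format, sampling settings, and output schema should be taken from the official usage instructions.

```python
import json

from vllm import LLM, SamplingParams

# Assumption: vLLM with dynamic fp8 quantization and the full 64k context window.
llm = LLM(model="ellamind/propella-1-4b", quantization="fp8", max_model_len=65536)
params = SamplingParams(temperature=0.0, max_tokens=512)

documents = [
    "First raw document text ...",
    "Second raw document text ...",
]

# Truncate each document to the recommended 50k characters and pass it as a
# single user turn (this prompt construction is an assumption, not the
# documented input format).
conversations = [
    [{"role": "user", "content": doc[:50_000]}] for doc in documents
]
outputs = llm.chat(conversations, params)

# The model is documented to return a strictly conforming JSON object per document.
annotations = []
for out in outputs:
    raw = out.outputs[0].text
    try:
        annotations.append(json.loads(raw))
    except json.JSONDecodeError:
        annotations.append(None)  # keep alignment with `documents` on parse failure
```

Submitting all documents in one call lets vLLM batch and schedule them continuously, which is what makes high-throughput, large-scale annotation runs practical.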
Good For
- LLM Training Data Curation: Ideal for preprocessing and filtering vast amounts of text data to create high-quality datasets for LLM training.
- Content Evaluation: Assessing various aspects of text documents, from integrity and quality to safety and relevance.
- Multilingual Data Processing: Projects requiring consistent annotation across a wide array of languages.
- High-Volume Annotation Tasks: Scenarios demanding fast and efficient annotation of numerous documents.