ellamind/propella-1-4b
ellamind/propella-1-4b is a 4-billion-parameter multilingual large language model from ellamind that annotates text documents across 18 properties in six categories. It targets high-throughput data curation, producing fast and accurate annotations for filtering and selecting LLM training data at scale. It supports 57 languages and handles diverse text formats such as web pages, PDFs, and code, which makes it a good fit for data preprocessing pipelines.
propella-1-4b: Multilingual Document Annotation for LLM Data Curation
ellamind/propella-1-4b is a 4-billion-parameter model in the propella-1 family, built for efficient and accurate annotation of text documents. It annotates each document across 18 properties, organized into six categories: Core Content, Classification, Quality & Value, Audience & Purpose, Safety & Compliance, and Geographic Relevance. These annotations are used to filter, select, and curate large-scale LLM training datasets.
Key Capabilities
- Comprehensive Annotation: Annotates documents across 18 properties, including content quality, educational value, reasoning indicators, and time-sensitivity.
- Multilingual Support: Supports 57 languages, so the same annotation scheme can be applied across linguistically diverse datasets.
- High Throughput: Optimized for fast inference, with the 4B model achieving 27.0 documents/second on an H100 GPU in fp8 precision, making it suitable for large-scale operations.
- Flexible Input/Output: Handles various text formats (web pages, PDFs, code, math) and outputs JSON objects that strictly conform to a fixed schema with enumerated values.
- Context Length: Supports a 64k context length; truncating inputs to 50k characters is recommended for optimal performance.
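The truncation and JSON-output points above can be sketched in a few lines of Python. This is a minimal illustration, not the model's actual interface: the property names and allowed values in `HYPOTHETICAL_SCHEMA` are made up for the example (the real schema defines 18 properties), and `truncate_document` and `parse_annotation` are hypothetical helper names.

```python
import json

MAX_CHARS = 50_000  # recommended truncation length stated in the list above

# Hypothetical schema for illustration only: the real property names and
# enumerated values are defined by the model and are not reproduced here.
HYPOTHETICAL_SCHEMA = {
    "content_quality": {"low", "medium", "high"},
    "educational_value": {"none", "some", "high"},
}


def truncate_document(text: str, max_chars: int = MAX_CHARS) -> str:
    """Cut the document to at most max_chars characters before annotation."""
    return text[:max_chars]


def parse_annotation(raw: str, schema: dict = HYPOTHETICAL_SCHEMA) -> dict:
    """Parse the model's JSON output and check each value against its enum."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("annotation must be a JSON object")
    for key, allowed in schema.items():
        if key not in data:
            raise ValueError(f"missing property: {key}")
        value = data[key]
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"invalid value for {key}: {value!r}")
    return data


if __name__ == "__main__":
    doc = truncate_document("some very long document text " * 5000)
    print(len(doc))
    raw_output = '{"content_quality": "high", "educational_value": "some"}'
    print(parse_annotation(raw_output))
```

Validating against the enumerated values on the consumer side is a defensive choice: even when the output is expected to conform, a cheap check keeps malformed records out of downstream filters.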
Good For
- LLM Training Data Curation: Ideal for preprocessing and filtering vast amounts of text data to create high-quality datasets for LLM training.
- Content Evaluation: Assessing various aspects of text documents, from integrity and quality to safety and relevance.
- Multilingual Data Processing: Projects requiring consistent annotation across a wide array of languages.
- High-Volume Annotation Tasks: Scenarios demanding fast and efficient annotation of numerous documents.
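For planning high-volume runs, the quoted throughput of 27.0 documents/second (4B model, H100, fp8) can be turned into a rough time estimate. The sketch below assumes that throughput holds for your documents and that it scales linearly with the number of GPUs; both are assumptions, and `estimate_hours` is a hypothetical helper name.

```python
DOCS_PER_SECOND = 27.0  # throughput quoted for the 4B model on one H100 in fp8


def estimate_hours(num_docs: int,
                   docs_per_second: float = DOCS_PER_SECOND,
                   num_gpus: int = 1) -> float:
    """Rough wall-clock hours to annotate num_docs.

    Assumes linear scaling across GPUs, which real deployments may not reach.
    """
    if num_docs < 0 or docs_per_second <= 0 or num_gpus <= 0:
        raise ValueError("inputs must be positive")
    seconds = num_docs / (docs_per_second * num_gpus)
    return seconds / 3600.0


if __name__ == "__main__":
    # One million documents on a single GPU, then on eight GPUs
    print(f"{estimate_hours(1_000_000):.2f} hours")
    print(f"{estimate_hours(1_000_000, num_gpus=8):.2f} hours")
```

Actual throughput depends on document length, batching, and the serving stack, so treat the result as an order-of-magnitude figure.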