ellamind/propella-1-1.7b
ellamind/propella-1-1.7b is a 1.7-billion-parameter multilingual LLM from ellamind's propella-1 family. It annotates text documents across 18 properties in six categories, covering aspects such as content quality, educational value, and safety. It is built for high-throughput inference in data curation pipelines, supports 57 languages, and handles text from sources such as web pages, PDFs, and code.
Overview
ellamind/propella-1-1.7b is a 1.7-billion-parameter model from the propella-1 family, developed by ellamind. It is a small multilingual LLM built for efficient document annotation, with the main goal of supporting the filtering, selection, and curation of LLM training data at scale. The model is trained in fp8 precision, enabling fast and accurate inference.
Key Capabilities
- Comprehensive Annotation: Annotates documents across 18 properties grouped into six categories: Core Content, Classification, Quality & Value, Audience & Purpose, Safety & Compliance, and Geographic Relevance.
- Multilingual Support: Processes text in 57 languages.
- Versatile Input Handling: Supports various text formats including web pages, PDFs, code, and mathematical content.
- High Throughput: Optimized for high-throughput inference, achieving 39.1 documents/second on an H100 GPU in fp8 mode.
- Structured Output: Generates annotations as JSON objects conforming to a predefined schema with enumerated values.
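Because the output is a JSON object with enumerated values, each annotation can be checked mechanically before it enters a curation pipeline. The following is a minimal sketch of such a check. The property names and allowed values in `EXAMPLE_SCHEMA` are hypothetical placeholders and are not the model's actual schema, which should be taken from the model's own documentation.

```python
import json

# Hypothetical schema for illustration only: these property names and
# allowed values are assumptions, not the model's real schema.
EXAMPLE_SCHEMA = {
    "content_quality": {"low", "medium", "high"},
    "educational_value": {"none", "some", "high"},
    "contains_pii": {"yes", "no"},
}


def validate_annotation(raw: str, schema: dict[str, set[str]]) -> list[str]:
    """Return a list of problems found in a raw JSON annotation string.

    An empty list means the annotation parsed as a JSON object and every
    property holds one of its allowed enumerated values.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key, allowed in schema.items():
        if key not in data:
            problems.append(f"missing property: {key}")
            continue
        value = data[key]
        # Only plain strings can match an enumerated value in this sketch.
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"unexpected value for {key}: {value!r}")
    for key in data:
        if key not in schema:
            problems.append(f"unknown property: {key}")
    return problems
```

Documents whose annotation yields a non-empty problem list can be retried or set aside rather than silently passed to downstream filters.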
Performance and Evaluation
The propella-1-1.7b model achieves an overall score of 0.737 when evaluated against Gemini-3-Pro annotations, which are treated as ground truth. Performance holds up in fp8 inference mode, with only a minor score difference compared to bf16. Each property type has its own metric: quadratic weighted kappa (QWK) for ordinal properties, F1 for binary properties, and intersection over union (IoU) for multi-select properties.
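To make the three metrics concrete, here is a minimal pure-Python sketch of each, using their standard textbook definitions. It is not the evaluation code behind the reported score: the function names, the handling of edge cases, and the integer encoding of ordinal labels are assumptions, and the source does not say how per-property scores are combined into the overall figure.

```python
def quadratic_weighted_kappa(reference, predicted, num_classes):
    """QWK for ordinal labels encoded as integers 0..num_classes-1."""
    k, n = num_classes, len(reference)
    observed = [[0] * k for _ in range(k)]
    for r, p in zip(reference, predicted):
        observed[r][p] += 1
    hist_ref = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagreement = expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            disagreement += weight * observed[i][j]
            expected += weight * hist_ref[i] * hist_pred[j] / n
    # Assumed convention: both sides using one identical class counts as 1.0.
    return 1.0 if expected == 0 else 1.0 - disagreement / expected


def binary_f1(reference, predicted):
    """F1 for binary labels encoded as 0/1, with 1 as the positive class."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumed convention: no positives anywhere yields 0.0.
    return 0.0 if denom == 0 else 2 * tp / denom


def multi_select_iou(reference, predicted):
    """Intersection over union of two sets of selected labels."""
    reference, predicted = set(reference), set(predicted)
    union = reference | predicted
    # Assumed convention: two empty selections agree perfectly.
    return 1.0 if not union else len(reference & predicted) / len(union)
```

QWK penalizes a prediction more the further it lands from the reference on the ordinal scale, which is why it suits graded properties better than plain accuracy.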
Good for
- Automated large-scale data curation and filtering for LLM training datasets.
- Rapid, structured annotation of diverse text documents.
- Applications requiring fast and accurate multilingual text analysis for content quality and relevance.
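For sizing a large-scale curation job, the quoted throughput of 39.1 documents/second on one H100 in fp8 mode gives a rough starting point. The sketch below is back-of-the-envelope arithmetic only. It assumes throughput scales linearly with the number of GPUs and that the documents resemble those used for the quoted measurement, neither of which the source confirms. The function name and parameters are illustrative.

```python
# Throughput quoted for a single H100 GPU in fp8 mode.
DOCS_PER_SECOND = 39.1


def estimated_hours(num_documents, num_gpus=1, docs_per_second=DOCS_PER_SECOND):
    """Estimate wall-clock hours to annotate a corpus.

    Assumes linear scaling across GPUs and a constant per-GPU throughput,
    which is an idealization.
    """
    seconds = num_documents / (docs_per_second * num_gpus)
    return seconds / 3600


# One billion documents on a single GPU, then spread over 64 GPUs.
single_gpu_hours = estimated_hours(1_000_000_000)
cluster_hours = estimated_hours(1_000_000_000, num_gpus=64)
```

Under these assumptions a billion documents works out to roughly 7,100 hours on one GPU. Real throughput depends on document length and serving setup, so a measurement on a sample of the target corpus is a better basis for planning.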