Overview
propella-1-4b: Multilingual Document Annotation for LLM Data Curation
ellamind/propella-1-4b is a 4-billion-parameter model in the propella-1 family, engineered for efficient and accurate text document annotation. It is designed to propel data curation by annotating text along 18 distinct properties organized into six categories: Core Content, Classification, Quality & Value, Audience & Purpose, Safety & Compliance, and Geographic Relevance. These annotations are used to filter, select, and curate large-scale LLM training datasets.
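To make the output shape concrete, here is an illustrative annotation for a single document, sketched as a Python dict and serialized to JSON. The property names and enumerated values below are placeholders chosen for illustration; the actual 18-property schema and its allowed values are defined by the model itself.

```python
import json

# Illustrative only: field names, groupings, and enumerated values are
# placeholders, not the model's actual 18-property schema.
example_annotation = {
    # Core Content
    "language": "de",
    "document_type": "news_article",
    # Classification
    "domain": "politics",
    # Quality & Value
    "content_quality": "high",
    "educational_value": "medium",
    "reasoning": "present",
    # Audience & Purpose
    "target_audience": "general",
    # Safety & Compliance
    "safety": "safe",
    # Geographic Relevance
    "geographic_scope": "national",
    "time_sensitivity": "time_sensitive",
}

print(json.dumps(example_annotation, indent=2))
```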
Key Capabilities
- Comprehensive Annotation: Annotates documents across 18 properties, including content quality, educational value, reasoning indicators, and time-sensitivity.
- Multilingual Support: Proficient in 57 languages, enabling consistent annotation across linguistically diverse datasets.
- High Throughput: Optimized for fast inference, with the 4B model achieving 27.0 documents/second on an H100 GPU in fp8 precision, making it suitable for large-scale operations.
- Flexible Input/Output: Handles various text formats (web pages, PDFs, code, math) and outputs strictly conforming JSON objects with enumerated values.
- Context Length: Supports a 64k context length; truncating input documents at 50k characters is recommended for optimal performance (see the usage sketch after this list).
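A minimal batch-annotation sketch follows. It assumes the model can be served with vLLM using dynamic fp8 quantization (matching the throughput setting quoted above) and that each document is passed as a single user turn through the model's chat template; the actual prompt format, sampling settings, and output schema should be taken from the official usage instructions.

```python
import json

from vllm import LLM, SamplingParams

# Assumption: vLLM with dynamic fp8 quantization and the full 64k context window.
llm = LLM(model="ellamind/propella-1-4b", quantization="fp8", max_model_len=65536)
params = SamplingParams(temperature=0.0, max_tokens=512)

documents = [
    "First raw document text ...",
    "Second raw document text ...",
]

# Truncate each document to the recommended 50k characters and pass it as a
# single user turn (this prompt construction is an assumption, not the
# documented input format).
conversations = [
    [{"role": "user", "content": doc[:50_000]}] for doc in documents
]
outputs = llm.chat(conversations, params)

# The model is documented to return a strictly conforming JSON object per document.
annotations = []
for out in outputs:
    raw = out.outputs[0].text
    try:
        annotations.append(json.loads(raw))
    except json.JSONDecodeError:
        annotations.append(None)  # keep alignment with `documents` on parse failure
```

Submitting all documents in one call lets vLLM batch and schedule them continuously, which is what makes high-throughput, large-scale annotation runs practical.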
Good For
- LLM Training Data Curation: Ideal for preprocessing and filtering vast amounts of text data to create high-quality datasets for LLM training.
- Content Evaluation: Assessing various aspects of text documents, from integrity and quality to safety and relevance.
- Multilingual Data Processing: Projects requiring consistent annotation across a wide array of languages.
- High-Volume Annotation Tasks: Scenarios demanding fast and efficient annotation of numerous documents.