AtlaAI/Selene-1-Llama-3.3-70B
AtlaAI/Selene-1-Llama-3.3-70B is a 70 billion parameter large language model developed by Atla. Post-trained from Llama-3.3-70B, it functions as a state-of-the-art LLM-as-a-judge, excelling in nuanced, complex real-world evaluations across 11 benchmarks. This model is optimized for general-purpose evaluation tasks, supporting absolute scoring, classification, and pairwise preference with a 128K context length.
Loading preview...
Model Overview
Atla Selene 1 is a 70 billion parameter large language model developed by Atla, post-trained from Llama-3.3-70B. It is designed as a state-of-the-art LLM-as-a-judge, demonstrating frontier-level performance across 11 evaluation benchmarks. Selene 1 notably outperforms models like OpenAI's o1, o3-mini, GPT-4o, Anthropic's Claude 3.5 Sonnet, Meta's Llama 3.3, and DeepSeek's R1 in evaluation tasks.
Key Capabilities & Performance
Selene 1 excels in capturing human judgments on complex evaluations, achieving state-of-the-art results on FLASK, MT-Bench, RewardBench, and Auto-J. Its training involved a combined SFT+DPO objective, similar to Selene Mini, across a wide range of evaluation tasks and scoring criteria. The model supports a 128K context length and is primarily English-centric but also supports German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
It handles three main types of evaluation tasks:
- Absolute scoring: e.g., evaluating harmlessness on a scale.
- Classification: e.g., determining if a response addresses a query (Yes/No).
- Pairwise preference: e.g., identifying the more logically consistent response.
Use Cases & Steerability
Selene 1 is a general-purpose evaluation model that provides structured evaluation outputs and qualitative critiques with reasoning. It is highly steerable, allowing for customizable evaluation criteria, and can assess responses with or without reference responses. Cookbooks are available for common use cases like absolute scoring and RAG hallucination detection. For optimal results, users should utilize the provided prompt templates and apply the Llama 3 conversation template.