Model Overview

Atla's Selene-1-Mini-Llama-3.1-8B is an 8 billion parameter model, post-trained from Llama-3.1-8B, specifically designed as a small language model-as-a-judge (SLMJ). It demonstrates strong performance in evaluation tasks, achieving results comparable to models ten times its size.

Key Capabilities & Performance

Evaluation Expertise: Excels across 11 benchmarks covering absolute scoring (e.g., harmlessness 1-5), classification (e.g., addressing user query Yes/No), and pairwise preference tasks.
Benchmark Leader: Outperforms GPT-4o on RewardBench, EvalBiasBench, and AutoJ. It is also the #1 8B generative model on RewardBench.
Multilingual Support: Primarily English, but also supports German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Context Length: Features a substantial 128K context length.

Use Cases

This model is ideal for general-purpose evaluation, capable of handling diverse inputs and scoring scales. It generates structured evaluation outputs and provides qualitative critiques with reasoning. Specific applications include:

Absolute Scoring: Evaluating responses on a defined scale.
RAG Hallucination Detection: Identifying instances of hallucination in Retrieval-Augmented Generation (RAG) systems.

Users are encouraged to utilize the provided prompt templates for optimal results and to apply the Llama 3 conversation template.

Overview

Model Overview

Key Capabilities & Performance

Use Cases

Full Model Card (README)