prometheus-eval/prometheus-13b-v1.0

Text Generation · Concurrency Cost: 1 · Model Size: 13B · Quant: FP8 · Ctx Length: 4k · Published: Oct 12, 2023 · License: apache-2.0 · Architecture: Transformer · Open Weights

Prometheus-13b-v1.0 by KAIST AI is a 13 billion parameter language model based on Llama-2-Chat, fine-tuned on 100K feedback examples from the Feedback Collection dataset. It specializes in fine-grained evaluation of long-form responses, outperforming GPT-3.5-Turbo and Llama-2-Chat 70B and performing on par with GPT-4 on evaluation tasks. The model is designed for customized LLM evaluation using reference answers and score rubrics, and can also serve as a reward model for Reinforcement Learning from Human Feedback (RLHF).


Overview

Prometheus-13b-v1.0, developed by KAIST AI, is a 13 billion parameter language model built upon the Llama-2-Chat architecture. It has been extensively fine-tuned using 100,000 feedback examples from the Feedback Collection dataset. This specialized training enables Prometheus to excel in the fine-grained evaluation of long-form responses, a task where it demonstrates performance comparable to GPT-4 and superior to GPT-3.5-Turbo and Llama-2-Chat 70B.

Key Capabilities

  • Fine-grained LLM Evaluation: Prometheus is designed to evaluate other large language models against customized criteria, using a provided instruction, the response to evaluate, a reference answer (score 5), and a detailed score rubric (a minimal usage sketch follows this list).
  • Cost-Effective Alternative to GPT-4: It offers a powerful yet cheaper solution for LLM evaluation, allowing users to define specific criteria such as child readability, cultural sensitivity, or creativity.
  • Reward Model for RLHF: The model can be used effectively as a reward model in Reinforcement Learning from Human Feedback (RLHF) frameworks.
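
The following is a minimal sketch of running the model as an evaluator with the Hugging Face transformers library. The checkpoint id comes from this page; the generation settings (greedy decoding, up to 512 new tokens) are illustrative assumptions rather than official recommendations, and a 13B model in fp16 requires substantial GPU memory.

```python
# Minimal sketch: load Prometheus-13b-v1.0 and generate an evaluation.
# Assumes a CUDA-capable machine; prompt construction is covered under
# "Prompt Format" below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "prometheus-eval/prometheus-13b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def evaluate(prompt: str) -> str:
    """Generate feedback and a score for a fully formatted evaluation prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Strip the prompt tokens so only the generated evaluation remains.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```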

Prompt Format

Prometheus requires a specific input format comprising an instruction, the response under evaluation, a reference answer, and a score rubric with detailed criteria descriptions for scores 1 through 5. This structured input ensures precise, context-aware evaluations. The model's output consists of detailed feedback followed by an integer score from 1 to 5, with the two separated by the [RESULT] delimiter.
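
Below is a sketch of such a template together with a parser for the [RESULT] convention. The section headers and task wording here are paraphrased assumptions based on the structure described above, not the verbatim template; consult the official model card for the exact prompt.

```python
# Illustrative prompt template and output parser, assuming the structure
# described above (instruction, response, reference answer, 1-5 rubric) and
# the "feedback [RESULT] score" output convention. Header wording is an
# assumption; the official model card has the verbatim template.
import re

PROMPT_TEMPLATE = """###Task Description:
An instruction, a response to evaluate, a reference answer that gets a \
score of 5, and a score rubric are given. Write detailed feedback, then an \
integer score between 1 and 5 in the format: \
"Feedback: (feedback) [RESULT] (an integer between 1 and 5)".

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[{criteria}]
Score 1: {score1}
Score 2: {score2}
Score 3: {score3}
Score 4: {score4}
Score 5: {score5}

###Feedback:"""

def parse_result(generation: str) -> tuple[str, int]:
    """Split model output into (feedback, score) at the [RESULT] marker."""
    feedback, _, tail = generation.partition("[RESULT]")
    match = re.search(r"[1-5]", tail)
    if match is None:
        raise ValueError("No integer score found after [RESULT]")
    return feedback.strip(), int(match.group())
```

Because the parsed score is a single scalar, the same output can feed an RLHF pipeline directly as a reward signal, as noted under Key Capabilities.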