DRA-GRPO Model Overview
SpiceRL/DRA-GRPO is a 1.5-billion-parameter language model trained with the methodology introduced in the paper "DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models." The model supports a 131,072-token context window, enabling it to process and generate long sequences of text.
Key Capabilities
- Diversity-Aware Reward Adjustment (DRA): Adjusts rewards during reinforcement learning to account for the diversity of the sampled responses within a group, rather than scoring each response in isolation, encouraging the model to explore varied solutions.
- GRPO Framework: Builds on GRPO (Group Relative Policy Optimization), a policy-optimization algorithm that normalizes rewards relative to a group of sampled responses, for stable and effective training.
- R1-Zero-Like Training: Follows a training paradigm inspired by DeepSeek-R1-Zero, applying reinforcement learning directly on top of a base model to improve the alignment and reasoning performance of large language models.
- Extended Context Length: The 131,072-token context window allows for deep contextual understanding and generation over long inputs.
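As a loose illustration (not the paper's exact formulation), diversity-aware reward adjustment on top of GRPO can be sketched as: shrink the reward of each sampled completion by how redundant it is within its group, then compute GRPO-style group-relative advantages. Token-level Jaccard overlap stands in here for whatever similarity measure the method actually uses, and `alpha` is a hypothetical strength parameter.

```python
import statistics

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (a stand-in for a semantic similarity measure)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def diversity_adjusted_advantages(completions, rewards, alpha=0.5):
    """Sketch of diversity-aware reward adjustment in a GRPO-like setting.

    1. Down-weight each reward by the completion's redundancy,
       i.e. its mean similarity to the other completions in the group.
    2. Normalize the adjusted rewards group-relative, GRPO-style:
       (r - mean) / std.
    """
    n = len(completions)
    adjusted = []
    for i, (comp, r) in enumerate(zip(completions, rewards)):
        sims = [jaccard(comp, completions[j]) for j in range(n) if j != i]
        redundancy = sum(sims) / len(sims) if sims else 0.0
        # Redundant completions (high mean similarity) get their reward shrunk.
        adjusted.append(r * (1.0 - alpha * redundancy))
    mean = statistics.fmean(adjusted)
    std = statistics.pstdev(adjusted) or 1.0  # avoid division by zero
    return [(a - mean) / std for a in adjusted]
```

With equal raw rewards, a completion that differs from the rest of its group ends up with the highest advantage, which is the qualitative effect the DRA mechanism aims for.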
Good For
- RLHF Research: Ideal for researchers and developers exploring advanced techniques in reinforcement learning from human feedback.
- Experimental LLM Training: Suitable for experimenting with novel reward modeling and policy optimization strategies in language model development.
- Understanding Diversity in LLMs: Provides a platform to study the impact of diversity-aware training on model outputs and alignment.
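For the last point, one common way to quantify output diversity is the distinct-n metric: the ratio of unique n-grams to total n-grams across a set of sampled generations. A minimal version, offered only as an illustrative measurement tool (the paper may use different metrics):

```python
def distinct_n(texts, n=2):
    """distinct-n: unique n-grams / total n-grams over a set of generations.

    Higher values indicate more diverse outputs; identical samples
    drive the score down.
    """
    ngrams = []
    for t in texts:
        toks = t.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Comparing distinct-n across checkpoints trained with and without the diversity adjustment is a simple way to study its effect on model outputs.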
For more in-depth technical details, refer to the original research paper and the full codebase.