SpiceRL/DRA-GRPO
SpiceRL/DRA-GRPO is a 1.5 billion parameter language model from SpiceRL with a 131,072-token context length. It is distinguished by its use of Diversity-Aware Reward Adjustment (DRA) within the GRPO framework, a novel approach to R1-Zero-like training of large language models, and is intended primarily for research into advanced reinforcement learning (RL) techniques for language model post-training.
DRA-GRPO Model Overview
SpiceRL/DRA-GRPO is a 1.5 billion parameter language model that implements the training methodology detailed in the paper "DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models." Its 131,072-token context window enables it to process and generate long sequences of text.
Key Capabilities
- Diversity-Aware Reward Adjustment (DRA): Reweights rewards within each sampled group of completions so that semantically redundant responses contribute less, encouraging more diverse outputs during RL training.
- GRPO Framework: Builds on GRPO (Group Relative Policy Optimization), a policy optimization algorithm that replaces a learned value function with group-normalized rewards, for stable and efficient training.
- R1-Zero-Like Training: Follows a paradigm inspired by DeepSeek-R1-Zero, in which capability is improved by applying reinforcement learning directly to a base model, without a preceding supervised fine-tuning stage.
- Extended Context Length: A 131,072-token context window allows deep contextual understanding and generation over long inputs.
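To make the interaction between the capabilities above concrete, the sketch below shows where a diversity adjustment plugs into GRPO's group-normalized advantages. The diversity weight used here (one minus a completion's mean similarity to its group peers) is a deliberately simplified stand-in for the paper's actual adjustment, and the function name and inputs are hypothetical, chosen for this illustration only.

```python
import numpy as np

def dra_grpo_advantages(rewards, sim, eps=1e-8):
    """Illustrative diversity-aware group-relative advantages.

    rewards: shape (G,), scalar rewards for G completions of one prompt.
    sim: shape (G, G), pairwise semantic similarities in [0, 1].

    NOTE (assumption): the diversity weight below is a simplified proxy,
    not the paper's exact formulation; it only shows where DRA slots
    into the GRPO advantage computation.
    """
    rewards = np.asarray(rewards, dtype=float)
    G = len(rewards)
    # Mean similarity of each completion to the other G-1 completions.
    off_diag = sim * (1.0 - np.eye(G))
    redundancy = off_diag.sum(axis=1) / (G - 1)
    diversity = 1.0 - redundancy          # high when a completion is novel
    adjusted = rewards * diversity        # diversity-adjusted rewards
    # Standard GRPO group normalization, applied to the adjusted rewards.
    return (adjusted - adjusted.mean()) / (adjusted.std() + eps)
```

Under this toy weighting, two completions that earn the same raw reward but are near-duplicates of each other end up with smaller advantages than a novel completion with the same reward, which is the qualitative effect DRA aims for.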
Good For
- RL Post-Training Research: Ideal for researchers and developers exploring advanced reward shaping and policy optimization techniques for language models.
- Experimental LLM Training: Suitable for experimenting with novel reward modeling and policy optimization strategies in language model development.
- Understanding Diversity in LLMs: Provides a platform to study the impact of diversity-aware training on model outputs and alignment.
For more in-depth technical details, refer to the original research paper and the full codebase.