omron-sinicx/DGPO-qwen2.5-0.5b
Text Generation · Model Size: 0.5B · Quantization: BF16 · Context Length: 32k · Published: Mar 18, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

DGPO-qwen2.5-0.5b is a 0.5-billion-parameter language model from omron-sinicx, built on the Qwen2.5 architecture. It is trained with Distillation-Guided Policy Optimization (DGPO), a reinforcement learning framework with integrated knowledge distillation, to give compact language models agentic search behaviors. The model is designed to perform multi-step search reasoning for agentic RAG (Retrieval-Augmented Generation) tasks, achieving large gains over its base model and sometimes surpassing its larger teacher model.


DGPO: Agentic RAG for Compact Models

omron-sinicx/DGPO-qwen2.5-0.5b is a 0.5 billion parameter language model that leverages Distillation-Guided Policy Optimization (DGPO) to imbue compact models with agentic search capabilities. Traditional reinforcement learning (RL) often fails with smaller models due to poor initial outputs, training collapse, and ineffective exploration. DGPO addresses these challenges by combining cold-start knowledge distillation with teacher-guided reinforcement learning.

Key Capabilities & Innovations

  • Stable RL for Compact Models: DGPO's core principle, "Reward if correct, mimic teacher if wrong," provides a stable learning signal, preventing training collapse and enabling efficient exploration even for weak initial models.
  • Two-Phase Framework:
    • Cold-Start Initialization: Uses teacher-generated outputs (TGO) for initial student training, providing high-quality trajectories and preventing early collapse.
    • Distillation-Guided RL: Employs PPO-based RL, rewarding correct answers and applying a selective KL penalty only when the model is incorrect, leading to error-focused learning.
  • Agentic RAG Behavior: Trains models to perform multi-step search reasoning, including explicit <think>, <search>, <information>, and <answer> steps.
  • Performance Gains: Achieves up to a ~55x improvement over the untrained 0.5B base model on QA benchmarks such as NQ, TriviaQA, and HotpotQA. Notably, the DGPO-trained 0.5B student can sometimes surpass its 3B-parameter teacher.
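The "reward if correct, mimic teacher if wrong" rule above can be sketched as a per-rollout training signal. This is an illustrative sketch, not the authors' implementation: `dgpo_signal`, its arguments, and the exact-match correctness check are assumptions, and the real objective is PPO-based over full token sequences rather than a single scalar per rollout.

```python
import math

def dgpo_signal(student_answer, gold_answer,
                student_logprobs, teacher_logprobs, kl_coef=0.1):
    """Sketch of DGPO's selective signal: reward correct rollouts,
    distill toward the teacher only on incorrect ones."""
    correct = student_answer.strip().lower() == gold_answer.strip().lower()
    if correct:
        # Correct answer: pure task reward, no distillation pressure.
        return {"reward": 1.0, "kl_penalty": 0.0}
    # Incorrect answer: zero reward plus a KL(student || teacher) penalty,
    # so learning focuses on the model's errors.
    kl = sum(
        math.exp(s) * (s - t)  # sum_x p_s(x) * log(p_s(x) / p_t(x))
        for s, t in zip(student_logprobs, teacher_logprobs)
    )
    return {"reward": 0.0, "kl_penalty": kl_coef * kl}
```

Because the KL term switches off on correct rollouts, a student that starts to outperform its teacher is never penalized for deviating from it, which is consistent with the 0.5B student sometimes surpassing its 3B teacher.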

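The tag-based agentic loop described above can be sketched as a simple driver. Here `model_step` and `search_fn` are hypothetical callables standing in for the model and the retriever; the tag names match those listed in the bullet, but the loop structure itself is an assumption.

```python
import re

def run_agentic_rag(model_step, search_fn, question, max_turns=4):
    """Drive a multi-step <think>/<search>/<information>/<answer> rollout.

    model_step: transcript -> model's next generated segment (string)
    search_fn:  query string -> retrieved text (string)
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        segment = model_step(transcript)
        transcript += segment
        # Stop as soon as the model commits to a final answer.
        answer = re.search(r"<answer>(.*?)</answer>", segment, re.S)
        if answer:
            return answer.group(1).strip()
        # Otherwise execute any search call and feed the results back,
        # wrapped in <information> tags, for the next turn.
        query = re.search(r"<search>(.*?)</search>", segment, re.S)
        if query:
            docs = search_fn(query.group(1).strip())
            transcript += f"<information>{docs}</information>\n"
    return None  # no answer within the turn budget
```

A rollout then alternates model segments with injected retrieval results until an `<answer>` tag appears or the turn budget runs out.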
Good For

  • Implementing agentic Retrieval Augmented Generation (RAG) in resource-constrained environments.
  • Developing compact language models capable of multi-step search and reasoning.
  • Scenarios where stable reinforcement learning for smaller models is critical.
  • Applications requiring efficient, agent-like information retrieval and synthesis without relying on very large models.