omron-sinicx/DGPO-qwen2.5-0.5b
Text Generation · Model Size: 0.5B · Quantization: BF16 · Context Length: 32k · Published: Mar 18, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

DGPO-qwen2.5-0.5b is a 0.5-billion-parameter language model from omron-sinicx, built on the Qwen2.5 architecture. It is trained with Distillation-Guided Policy Optimization (DGPO), a reinforcement learning framework with integrated knowledge distillation, to give compact language models agentic search behaviors. The model is designed to perform multi-step search reasoning for agentic RAG (Retrieval-Augmented Generation) tasks, achieving large gains over its base model and sometimes surpassing its larger teacher model.


DGPO: Agentic RAG for Compact Models

omron-sinicx/DGPO-qwen2.5-0.5b is a 0.5 billion parameter language model that leverages Distillation-Guided Policy Optimization (DGPO) to imbue compact models with agentic search capabilities. Traditional reinforcement learning (RL) often fails with smaller models due to poor initial outputs, training collapse, and ineffective exploration. DGPO addresses these challenges by combining cold-start knowledge distillation with teacher-guided reinforcement learning.

Key Capabilities & Innovations

  • Stable RL for Compact Models: DGPO's core principle, "Reward if correct, mimic teacher if wrong," provides a stable learning signal, preventing training collapse and enabling efficient exploration even for weak initial models.
  • Two-Phase Framework:
    • Cold-Start Initialization: Uses teacher-generated outputs (TGO) for initial student training, providing high-quality trajectories and preventing early collapse.
    • Distillation-Guided RL: Employs PPO-based RL, rewarding correct answers and applying a selective KL penalty only when the model is incorrect, leading to error-focused learning.
  • Agentic RAG Behavior: Trains models to perform multi-step search reasoning, including explicit <think>, <search>, <information>, and <answer> steps.
  • Performance Gains: Achieves up to a ~55x improvement over the untrained 0.5B base model on QA benchmarks such as NQ, TriviaQA, and HotpotQA. Notably, the DGPO-trained 0.5B student can sometimes surpass its 3B-parameter teacher.
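The "reward if correct, mimic teacher if wrong" rule above can be sketched as a per-rollout training signal. This is an illustrative sketch, not the authors' implementation: `dgpo_signal`, its arguments, and the exact-match correctness check are assumptions, and the real objective is PPO-based over full token sequences rather than a single scalar per rollout.

```python
import math

def dgpo_signal(student_answer, gold_answer,
                student_logprobs, teacher_logprobs, kl_coef=0.1):
    """Sketch of DGPO's selective signal: reward correct rollouts,
    distill toward the teacher only on incorrect ones."""
    correct = student_answer.strip().lower() == gold_answer.strip().lower()
    if correct:
        # Correct answer: pure task reward, no distillation pressure.
        return {"reward": 1.0, "kl_penalty": 0.0}
    # Incorrect answer: zero reward plus a KL(student || teacher) penalty,
    # so learning focuses on the model's errors.
    kl = sum(
        math.exp(s) * (s - t)  # sum_x p_s(x) * log(p_s(x) / p_t(x))
        for s, t in zip(student_logprobs, teacher_logprobs)
    )
    return {"reward": 0.0, "kl_penalty": kl_coef * kl}
```

Because the KL term switches off on correct rollouts, a student that starts to outperform its teacher is never penalized for deviating from it, which is consistent with the 0.5B student sometimes surpassing its 3B teacher.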

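The tag-based agentic loop described above can be sketched as a simple driver. Here `model_step` and `search_fn` are hypothetical callables standing in for the model and the retriever; the tag names match those listed in the bullet, but the loop structure itself is an assumption.

```python
import re

def run_agentic_rag(model_step, search_fn, question, max_turns=4):
    """Drive a multi-step <think>/<search>/<information>/<answer> rollout.

    model_step: transcript -> model's next generated segment (string)
    search_fn:  query string -> retrieved text (string)
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        segment = model_step(transcript)
        transcript += segment
        # Stop as soon as the model commits to a final answer.
        answer = re.search(r"<answer>(.*?)</answer>", segment, re.S)
        if answer:
            return answer.group(1).strip()
        # Otherwise execute any search call and feed the results back,
        # wrapped in <information> tags, for the next turn.
        query = re.search(r"<search>(.*?)</search>", segment, re.S)
        if query:
            docs = search_fn(query.group(1).strip())
            transcript += f"<information>{docs}</information>\n"
    return None  # no answer within the turn budget
```

A rollout then alternates model segments with injected retrieval results until an `<answer>` tag appears or the turn budget runs out.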
Good For

  • Implementing agentic Retrieval Augmented Generation (RAG) in resource-constrained environments.
  • Developing compact language models capable of multi-step search and reasoning.
  • Scenarios where stable reinforcement learning for smaller models is critical.
  • Applications requiring efficient, agent-like information retrieval and synthesis without relying on very large models.