dongguanting/Qwen3-8B-AEPO-DeepSearch

Text Generation · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Context Length: 32K · License: MIT · Architecture: Transformer · Open Weights

The dongguanting/Qwen3-8B-AEPO-DeepSearch model is an 8 billion parameter Qwen3-based language model developed by Guanting Dong and collaborators, implementing the Agentic Entropy-Balanced Policy Optimization (AEPO) algorithm. Optimized for multi-turn, long-horizon tool use in web agents, it features a 32,768-token context window. The model balances entropy during agentic reinforcement learning to prevent training collapse and improve rollout sampling diversity, making it well suited to complex agentic tasks.


Overview

This model, Qwen3-8B-AEPO-DeepSearch, is an 8 billion parameter implementation of the Agentic Entropy-Balanced Policy Optimization (AEPO) algorithm. Developed by Guanting Dong and his team, AEPO addresses challenges in Agentic Reinforcement Learning (RL) by balancing entropy during both the rollout and policy update phases. It is specifically designed to enhance the multi-turn, long-horizon tool-use capabilities of web agents, offering a significant improvement over mainstream agentic RL algorithms.
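The entropy balancing applied during the rollout phase can be illustrated with a small sketch: pre-monitor the entropy of the next-token distribution at each tool-call step, then split a global sampling budget across steps while capping any single step's share so that high-uncertainty steps cannot over-branch. The proportional split and the `cap_fraction` parameter below are illustrative assumptions, not AEPO's exact allocation rule.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution.
    Used here as the pre-monitoring signal for a tool-call step."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def allocate_branch_budget(step_entropies, total_budget, cap_fraction=0.5):
    """Split a global rollout budget across tool-call steps in proportion
    to their pre-monitored entropy. Each step gets at least one sample,
    and no step may exceed `cap_fraction` of the budget, so a single
    high-uncertainty step cannot consume everything (over-branching).
    This linear rule is a hypothetical simplification for illustration."""
    total = sum(step_entropies) or 1.0
    cap = max(1, int(total_budget * cap_fraction))
    raw = [int(round(total_budget * h / total)) for h in step_entropies]
    return [min(max(r, 1), cap) for r in raw]
```

In this sketch, a step whose next-token distribution is nearly deterministic (low entropy) is sampled once, while a genuinely uncertain step receives more branches, up to the cap.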

Key Capabilities

  • Entropy-Balanced Rollout: Features a dynamic mechanism that adaptively allocates global and branch sampling budgets through entropy pre-monitoring, preventing over-branching in high-uncertainty tool-call steps.
  • Entropy-Balanced Policy Optimization: Incorporates a stop-gradient operation for high-entropy clipping and uses entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens.
  • Improved Performance: Consistently outperforms 7 mainstream RL algorithms across 14 challenging datasets, achieving strong results on benchmarks such as GAIA, Humanity's Last Exam, and WebWalker with only 1K RL samples.
  • Stable Training: Facilitates scalable web agent training by improving rollout sampling diversity while maintaining stable policy entropy.
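The entropy-aware advantage estimation above can be sketched as a simple reweighting: tokens with higher uncertainty receive a proportionally larger share of the learning signal during the policy update. The linear weighting and the `alpha` parameter are illustrative assumptions rather than AEPO's published estimator, and the stop-gradient clipping step is noted only in a comment.

```python
def entropy_aware_advantages(advantages, entropies, alpha=0.5):
    """Reweight per-token advantages so that high-entropy (high-uncertainty)
    tokens are prioritized during the policy update. The linear scheme and
    `alpha` are illustrative assumptions, not AEPO's exact estimator.
    (AEPO additionally applies a stop-gradient when clipping high-entropy
    tokens; that autograd detail is omitted in this plain-Python sketch.)"""
    lo, hi = min(entropies), max(entropies)
    span = (hi - lo) or 1.0  # avoid divide-by-zero when entropy is flat
    return [a * (1.0 + alpha * (h - lo) / span)
            for a, h in zip(advantages, entropies)]
```

For example, three tokens with equal raw advantages but increasing entropy end up weighted increasingly heavily, steering updates toward the model's uncertain decisions.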

Good For

  • Developing advanced web agents requiring robust multi-turn and long-horizon tool-use.
  • Applications where stable and efficient agentic reinforcement learning is crucial.
  • Research and development in agentic AI, particularly for tasks involving complex decision-making and tool interaction.