Overview
This model, Qwen3-8B-AEPO-DeepSearch, is an 8-billion-parameter implementation of the Agentic Entropy-Balanced Policy Optimization (AEPO) algorithm. Developed by Guanting Dong and collaborators, AEPO addresses instability in Agentic Reinforcement Learning (RL) by balancing entropy during both the rollout and policy-update phases. It is designed to strengthen the multi-turn, long-horizon tool-use capabilities of web agents, and improves substantially over mainstream agentic RL algorithms.
Key Capabilities
- Entropy-Balanced Rollout: Adaptively allocates global and branch sampling budgets via entropy pre-monitoring, preventing over-branching at high-uncertainty tool-call steps.
- Entropy-Balanced Policy Optimization: Applies a stop-gradient operation to high-entropy clipping and uses entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens.
- Improved Performance: Consistently outperforms 7 mainstream RL algorithms across 14 challenging datasets, achieving strong results on benchmarks such as GAIA, Humanity's Last Exam, and WebWalker using only 1K RL samples.
- Stable Training: Facilitates scalable web agent training by improving rollout sampling diversity while maintaining stable policy entropy.
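The two entropy-balancing mechanisms above can be sketched roughly as follows. This is an illustrative reconstruction, not the reference implementation: the function names, the budget-capping rule (`cap_frac`), and the advantage-reweighting coefficient `alpha` are all assumptions, and the paper's exact estimators may differ.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy of the softmax distribution at each position."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def allocate_branch_budget(step_entropies, total_budget, cap_frac=0.5):
    """Entropy pre-monitoring sketch: give uncertain tool-call steps a
    larger share of the branch sampling budget, capped so that no single
    high-entropy step over-branches (the cap rule is an assumption)."""
    w = np.asarray(step_entropies, dtype=float)
    w = w / w.sum()
    w = np.minimum(w, cap_frac)   # cap any one step's share
    w = w / w.sum()               # renormalize after capping
    return np.maximum(1, np.round(w * total_budget)).astype(int)

def entropy_aware_advantage(advantages, entropies, alpha=0.5):
    """Illustrative entropy-aware advantage: upweight high-uncertainty
    tokens. In an autograd framework the entropy weight would sit under
    a stop-gradient (e.g. torch's .detach()) so it scales the advantage
    but is not itself optimized."""
    e = np.asarray(entropies, dtype=float)
    e_norm = (e - e.min()) / (e.max() - e.min() + 1e-12)
    return np.asarray(advantages) * (1.0 + alpha * e_norm)
```

For example, `allocate_branch_budget([0.2, 1.5, 0.9], total_budget=8)` concentrates branches on the two uncertain steps while still giving the confident step one rollout.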
Good For
- Developing advanced web agents requiring robust multi-turn and long-horizon tool-use.
- Applications where stable and efficient agentic reinforcement learning is crucial.
- Research and development in agentic AI, particularly for tasks involving complex decision-making and tool interaction.
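To make the target setting concrete, a multi-turn, long-horizon tool-use loop of the kind these web agents run looks roughly like this. The policy and search tool below are stubs for illustration only; none of these names come from the AEPO codebase.

```python
def run_agent(policy, tools, question, max_turns=8):
    """Minimal multi-turn tool-use loop: at each turn the policy either
    calls a named tool or emits a final answer; tool results are fed
    back into the context for the next turn."""
    context = [("question", question)]
    for _ in range(max_turns):
        action, payload = policy(context)
        if action == "answer":
            return payload
        result = tools[action](payload)   # e.g. a web search call
        context.append((action, result))
    return None  # horizon exhausted without an answer

# Stub policy and tool purely for illustration: search once, then answer.
def stub_policy(context):
    if any(turn == "search" for turn, _ in context):
        return "answer", "42"
    return "search", "meaning of life"

tools = {"search": lambda query: f"results for {query!r}"}
```

A trained AEPO policy replaces `stub_policy`; the RL algorithm shapes which tool calls it makes across many such turns.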