dongguanting/Qwen2.5-7B-ARPO
dongguanting/Qwen2.5-7B-ARPO is a 7.6 billion parameter language model developed by dongguanting, based on the Qwen2.5 architecture. It is trained with Agentic Reinforced Policy Optimization (ARPO), an RL algorithm designed for training multi-turn LLM-based agents. The model is built to balance long-horizon reasoning with multi-turn tool interactions, particularly in computational reasoning, knowledge reasoning, and deep search domains, and is reported to improve performance while using a smaller tool-use budget.
What is dongguanting/Qwen2.5-7B-ARPO?
This model is a 7.6 billion parameter language model from dongguanting, built upon the Qwen2.5 architecture. It is trained with Agentic Reinforced Policy Optimization (ARPO), a Reinforcement Learning (RL) algorithm specifically designed to improve multi-turn interactions for LLM-based agents. ARPO targets a specific problem: LLMs tend to show high uncertainty in the tokens they generate right after an external tool interaction. To address this, it uses an entropy-based adaptive rollout mechanism that balances global trajectory sampling with step-level sampling, so that more exploration is spent on the uncertain steps.
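The entropy-based rollout idea can be pictured with a small sketch. This is not the ARPO implementation: the function names, the use of a mean over a few token distributions, and the `threshold` value are all assumptions made for illustration. It only shows the general shape of the decision, namely comparing token-level entropy after a tool call against an initial baseline and branching extra step-level samples when uncertainty rises.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average entropy over a short window of next-token distributions."""
    return sum(token_entropy(p) for p in distributions) / len(distributions)

def should_branch(initial_dists, post_tool_dists, threshold=0.2):
    """Hypothetical branching rule (an assumption, not ARPO's exact criterion).

    Returns True when the mean entropy of the tokens generated right after
    a tool call exceeds the initial baseline entropy by more than `threshold`,
    signalling that extra step-level samples may be worth spending here.
    """
    delta = mean_entropy(post_tool_dists) - mean_entropy(initial_dists)
    return delta > threshold

# Toy distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]   # low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximum entropy for 4 tokens

branch_after_tool = should_branch([confident], [uncertain])
branch_when_calm = should_branch([uncertain], [confident])
```

In this toy setup the first call signals a branch (uncertainty rose after the tool call) and the second does not. A real rollout loop would take the distributions from the model's logits and would also cap the total sampling budget.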
Key Capabilities & Differentiators
- Agentic RL Algorithm: Implements ARPO to improve multi-turn agentic behavior, particularly after tool usage.
- Uncertainty Handling: Dynamically balances exploration by adapting to high entropy (uncertainty) in token generation following tool interactions.
- Efficient Tool Use: Reported to achieve stronger results across 13 challenging benchmarks (computational reasoning, knowledge reasoning, deep search) while using approximately half the tool-use budget of existing trajectory-level RL algorithms.
- Internalized Advantage: Enables LLMs to internalize stepwise advantage differences in tool-use interactions through advantage attribution estimation.
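One way to picture advantage attribution is the sketch below. It is an illustration under stated assumptions, not the method as implemented for this model: it assumes group-normalized rewards as the trajectory-level advantage, and it assumes that tokens in a prefix shared by several branched trajectories receive the mean advantage of those trajectories, while tokens after the branch point receive each trajectory's own advantage. All names here (`group_normalized_advantages`, `attribute_advantages`, `branch_groups`) are hypothetical.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards):
    """Normalize rewards within a sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

def attribute_advantages(rewards, branch_groups):
    """Assign advantages to shared-prefix and branched segments.

    `branch_groups` is a list of lists of trajectory indices; each inner
    list holds trajectories that share one prefix and then diverge.
    Assumption for illustration: the shared prefix gets the mean advantage
    of its group, and each branched continuation keeps its own advantage.
    """
    advantages = group_normalized_advantages(rewards)
    attributed = {}
    for group in branch_groups:
        shared = mean(advantages[i] for i in group)
        for i in group:
            attributed[i] = {"shared_prefix": shared, "branch": advantages[i]}
    return attributed

# Four rollouts: trajectories 0/1 share one prefix, 2/3 share another.
rewards = [1.0, 0.0, 1.0, 0.0]
result = attribute_advantages(rewards, [[0, 1], [2, 3]])
```

Here the two continuations from each shared prefix end with different rewards, so the shared prefix receives a neutral advantage while the branched segments receive opposite-signed ones. That difference between branches is the stepwise signal the bullet above refers to.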
Should I use this for my use case?
This model is best suited to applications that need multi-turn agentic behavior with frequent external tool interactions. If your use case involves complex reasoning tasks that benefit from tool use, and the agent must keep making good decisions when its uncertainty rises after a tool call, dongguanting/Qwen2.5-7B-ARPO is a reasonable candidate. Its reduced tool-call budget also makes it worth considering where tool calls are costly or rate-limited.