dongguanting/Qwen2.5-3B-ARPO

Text generation · Model size: 3.1B · Quantization: BF16 · Context length: 32k · Published: Jul 24, 2025 · License: MIT · Architecture: Transformer

dongguanting/Qwen2.5-3B-ARPO is a 3.1 billion parameter Qwen2.5-based language model developed by Guanting Dong and collaborators, fine-tuned with Agentic Reinforced Policy Optimization (ARPO), a reinforcement learning algorithm designed for training multi-turn LLM-based agents. The model targets complex reasoning tasks that involve external tool interactions: it uses an entropy-based adaptive rollout mechanism to improve exploration and efficiency in tool-use scenarios, and reports strong performance across computational reasoning, knowledge reasoning, and deep search benchmarks.


Overview of dongguanting/Qwen2.5-3B-ARPO

dongguanting/Qwen2.5-3B-ARPO is a 3.1 billion parameter model based on the Qwen2.5 architecture, fine-tuned with the novel Agentic Reinforced Policy Optimization (ARPO) algorithm. Developed by Guanting Dong et al., ARPO is specifically engineered for training multi-turn Large Language Model (LLM)-based agents, addressing the challenge of balancing intrinsic long-horizon reasoning with proficiency in multi-turn tool interactions.
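The multi-turn agent setting that ARPO targets can be illustrated with a minimal tool-interaction loop. This is a hedged sketch: the `policy` stub stands in for the fine-tuned model, and the `<tool>`/`<result>` tags plus the `calc:` tool name are illustrative conventions, not the model's actual prompt format.

```python
# Minimal sketch of a multi-turn tool-use loop of the kind ARPO trains for.
# `policy` is a stand-in for the LLM; a real agent would sample from the model.

def policy(history):
    """Toy policy: call the calculator tool once, then answer."""
    if "<result>" not in history:
        return "<tool>calc:2+3</tool>"
    return "final answer: 5"

def run_tool(call):
    """Illustrative calculator tool; parses 'calc:<expr>'."""
    expr = call.split("calc:", 1)[1]
    return str(eval(expr))  # toy only; never eval untrusted input

def agent_loop(question, max_turns=4):
    """Interleave model actions with tool results until a final answer."""
    history = question
    for _ in range(max_turns):
        action = policy(history)
        if action.startswith("<tool>"):
            call = action[len("<tool>"):-len("</tool>")]
            history += action + "<result>" + run_tool(call) + "</result>"
        else:
            return action
    return "max turns reached"

print(agent_loop("What is 2+3?"))  # -> final answer: 5
```

Each tool result is appended to the context before the next model step, which is exactly the multi-turn interaction pattern whose post-tool uncertainty ARPO's rollout mechanism exploits.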

Key Capabilities & Innovations

  • Agentic Reinforcement Learning: Implements a novel RL algorithm tailored for LLM agents in multi-turn scenarios.
  • Adaptive Rollout Mechanism: Incorporates an entropy-based adaptive rollout that dynamically balances global and step-level sampling, promoting exploration in high-uncertainty steps following tool usage.
  • Advantage Attribution Estimation: Enables LLMs to internalize advantage differences in stepwise tool-use interactions, improving decision-making.
  • Enhanced Tool-Use Efficiency: Achieves improved performance on challenging benchmarks while requiring significantly fewer tool calls compared to existing methods.
  • Robust Reasoning: Demonstrates superiority across 13 benchmarks in computational reasoning, knowledge reasoning, and deep search domains.
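The entropy signal behind the adaptive rollout can be sketched as follows. The threshold and the decision rule here are illustrative assumptions, not the paper's actual hyperparameters; the idea is simply that a flat next-token distribution (high entropy, common right after a tool result) triggers extra step-level branching, while a peaked one continues the global rollout.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_branch(probs, threshold=1.0):
    """Branch into additional step-level rollouts when the model is
    uncertain, e.g. immediately after a tool result arrives.
    The threshold value is an illustrative placeholder."""
    return token_entropy(probs) > threshold

# Peaked distribution -> low entropy -> keep the global rollout.
print(should_branch([0.97, 0.01, 0.01, 0.01]))  # False
# Flat distribution after a tool call -> high entropy -> branch.
print(should_branch([0.25, 0.25, 0.25, 0.25]))  # True
```

In the real algorithm this signal is computed from the model's logits at each step; the sketch uses explicit probability lists to keep the mechanism visible.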

Ideal Use Cases

  • Developing LLM-based Agents: Particularly suited for creating agents that require complex, multi-turn interactions and external tool utilization.
  • Automated Reasoning Systems: Applications demanding advanced computational and knowledge reasoning capabilities.
  • Dynamic Environments: Aligning LLM agents with real-time dynamic environments where efficient tool interaction is crucial.
  • Research in Agentic AI: A valuable resource for researchers exploring advanced reinforcement learning techniques for LLMs.