dongguanting/Qwen3-14B-ARPO-DeepSearch

Text Generation | Concurrency Cost: 1 | Model Size: 14B | Quant: FP8 | Ctx Length: 32k | Tool Calling: Supported | Published: Jul 24, 2025 | License: MIT | Architecture: Transformer | Open Weights

The dongguanting/Qwen3-14B-ARPO-DeepSearch model is a 14-billion-parameter large language model based on Qwen3, developed by Guanting Dong and collaborators. It is fine-tuned with Agentic Reinforced Policy Optimization (ARPO), an agentic RL algorithm designed for multi-turn LLM-based agents. The model targets computational reasoning, knowledge reasoning, and deep search tasks, balancing intrinsic reasoning against multi-turn tool interactions and achieving improved performance with half the tool-use budget of existing methods. Its primary strength is complex, multi-step reasoning that requires external tool use.

Overview

Qwen3-14B-ARPO-DeepSearch is a 14-billion-parameter model built on Qwen3 and fine-tuned with Agentic Reinforced Policy Optimization (ARPO). ARPO, developed by Guanting Dong and collaborators, is an agentic reinforcement learning algorithm for training multi-turn LLM-based agents. It addresses the challenge of balancing an LLM's intrinsic long-horizon reasoning with its proficiency in multi-turn tool interactions.

Key Capabilities & Innovations

  • Entropy-based Adaptive Rollout: ARPO uses an adaptive rollout mechanism that balances global trajectory sampling with step-level sampling. Because the entropy of generated tokens rises after external tool interactions, the mechanism samples more at those high-uncertainty steps to promote exploration.
  • Advantage Attribution Estimation: ARPO adds advantage attribution estimation, which lets the LLM internalize advantage differences across stepwise tool-use interactions and thereby improves decision-making in complex sequences.
  • Efficient Tool Usage: Across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search, ARPO achieves superior performance while using only half the tool-use budget required by existing methods.
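
The entropy-based rollout idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the linear branching rule, and the `base`, `scale`, `threshold`, and `branch_width` parameters are all assumptions chosen for illustration. It only shows the general pattern of comparing token entropy before and after a tool call and spending extra samples where uncertainty has risen.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average entropy over the first few tokens generated at a step."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def branch_probability(initial_entropy: float, post_tool_entropy: float,
                       base: float = 0.2, scale: float = 0.5) -> float:
    """Hypothetical rule: branch more often when entropy rose after a tool call."""
    delta = post_tool_entropy - initial_entropy
    return min(1.0, max(0.0, base + scale * delta))


def extra_samples(prob: float, remaining_budget: int,
                  threshold: float = 0.5, branch_width: int = 2) -> int:
    """Number of step-level branches to add, capped by the remaining rollout budget."""
    if prob < threshold:
        return 0
    return min(branch_width, remaining_budget)


# Illustrative distributions: confident before the tool call, uncertain after it.
before_tool = [[1.0, 0.0, 0.0, 0.0]]
after_tool = [[0.25, 0.25, 0.25, 0.25]]
p = branch_probability(mean_entropy(before_tool), mean_entropy(after_tool))
branches = extra_samples(p, remaining_budget=4)
```

Under these assumed settings, a step whose entropy rises after the tool result gets additional step-level samples, while a step whose entropy stays flat or drops continues as a single trajectory.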

Use Cases

This model is particularly well-suited for applications requiring:

  • Multi-turn agentic reasoning: Complex problems that the LLM solves by interacting with external tools over multiple steps.
  • Computational and knowledge reasoning: Tasks that demand logical deduction and access to external knowledge.
  • Deep search applications: Tasks where efficient and effective use of search tools is critical for completion.
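
A multi-turn tool loop with a fixed tool-use budget might look like the Python sketch below. Everything here is a stated assumption rather than the model's documented interface: the `<search>`, `<python>`, `<result>`, and `<answer>` tags, the `run_agent` helper, and the `fake_generate` stub (which stands in for a real call to the model) are hypothetical and exist only to make the control flow concrete.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical tag format for tool calls; the real prompt format may differ.
TOOL_CALL = re.compile(r"<(search|python)>(.*?)</\1>", re.DOTALL)


def run_agent(generate: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_tool_calls: int = 3) -> Tuple[str, int]:
    """Alternate between model output and tool results until the model
    stops calling tools or the tool-use budget is spent."""
    transcript = question
    calls = 0
    while True:
        output = generate(transcript)
        transcript += output
        match = TOOL_CALL.search(output)
        if match is None or calls >= max_tool_calls:
            return transcript, calls
        name, argument = match.group(1), match.group(2).strip()
        # Feed the tool result back so the next turn can reason over it.
        transcript += f"<result>{tools[name](argument)}</result>"
        calls += 1


def fake_generate(transcript: str) -> str:
    """Stub standing in for the model: search once, then answer."""
    if "<result>" not in transcript:
        return "<search>capital of France</search>"
    return "<answer>Paris</answer>"


tools = {"search": lambda query: "Paris is the capital of France."}
transcript, calls = run_agent(fake_generate, tools, "What is the capital of France?")
```

In a real deployment, `generate` would wrap a request to the served model and `tools` would map to actual search or code-execution backends; `max_tool_calls` is where a tool-use budget would be enforced.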