swordli/Qwen2.5-3B-Base-SAPO

Text Generation · Concurrency Cost: 1 · Model Size: 3.1B · Quant: BF16 · Ctx Length: 32k · Published: Mar 6, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Warm

swordli/Qwen2.5-3B-Base-SAPO is a 3.1 billion parameter model based on the Qwen2.5 architecture, developed by Jian Li et al. It is post-trained with SAPO, a policy optimization method designed to stabilize training for autonomous multi-turn search agents. By enforcing token-level distributional constraints, the model is optimized for search agent performance on complex, real-world question-answering tasks.


Overview of swordli/Qwen2.5-3B-Base-SAPO

This model, developed by Jian Li et al., integrates SAPO, a policy optimization method aimed at improving the stability and performance of autonomous multi-turn search agents. SAPO targets complex, real-world question-answering scenarios by applying a conditional KL penalty during post-training.

Key Capabilities & Features

  • Policy Optimization: Utilizes a novel policy optimization method to stabilize post-training for search agents.
  • Simplified Implementation: Achieves its improvements with a "one line of code" approach, specifically a conditional KL penalty that enforces token-level distributional constraints on low-probability positive tokens.
  • Enhanced Performance: Demonstrates consistent performance gains across various search agents when evaluated on seven challenging QA benchmarks.
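The conditional KL penalty described above can be illustrated with a toy sketch. Note that the function name, probability threshold, and penalty coefficient below are illustrative assumptions for exposition, not the authors' actual implementation:

```python
import math

def conditional_kl_penalty(logp_policy, logp_ref, advantages,
                           prob_threshold=0.5, beta=0.1):
    """Per-token KL-style penalty beta * (logp_policy - logp_ref), applied
    only at positive-advantage tokens whose policy probability falls below
    prob_threshold; all other tokens are left unpenalized."""
    penalties = []
    for lp, lr, adv in zip(logp_policy, logp_ref, advantages):
        prob = math.exp(lp)  # policy probability assigned to this token
        if adv > 0 and prob < prob_threshold:
            penalties.append(beta * (lp - lr))  # constrain toward reference
        else:
            penalties.append(0.0)
    return penalties

# Toy example: two tokens, both with positive advantage.
# Token 0 has high policy probability (0.9) -> no penalty.
# Token 1 is a low-probability positive token (0.1) -> penalized.
pen = conditional_kl_penalty(
    logp_policy=[math.log(0.9), math.log(0.1)],
    logp_ref=[math.log(0.8), math.log(0.2)],
    advantages=[1.0, 1.0],
)
print(pen)
```

The key design point of such a conditional penalty is that it leaves high-probability tokens free to be reinforced by the reward signal, while pulling low-probability positive tokens (which are prone to destabilizing updates) back toward the reference distribution.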

Ideal Use Cases

  • Autonomous Search Agents: Particularly well-suited for developers building or improving multi-turn search agents.
  • Complex QA Systems: Beneficial for applications requiring robust performance on intricate, real-world question-answering tasks.
  • Research in Agent Optimization: Provides a practical method for stabilizing agent training and improving outcomes with minimal code changes.