yolay/SPEAR-SearchQA-Qwen2.5-7B

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Context Length: 32k · License: apache-2.0 · Architecture: Transformer

yolay/SPEAR-SearchQA-Qwen2.5-7B is a 7.6-billion-parameter agentic LLM developed by Yulei Qin and collaborators, fine-tuned with the SPEAR curriculum-based self-imitation learning framework. The model targets long-horizon, sparse-reward tasks, balancing exploration and exploitation through auxiliary tool-use rewards and the replay of successful trajectories. It performs strongly on complex question-answering benchmarks such as NQ, TriviaQA, and HotpotQA, demonstrating the gains agentic reinforcement learning can deliver in these settings.


Model Overview

yolay/SPEAR-SearchQA-Qwen2.5-7B is a 7.6 billion parameter agentic Large Language Model (LLM) developed by Yulei Qin and collaborators. It is built upon the Qwen2.5-7B-Instruct architecture and fine-tuned using the novel SPEAR (Self-imitation with Progressive Exploration for Agentic Reinforcement Learning) framework. SPEAR is a curriculum-based self-imitation learning (SIL) approach designed to train agentic LLMs on challenging long-horizon, sparse-reward tasks.
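Since the model derives from Qwen2.5-7B-Instruct, it should load like any Qwen2.5-Instruct checkpoint. A minimal sketch, assuming the repository ships the standard Qwen2.5 chat template and that `transformers` and `torch` are installed; the `answer` helper is a name introduced here for illustration:

```python
# Hypothetical helper for querying the model; requires `transformers`
# and `torch`, and downloads the ~7.6B checkpoint on first use.
def answer(question: str, max_new_tokens: int = 256) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer  # lazy import

    model_id = "yolay/SPEAR-SearchQA-Qwen2.5-7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    # Format the question with the chat template inherited from
    # Qwen2.5-Instruct, then generate a completion.
    messages = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```

In an agentic search setup the helper would typically run inside a tool-calling loop rather than single-shot, but the loading and decoding steps are the same.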

Key Capabilities & Training

  • Curriculum-based Self-Imitation Learning (SIL): SPEAR balances exploration and exploitation by initially using auxiliary tool-use rewards for broad skill exploration, then strengthening self-imitation to leverage successful replayed experiences.
  • Adaptive Training: The framework stabilizes training and improves efficiency by adaptively managing entropy and integrating both on-policy and off-policy data from a replay buffer.
  • Agentic Reinforcement Learning: Optimized for multi-turn tool interactions and episode-level reward computation, enabling effective exploration in sparsely rewarded environments.
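The curriculum idea behind these bullets can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the `ReplayBuffer` class, the success threshold, and the linear schedule in `curriculum_weights` are all assumptions chosen to show how emphasis can shift from auxiliary tool-use rewards toward self-imitation on replayed successes.

```python
import random

class ReplayBuffer:
    """Keeps only successful (high-reward) trajectories for self-imitation."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.trajectories = []

    def add(self, trajectory, reward: float, threshold: float = 0.5):
        # Sparse-reward setting: store an episode only if it succeeded.
        if reward >= threshold:
            self.trajectories.append((trajectory, reward))
            self.trajectories = self.trajectories[-self.capacity:]  # bounded

    def sample(self, k: int):
        # Off-policy minibatch of past successes to imitate.
        k = min(k, len(self.trajectories))
        return random.sample(self.trajectories, k)

def curriculum_weights(step: int, total_steps: int):
    """Linearly shift emphasis from auxiliary tool-use reward (exploration)
    to self-imitation on replayed successes (exploitation)."""
    progress = min(step / total_steps, 1.0)
    aux_weight = 1.0 - progress   # auxiliary tool-use reward decays
    sil_weight = progress         # self-imitation term grows
    return aux_weight, sil_weight
```

Early in training the auxiliary weight dominates, pushing the policy to try tools broadly; as successful episodes accumulate in the buffer, the self-imitation weight takes over and the policy exploits what already worked.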

Performance & Use Cases

This model demonstrates enhanced performance on complex question-answering (QA) benchmarks when integrated with the Dr.BoT method. For instance, after 550 training steps, SPEAR-SearchQA-Qwen2.5-7B achieves an average score of 45.4 across NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, MuSiQue, and Bamboogle, outperforming baseline RL methods. It is particularly well suited to applications that require robust agentic behavior and effective problem-solving in environments with delayed or infrequent rewards.