Overview
satoyutaka/Qwen2.5-7B-AgentBench-V4-BF16 is an advanced agent model built on the Qwen2.5-7B-Instruct architecture and engineered specifically for the AgentBench-comp evaluation environment. This V4 variant prioritizes accuracy and long-context understanding, aiming to resolve complex multi-step tasks without generation errors.
Key Enhancements from V3
- Extended Context Length: Increased from 2048 to 4096 tokens, enabling the model to handle longer ALFWorld trajectories and capture extensive trial-and-error processes.
- "Iron Guard" Protocol Dataset: Trained exclusively on 171 meticulously curated, high-quality trajectories to eliminate hallucinations and formatting errors, replacing the standard SFT dataset.
- Optimized Training Method: Utilizes an optimized SFT approach with batch_size=1, grad_accumulation=4, and validation disabled to maximize learning efficiency on the curated data, achieving a final training loss of 0.192.
- Targeted Logic: Heavily focuses on specific complex planning tasks, including SQL aggregation commands (SUM/COUNT) and intricate ALFWorld navigation patterns, with strategically added Japanese logic ("JP Spice").
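The training setup above can be summarized as a configuration sketch. The field names below are assumptions (the actual training script is not published); they simply mirror the stated hyperparameters — batch size 1, gradient accumulation 4, validation disabled, BF16 weights:

```python
# Illustrative SFT configuration reflecting the hyperparameters stated above.
# Field names are hypothetical; the actual training script is not published.
sft_config = {
    "per_device_train_batch_size": 1,  # batch_size=1, as stated
    "gradient_accumulation_steps": 4,  # grad_accumulation=4
    "do_eval": False,                  # validation disabled
    "bf16": True,                      # BF16 weights, per the model name
}

# With batch size 1 and 4 accumulation steps, gradients are averaged over
# 4 trajectories before each optimizer update.
effective_batch_size = (
    sft_config["per_device_train_batch_size"]
    * sft_config["gradient_accumulation_steps"]
)
print(effective_batch_size)  # → 4
```

A batch size of 1 with accumulation keeps per-step memory low while still smoothing each update over several of the 171 curated trajectories.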
Ideal Use Cases
- AgentBench-comp Evaluation: Designed and optimized for performance within this specific competitive environment.
- Complex Multi-step Reasoning: Excels in tasks requiring sequential decision-making and adherence to strict instructions over long contexts.
- SQL Aggregation and Navigation: Particularly strong in scenarios involving database queries with aggregation and complex environmental exploration like ALFWorld.
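To make the SQL aggregation use case concrete, the snippet below runs the kind of SUM/COUNT queries the model is tuned to emit, against a throwaway in-memory SQLite table. The schema and data are invented for illustration and are not part of AgentBench itself:

```python
import sqlite3

# Toy table standing in for an AgentBench-style database task (schema invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30), ("bob", 20), ("alice", 50)],
)

# Aggregation queries of the kind V4's training emphasizes (SUM/COUNT).
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
alice_orders = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer = 'alice'"
).fetchone()[0]
print(total, alice_orders)  # → 100 2
conn.close()
```

In the evaluation environment the model produces queries like these as actions; the curated trajectories reinforce emitting a single well-formed aggregation statement rather than iterating row by row.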