Gen-Verse/Qwen3-4B-RA-SFT: Agentic Reasoning Model
Gen-Verse/Qwen3-4B-RA-SFT is a 4-billion-parameter model, fine-tuned from Qwen3-4B-Instruct-2507 and designed for agentic reasoning tasks. Developed by Gen-Verse, it incorporates insights from systematic investigations into agentic Reinforcement Learning (RL), focusing on data quality, training efficiency, and reasoning strategies.
Key Capabilities & Differentiators
- Optimized for Agentic Reasoning: Fine-tuned using a 3K agentic SFT dataset and further enhanced with a 30K agentic RL dataset, emphasizing real end-to-end trajectories and high data diversity.
- Performance: Shows that a 4B model can outperform 32B models on challenging reasoning benchmarks, a result attributed to effective RL recipes.
- Training Efficiency: Uses exploration-friendly techniques such as reward clipping and entropy maintenance to make training more efficient.
- Reasoning Strategy: Employs deliberative reasoning with selective tool calls, an approach found to be more effective than frequent tool invocation or verbose self-reasoning.
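To make the training-efficiency point concrete, the following is a minimal sketch of what reward clipping and an entropy bonus can look like in general. It is not the authors' recipe: the function names, the clipping bounds, and the `entropy_coef` value are all hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def clip_reward(reward: float, low: float = -1.0, high: float = 1.0) -> float:
    """Clamp a scalar reward into [low, high]. The bounds are hypothetical."""
    return max(low, min(high, reward))


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(
    reward: float,
    step_probs: Sequence[Sequence[float]],
    entropy_coef: float = 0.01,  # hypothetical coefficient
) -> float:
    """Clipped reward plus a bonus proportional to the mean per-step entropy.

    The bonus rewards policies that keep some probability mass on
    alternative tokens, which is one common way to preserve exploration.
    """
    if not step_probs:
        return clip_reward(reward)
    mean_entropy = sum(token_entropy(p) for p in step_probs) / len(step_probs)
    return clip_reward(reward) + entropy_coef * mean_entropy
```

Clipping keeps a single extreme reward from dominating an update, while the entropy term discourages the policy from collapsing onto one action too early. How the actual training pipeline combines these ideas is not specified here.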
Ideal Use Cases
- Complex Problem Solving: Suitable for applications requiring sophisticated multi-step reasoning and decision-making.
- Agent-based Systems: Well suited to integration into AI agents that need to plan, execute, and adapt based on environmental feedback.
- Research in Agentic AI: Provides a strong baseline and tool for further research into reinforcement learning for agentic capabilities.
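As an illustration of the plan, execute, and adapt pattern mentioned above, here is a minimal sketch of an agent loop with a bounded step budget. Everything in it is an assumption made for the example: the `Action` tuple format, `run_agent`, `stub_policy`, and `add_tool` are hypothetical names, and the stub policy stands in for a real call to the model, whose tool-call format is not described in this card.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical action format: ("tool", tool_name, argument) or ("answer", text, "").
Action = Tuple[str, str, str]
Policy = Callable[[List[str]], Action]
Tool = Callable[[str], str]


def run_agent(
    policy: Policy, tools: Dict[str, Tool], task: str, max_steps: int = 4
) -> Tuple[str, int]:
    """Run the policy until it answers or the step budget is exhausted.

    Returns the final answer and the number of tool calls attempted.
    """
    history = [f"task: {task}"]
    tool_calls = 0
    for _ in range(max_steps):
        kind, name, arg = policy(history)
        if kind == "answer":
            return name, tool_calls
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        tool_calls += 1
        # Feed the observation back so the next decision can adapt to it.
        history.append(f"{name}({arg}) -> {observation}")
    return "no answer within step budget", tool_calls


def stub_policy(history: List[str]) -> Action:
    """Stand-in for the model: call one tool, then answer from its result."""
    if len(history) == 1:
        return ("tool", "add", "17,25")
    result = history[-1].split(" -> ")[1]
    return ("answer", result, "")


def add_tool(arg: str) -> str:
    """Toy tool that adds two comma-separated integers."""
    a, b = arg.split(",")
    return str(int(a) + int(b))


answer, calls = run_agent(stub_policy, {"add": add_tool}, "What is 17 + 25?")
```

In a real deployment, `stub_policy` would be replaced by a function that sends the history to the model and parses its response. The step budget and the tool-call counter make it easy to check that the agent calls tools selectively instead of on every turn.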