`y-ohtani/GRPO-TCR-Qwen3-4B-step800` is a 4-billion-parameter Qwen3-based model fine-tuned for deliberative agentic reasoning on math and coding problems. It uses Group Relative Policy Optimization with Tool Call Reward (GRPO-TCR) to learn when to call a `code_interpreter` tool across multiple turns, and is optimized for concise, accurate responses and efficient tool usage in complex problem-solving scenarios.
## Model Overview
This model, `y-ohtani/GRPO-TCR-Qwen3-4B-step800`, is a 4-billion-parameter Qwen3-based language model developed by y-ohtani. It has undergone a two-stage fine-tuning process: initial Supervised Fine-Tuning (SFT) as a multi-turn agentic cold start, followed by reinforcement learning with Group Relative Policy Optimization with Tool Call Reward (GRPO-TCR). Training leverages the Open-AgentRL framework and the DemyAgent methodology.
## Key Capabilities & Differentiators
- Deliberative Agentic Reasoning: Trained to perform multi-turn agentic reasoning with selective `code_interpreter` tool calls for math and coding problems.
- GRPO-TCR Enhancements: Incorporates five key enhancements over standard GRPO: multi-turn tool calling (up to 12 turns), a Tool Call Reward (TCR) that prevents exploration collapse, asymmetric clipping to encourage exploration, an overlong penalty for conciseness, and removal of the KL term for unconstrained exploration.
- Optimized for Tool Use: Reinforces correct final answers, rewards tool usage attempts, and penalizes verbose responses to encourage efficient problem-solving.
- Intermediate Checkpoint: This is an early checkpoint at step 800 out of 5,880 total steps, indicating ongoing training and potential for further improvement.
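The reward shaping described above can be sketched as follows. This is an illustrative, simplified sketch, not the actual GRPO-TCR implementation: the bonus and penalty values, the token budget, and the function names are all assumptions for demonstration.

```python
def tcr_reward(is_correct: bool, made_tool_call: bool,
               response_tokens: int, max_tokens: int = 4096,
               tool_bonus: float = 0.1, overlong_penalty: float = 0.5) -> float:
    """Toy reward in the spirit of GRPO-TCR (values assumed, not from the paper).

    - A correct final answer earns the main reward.
    - Any tool-call attempt earns a small bonus, so the policy does not
      collapse into never calling the tool during exploration.
    - Responses exceeding the length budget are penalized for verbosity.
    """
    reward = 1.0 if is_correct else 0.0
    if made_tool_call:
        reward += tool_bonus
    if response_tokens > max_tokens:
        reward -= overlong_penalty
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """The 'group relative' part of GRPO: normalize each rollout's reward
    against the mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this shaping, a correct answer that used the tool within budget scores higher than one that did not, which is how tool-usage attempts are reinforced.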
## Intended Use Cases
- Agentic Reasoning Tasks: Ideal for applications requiring an agent to solve complex math and coding problems by interacting with a `code_interpreter` tool.
- Multi-turn Problem Solving: Suited for scenarios where problems require iterative steps and tool interaction over multiple turns.
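A minimal sketch of the multi-turn tool-calling loop such an agent runs. Everything here is a stand-in: `model_step` replaces the actual LLM call, and the `exec`-based interpreter stands in for a sandboxed runtime like SandboxFusion (which a real deployment would use instead).

```python
import contextlib
import io

MAX_TURNS = 12  # the multi-turn budget stated in the model card


def run_code(snippet: str) -> str:
    """Stub code_interpreter: runs Python and captures stdout.
    NOT safe for untrusted code; a real deployment would sandbox this."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, {})
    return buf.getvalue().strip()


def agent_loop(model_step, problem: str) -> str:
    """Illustrative agent loop. `model_step` (a stand-in for the model)
    inspects the history and returns either ("tool", code_to_run) or
    ("answer", final_text)."""
    history = [problem]
    for _ in range(MAX_TURNS):
        kind, content = model_step(history)
        if kind == "answer":
            return content
        history.append(run_code(content))  # feed tool output back in
    return history[-1]  # turn budget exhausted; return last observation


# Scripted stand-in for the model: call the tool once, then answer.
def scripted_model(history):
    if len(history) == 1:
        return ("tool", "print(2**10)")
    return ("answer", f"The result is {history[-1]}")
```

Calling `agent_loop(scripted_model, "Compute 2**10")` returns `"The result is 1024"`, illustrating one tool round-trip followed by a final answer.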
## Limitations
- As an intermediate checkpoint (step 800 of 5,880), its training has not yet converged.
- Performance on non-math/non-coding tasks may be reduced compared to the base instruct model.
- Requires a compatible runtime (e.g., SandboxFusion) for tool calling functionality.