y-ohtani/GRPO-TCR-Qwen3-4B-step800
Text generation · Concurrency cost: 1 · Model size: 4B · Quantization: BF16 · Context length: 32k · Published: Feb 27, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

y-ohtani/GRPO-TCR-Qwen3-4B-step800 is a 4-billion-parameter, Qwen3-based model fine-tuned for deliberative agentic reasoning on math and coding problems. It uses Group Relative Policy Optimization with Tool Call Reward (GRPO-TCR) to learn when to call a `code_interpreter` tool across multiple turns, and is optimized for concise, accurate responses and efficient tool use in complex problem-solving scenarios.


Model Overview

This model, y-ohtani/GRPO-TCR-Qwen3-4B-step800, is a 4-billion-parameter Qwen3-based language model developed by y-ohtani. It has undergone a two-stage fine-tuning process: initial Supervised Fine-Tuning (SFT) as a multi-turn agentic cold start, followed by Group Relative Policy Optimization with Tool Call Reward (GRPO-TCR) for reinforcement learning. The training leverages the Open-AgentRL framework and the DemyAgent methodology.
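Agentic Qwen3 checkpoints are typically driven with OpenAI-style tool declarations. As a minimal sketch, a `code_interpreter` tool might be declared as below; the exact schema fields this checkpoint expects are an assumption, not confirmed by the card:

```python
# Hypothetical tool declaration for code_interpreter in the OpenAI-style
# function-calling schema commonly used with Qwen3-family models.
# The precise schema expected by this checkpoint is an assumption.
code_interpreter_tool = {
    "type": "function",
    "function": {
        "name": "code_interpreter",
        "description": "Execute a Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "Python source code to run.",
                }
            },
            "required": ["code"],
        },
    },
}

# A multi-turn agentic session starts from the user question and later
# carries tool results back into the message list.
messages = [
    {"role": "user", "content": "What is the 10th Fibonacci number?"},
]
```

Such a declaration would normally be passed to the chat template (e.g. via a `tools` argument) so the model can emit structured tool calls.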

Key Capabilities & Differentiators

  • Deliberative Agentic Reasoning: Specifically trained to perform multi-turn agentic reasoning, focusing on selective `code_interpreter` tool calls for math and coding problems.
  • GRPO-TCR Enhancements: Incorporates 5 key enhancements over standard GRPO, including Multi-turn tool calling (up to 12 turns), Tool Call Reward (TCR) to prevent exploration collapse, asymmetric clipping for exploration, an overlong penalty for conciseness, and KL removal for free exploration.
  • Optimized for Tool Use: Reinforces correct final answers, rewards tool usage attempts, and penalizes verbose responses to encourage efficient problem-solving.
  • Intermediate Checkpoint: This is an early checkpoint at step 800 out of 5,880 total steps, indicating ongoing training and potential for further improvement.
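The reward shaping described above (reinforce correct answers, reward tool-usage attempts, penalize overlong outputs) can be sketched as a single scalar function. The coefficients and functional form below are illustrative assumptions, not the values used in training:

```python
def tcr_reward(correct: bool, used_tool: bool, n_tokens: int,
               max_tokens: int = 4096,
               tool_bonus: float = 0.1,
               overlong_coeff: float = 0.5) -> float:
    """Illustrative GRPO-TCR-style reward: correctness dominates, a small
    Tool Call Reward keeps tool use from collapsing during exploration,
    and an overlong penalty encourages concise rollouts.
    All coefficients here are assumptions, not the trained values."""
    r = 1.0 if correct else 0.0
    if used_tool:
        r += tool_bonus  # TCR: reward the attempt to call the tool
    if n_tokens > max_tokens:
        # Linear overlong penalty, scaled by how far past the budget we are.
        r -= overlong_coeff * (n_tokens - max_tokens) / max_tokens
    return r
```

For example, a correct, tool-using, in-budget rollout scores 1.1 under these made-up coefficients, while a correct but doubly-overlong rollout is pulled back down toward 0.5.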

Intended Use Cases

  • Agentic Reasoning Tasks: Ideal for applications requiring an agent to solve complex math and coding problems by interacting with a `code_interpreter` tool.
  • Multi-turn Problem Solving: Suited for scenarios where problems require iterative steps and tool interaction over multiple turns.
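The multi-turn, tool-interacting loop described above can be sketched as follows. `generate_step` stands in for the model, and the interpreter is stubbed with `exec` (a real deployment would use a sandboxed runtime such as SandboxFusion); every function name here is an illustrative assumption:

```python
import contextlib
import io


def run_code(code: str) -> str:
    """Stub code_interpreter: capture stdout of exec'd Python.
    A real deployment would execute this in a sandboxed runtime."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()


def agent_loop(generate_step, question: str, max_turns: int = 12):
    """Iterate model turns, executing tool calls until a final answer.
    max_turns=12 mirrors the turn cap mentioned in the card."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = generate_step(messages)  # model decides: tool call or answer
        if step["type"] == "tool_call":
            result = run_code(step["code"])
            messages.append({"role": "tool", "content": result})
        else:
            return step["content"]
    return None  # turn budget exhausted without a final answer
```

A scripted stand-in for the model shows the control flow: on the first turn it emits a tool call, then it answers with the tool's result.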

Limitations

  • As an intermediate checkpoint, its training is not yet fully converged.
  • Performance on non-math/non-coding tasks may be reduced compared to the base instruct model.
  • Requires a compatible runtime (e.g., SandboxFusion) for tool calling functionality.