Overview
y-ohtani/GRPO-TCR-Qwen3-4B-test is a 4-billion-parameter model based on Qwen3-4B-Instruct-2507, developed by y-ohtani. It was fine-tuned in two stages: Supervised Fine-Tuning (SFT) for a multi-turn agentic cold start, followed by reinforcement learning with Group Relative Policy Optimization with Tool Call Reward (GRPO-TCR). The model is trained for deliberative agentic reasoning: selectively invoking a code_interpreter tool to solve math and coding problems efficiently across multiple turns.
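The multi-turn agentic pattern described above can be sketched as a simple loop: the model either emits a code_interpreter tool call, whose output is fed back into the conversation, or a final answer. This is an illustrative sketch, not the model's actual runtime: `stub_model` stands in for the fine-tuned model, and `run_code_interpreter` is a toy executor (a real deployment would use a sandboxed backend such as SandboxFusion).

```python
MAX_TURNS = 16  # the model supports agentic reasoning across up to 16 turns

def stub_model(messages):
    """Hypothetical stand-in for the fine-tuned model.

    Emits a tool call on the first turn, then a final answer once a
    tool result is present in the conversation.
    """
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "The answer is 4."}
    return {
        "role": "assistant",
        "tool_call": {"name": "code_interpreter",
                      "arguments": {"code": "print(2 + 2)"}},
    }

def run_code_interpreter(code):
    # Toy executor for illustration only; use a real sandbox in practice.
    import io, contextlib
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

def agent_loop(question):
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_TURNS):
        reply = stub_model(messages)
        if "tool_call" in reply:
            result = run_code_interpreter(reply["tool_call"]["arguments"]["code"])
            messages.append(reply)
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"]
    return None  # turn budget exhausted

print(agent_loop("What is 2 + 2?"))  # The answer is 4.
```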
Key Capabilities & Training Objectives
- Deliberative Agentic Reasoning: Trained to selectively call a code_interpreter tool for problem-solving, avoiding verbose self-reasoning.
- Reinforced Tool Usage: The GRPO-TCR stage rewards correct final answers and incentivizes tool-usage attempts, even when initial answers are incorrect, to prevent exploration collapse.
- Concise Responses: An overlong penalty is applied during training to suppress verbose outputs and encourage efficient tool utilization.
- Multi-turn Tool Calling: Supports agentic reasoning across up to 16 turns.
- Enhanced GRPO: Incorporates 5 key enhancements over standard GRPO, including asymmetric clipping for exploration, KL removal for free exploration, and the Tool Call Reward (TCR).
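The reward shaping described above (Tool Call Reward plus overlong penalty) might look something like the following sketch. All coefficients, thresholds, and the linear penalty shape are illustrative assumptions, not values taken from this model's training configuration.

```python
def tcr_reward(is_correct, used_tool, num_tokens,
               max_tokens=2048, tool_bonus=0.1, overlong_coef=0.5):
    """Illustrative GRPO-TCR-style reward (assumed coefficients).

    - Correct final answers earn the full reward.
    - Incorrect rollouts that attempted a tool call earn a small bonus,
      keeping exploration alive instead of collapsing to no tool use.
    - Outputs longer than the budget are penalized to encourage concision.
    """
    reward = 1.0 if is_correct else 0.0
    if not is_correct and used_tool:
        reward += tool_bonus  # Tool Call Reward: prevent exploration collapse
    if num_tokens > max_tokens:
        # overlong penalty grows linearly with the excess length, capped at 1x
        overshoot = (num_tokens - max_tokens) / max_tokens
        reward -= overlong_coef * min(overshoot, 1.0)
    return reward

print(tcr_reward(True, True, 500))     # correct and concise -> 1.0
print(tcr_reward(False, True, 500))    # wrong, but tool was tried -> 0.1
print(tcr_reward(False, False, 4096))  # wrong, no tool, overlong -> -0.5
```

In GRPO these rewards would be compared within a group of rollouts for the same prompt, so even the small tool bonus creates a usable advantage signal between rollouts that tried the tool and those that did not.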
Limitations
This model is explicitly a test run with minimal data (8 train / 4 test samples, 6 steps) intended only to validate the training pipeline configuration. It is not intended for production or evaluation use, and it requires a runtime compatible with tool calling (e.g., SandboxFusion as the code_interpreter backend).