Overview
y-ohtani/GRPO-TCR-Qwen3-4B-test is a 4-billion-parameter model based on Qwen3-4B-Instruct-2507, developed by y-ohtani. It was fine-tuned in two stages: Supervised Fine-Tuning (SFT) for a multi-turn agentic cold start, followed by reinforcement learning with Group Relative Policy Optimization with Tool Call Reward (GRPO-TCR). The model is trained for deliberative agentic reasoning: selectively invoking a code_interpreter tool to solve math and coding problems efficiently across multiple turns.
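The multi-turn agentic pattern described above can be sketched as a simple loop: the model either emits a code_interpreter tool call, whose output is fed back into the conversation, or a final answer. This is an illustrative sketch, not the model's actual runtime: `stub_model` stands in for the fine-tuned model, and `run_code_interpreter` is a toy executor (a real deployment would use a sandboxed backend such as SandboxFusion).

```python
MAX_TURNS = 16  # the model supports agentic reasoning across up to 16 turns

def stub_model(messages):
    """Hypothetical stand-in for the fine-tuned model.

    Emits a tool call on the first turn, then a final answer once a
    tool result is present in the conversation.
    """
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "The answer is 4."}
    return {
        "role": "assistant",
        "tool_call": {"name": "code_interpreter",
                      "arguments": {"code": "print(2 + 2)"}},
    }

def run_code_interpreter(code):
    # Toy executor for illustration only; use a real sandbox in practice.
    import io, contextlib
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

def agent_loop(question):
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_TURNS):
        reply = stub_model(messages)
        if "tool_call" in reply:
            result = run_code_interpreter(reply["tool_call"]["arguments"]["code"])
            messages.append(reply)
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"]
    return None  # turn budget exhausted

print(agent_loop("What is 2 + 2?"))  # The answer is 4.
```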
Key Capabilities & Training Objectives
- Deliberative Agentic Reasoning: Trained to selectively call a code_interpreter tool for problem-solving, avoiding verbose self-reasoning.
- Reinforced Tool Usage: The GRPO-TCR stage rewards correct final answers and incentivizes tool-usage attempts, even when initial answers are incorrect, to prevent exploration collapse.
- Concise Responses: An overlong penalty is applied during training to suppress verbose outputs and encourage efficient tool utilization.
- Multi-turn Tool Calling: Supports agentic reasoning across up to 16 turns.
- Enhanced GRPO: Incorporates 5 key enhancements over standard GRPO, including asymmetric clipping for exploration, KL removal for free exploration, and the Tool Call Reward (TCR).
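The reward shaping described above (Tool Call Reward plus overlong penalty) might look something like the following sketch. All coefficients, thresholds, and the linear penalty shape are illustrative assumptions, not values taken from this model's training configuration.

```python
def tcr_reward(is_correct, used_tool, num_tokens,
               max_tokens=2048, tool_bonus=0.1, overlong_coef=0.5):
    """Illustrative GRPO-TCR-style reward (assumed coefficients).

    - Correct final answers earn the full reward.
    - Incorrect rollouts that attempted a tool call earn a small bonus,
      keeping exploration alive instead of collapsing to no tool use.
    - Outputs longer than the budget are penalized to encourage concision.
    """
    reward = 1.0 if is_correct else 0.0
    if not is_correct and used_tool:
        reward += tool_bonus  # Tool Call Reward: prevent exploration collapse
    if num_tokens > max_tokens:
        # overlong penalty grows linearly with the excess length, capped at 1x
        overshoot = (num_tokens - max_tokens) / max_tokens
        reward -= overlong_coef * min(overshoot, 1.0)
    return reward

print(tcr_reward(True, True, 500))     # correct and concise -> 1.0
print(tcr_reward(False, True, 500))    # wrong, but tool was tried -> 0.1
print(tcr_reward(False, False, 4096))  # wrong, no tool, overlong -> -0.5
```

In GRPO these rewards would be compared within a group of rollouts for the same prompt, so even the small tool bonus creates a usable advantage signal between rollouts that tried the tool and those that did not.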
Limitations
This model is explicitly a test run with minimal data (8 train / 4 test samples, 6 steps) intended only to validate the training pipeline configuration. It is not intended for production or evaluation use, and it requires a runtime compatible with tool calling (e.g., SandboxFusion as the code_interpreter backend).