`y-ohtani/GRPO-TCR-Qwen3-4B-step800` is a 4-billion-parameter Qwen3-based model fine-tuned for deliberative agentic reasoning on math and coding problems. It uses Group Relative Policy Optimization with Tool Call Reward (GRPO-TCR) to learn when to call a `code_interpreter` tool across multiple turns, and is optimized for concise, accurate responses and efficient tool usage in complex problem-solving scenarios.
## Model Overview
This model, `y-ohtani/GRPO-TCR-Qwen3-4B-step800`, is a 4-billion-parameter Qwen3-based language model developed by y-ohtani. It has undergone a two-stage fine-tuning process: initial Supervised Fine-Tuning (SFT) as a multi-turn agentic cold start, followed by reinforcement learning with Group Relative Policy Optimization with Tool Call Reward (GRPO-TCR). Training leverages the Open-AgentRL framework and the DemyAgent methodology.
## Key Capabilities & Differentiators
- Deliberative Agentic Reasoning: Trained to perform multi-turn agentic reasoning with selective `code_interpreter` tool calls for math and coding problems.
- GRPO-TCR Enhancements: Incorporates five key enhancements over standard GRPO: multi-turn tool calling (up to 12 turns), a Tool Call Reward (TCR) that prevents exploration collapse, asymmetric clipping to encourage exploration, an overlong penalty for conciseness, and removal of the KL term for unconstrained exploration.
- Optimized for Tool Use: Reinforces correct final answers, rewards tool usage attempts, and penalizes verbose responses to encourage efficient problem-solving.
- Intermediate Checkpoint: This is an early checkpoint at step 800 out of 5,880 total steps, indicating ongoing training and potential for further improvement.
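The reward shaping described above can be sketched as follows. This is an illustrative, simplified sketch, not the actual GRPO-TCR implementation: the bonus and penalty values, the token budget, and the function names are all assumptions for demonstration.

```python
def tcr_reward(is_correct: bool, made_tool_call: bool,
               response_tokens: int, max_tokens: int = 4096,
               tool_bonus: float = 0.1, overlong_penalty: float = 0.5) -> float:
    """Toy reward in the spirit of GRPO-TCR (values assumed, not from the paper).

    - A correct final answer earns the main reward.
    - Any tool-call attempt earns a small bonus, so the policy does not
      collapse into never calling the tool during exploration.
    - Responses exceeding the length budget are penalized for verbosity.
    """
    reward = 1.0 if is_correct else 0.0
    if made_tool_call:
        reward += tool_bonus
    if response_tokens > max_tokens:
        reward -= overlong_penalty
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """The 'group relative' part of GRPO: normalize each rollout's reward
    against the mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this shaping, a correct answer that used the tool within budget scores higher than one that did not, which is how tool-usage attempts are reinforced.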
## Intended Use Cases
- Agentic Reasoning Tasks: Ideal for applications requiring an agent to solve complex math and coding problems by interacting with a `code_interpreter` tool.
- Multi-turn Problem Solving: Suited for scenarios where problems require iterative steps and tool interaction over multiple turns.
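A minimal sketch of the multi-turn tool-calling loop such an agent runs. Everything here is a stand-in: `model_step` replaces the actual LLM call, and the `exec`-based interpreter stands in for a sandboxed runtime like SandboxFusion (which a real deployment would use instead).

```python
import contextlib
import io

MAX_TURNS = 12  # the multi-turn budget stated in the model card


def run_code(snippet: str) -> str:
    """Stub code_interpreter: runs Python and captures stdout.
    NOT safe for untrusted code; a real deployment would sandbox this."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, {})
    return buf.getvalue().strip()


def agent_loop(model_step, problem: str) -> str:
    """Illustrative agent loop. `model_step` (a stand-in for the model)
    inspects the history and returns either ("tool", code_to_run) or
    ("answer", final_text)."""
    history = [problem]
    for _ in range(MAX_TURNS):
        kind, content = model_step(history)
        if kind == "answer":
            return content
        history.append(run_code(content))  # feed tool output back in
    return history[-1]  # turn budget exhausted; return last observation


# Scripted stand-in for the model: call the tool once, then answer.
def scripted_model(history):
    if len(history) == 1:
        return ("tool", "print(2**10)")
    return ("answer", f"The result is {history[-1]}")
```

Calling `agent_loop(scripted_model, "Compute 2**10")` returns `"The result is 1024"`, illustrating one tool round-trip followed by a final answer.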
## Limitations
- As an intermediate checkpoint (step 800 of 5,880), its training has not yet converged.
- Performance on non-math/non-coding tasks may be reduced compared to the base instruct model.
- Requires a compatible runtime (e.g., SandboxFusion) for tool calling functionality.