y-ohtani/qwen3-4b-ra-sft-epoch3

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 19, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

The y-ohtani/qwen3-4b-ra-sft-epoch3 is a 4 billion parameter Qwen3-based model, full fine-tuned by y-ohtani, specifically designed for multi-turn agentic reasoning with tool use. It excels at iteratively solving mathematical and coding problems by calling a code interpreter. This model is an intermediate checkpoint, optimized for agentic loops like Think-Code-Execute-Observe-Answer, and serves as a cold-start for subsequent reinforcement learning.

Loading preview...

Overview

This model, y-ohtani/qwen3-4b-ra-sft-epoch3, is a 4 billion parameter Qwen3-based model that has undergone full fine-tuning (not LoRA) using the Open-AgentRL framework. It is the third epoch checkpoint from a total of 10 training epochs.

Key Capabilities

  • Multi-turn Agentic Reasoning: Specifically trained to handle complex problems requiring multiple interaction turns.
  • Tool Use: Proficient in utilizing a code_interpreter tool to solve mathematical and coding challenges.
  • Agentic Loop Learning: Optimized to learn the full agentic process: Think → Code → Execute → Observe → Answer, by applying loss to all assistant turns.
  • Foundation for RL: Designed as a "cold-start" model for further reinforcement learning stages, such as GRPO.

Training Details

The model was fine-tuned from Qwen/Qwen3-4B-Instruct-2507 with a maximum sequence length of 32,768 tokens. It was trained on 2,000 multi-turn conversations from the y-ohtani/open_agentrl_like_sft dataset, which is derived from swordfaith/ReTool-SFT-multi-turn and focuses on mathematical reasoning with a code interpreter. All training data is Apache-2.0 licensed.

Intended Use & Limitations

  • Intended: Primarily for agentic reasoning tasks involving tool use, particularly in math and coding. It is an intermediate checkpoint for further RL training.
  • Not Intended: For production deployment without additional evaluation or for tasks outside of its specialized domain, as performance on non-math/non-coding tasks may be degraded compared to the base instruct model.