Overview
This model, y-ohtani/qwen3-4b-ra-sft-epoch3, is a 4-billion-parameter Qwen3-based model that has undergone full fine-tuning (not LoRA) using the Open-AgentRL framework. It is the checkpoint saved after the third of 10 training epochs.
Key Capabilities
- Multi-turn Agentic Reasoning: Specifically trained to handle complex problems requiring multiple interaction turns.
- Tool Use: Proficient in utilizing a code_interpreter tool to solve mathematical and coding challenges.
- Agentic Loop Learning: Optimized to learn the full agentic process, Think → Code → Execute → Observe → Answer, by applying loss to all assistant turns.
- Foundation for RL: Designed as a "cold-start" model for further reinforcement learning stages, such as GRPO.
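The Think → Code → Execute → Observe → Answer loop above can be sketched as follows. This is an illustrative reconstruction, not the Open-AgentRL implementation: the model call is replaced by a scripted stand-in, and `run_code` is a hypothetical, minimal code_interpreter tool (in practice the model would be queried through its chat template and the interpreter would be sandboxed).

```python
import contextlib
import io

def run_code(code: str) -> str:
    """Hypothetical code_interpreter tool: run Python, capture stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def scripted_model(history):
    """Stand-in for the fine-tuned model: emits a think/code turn first,
    then a final answer once a tool observation is in the history."""
    if not any(turn["role"] == "tool" for turn in history):
        return {"role": "assistant",
                "think": "Compute 17 * 23 with the interpreter.",
                "code": "print(17 * 23)"}
    obs = [t for t in history if t["role"] == "tool"][-1]["content"]
    return {"role": "assistant", "answer": f"The result is {obs}."}

def agentic_loop(question: str, max_turns: int = 4) -> str:
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        turn = scripted_model(history)       # Think (+ Code or Answer)
        history.append(turn)
        if "answer" in turn:                 # Answer ends the loop
            return turn["answer"]
        observation = run_code(turn["code"])          # Execute
        history.append({"role": "tool", "content": observation})  # Observe
    raise RuntimeError("no answer within turn budget")

print(agentic_loop("What is 17 * 23?"))  # → The result is 391.
```

Because loss is applied to all assistant turns, the model is supervised on both the intermediate code turn and the final answer turn of trajectories shaped like this one.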
Training Details
The model was fine-tuned from Qwen/Qwen3-4B-Instruct-2507 with a maximum sequence length of 32,768 tokens. It was trained on 2,000 multi-turn conversations from the y-ohtani/open_agentrl_like_sft dataset, which is derived from swordfaith/ReTool-SFT-multi-turn and focuses on mathematical reasoning with a code interpreter. All training data is Apache-2.0 licensed.
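"Applying loss to all assistant turns" can be sketched as a label mask over the packed conversation: assistant tokens keep their ids as labels, while user and tool-output tokens are masked with the conventional ignore index `-100`. This is a hedged sketch of the general multi-turn SFT technique, not the actual Open-AgentRL code; the whitespace tokenizer is a toy stand-in.

```python
# Illustrative multi-turn loss masking: supervise every assistant turn,
# mask user and tool-output tokens with -100 (the ignore index used by
# typical cross-entropy implementations).
IGNORE = -100

def build_labels(conversation, tokenize):
    """conversation: list of (role, text); tokenize: text -> token ids."""
    input_ids, labels = [], []
    for role, text in conversation:
        ids = tokenize(text)
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                  # loss on every assistant turn
        else:
            labels.extend([IGNORE] * len(ids))  # no loss on user/tool tokens
    return input_ids, labels

# Toy whitespace "tokenizer" for demonstration only.
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

conv = [
    ("user", "solve 2 + 2"),           # 4 tokens, masked
    ("assistant", "print ( 2 + 2 )"),  # code turn: supervised
    ("tool", "4"),                     # interpreter output: masked
    ("assistant", "the answer is 4"),  # final answer: supervised
]
ids, labels = build_labels(conv, toy_tokenize)
print(len(ids), labels.count(IGNORE))  # → 15 5
```

Masking the tool-output tokens matters here: the model should learn to produce code and answers conditioned on interpreter output, not to imitate the interpreter itself.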
Intended Use & Limitations
- Intended: Primarily for agentic reasoning tasks involving tool use, particularly in math and coding. It is an intermediate checkpoint for further RL training.
- Not Intended: For production deployment without additional evaluation or for tasks outside of its specialized domain, as performance on non-math/non-coding tasks may be degraded compared to the base instruct model.