open-thoughts/OpenThinkerAgent-8B-RL
OpenThinkerAgent-8B-RL by OpenThoughts is an 8-billion-parameter, Qwen3-based model trained with supervised fine-tuning (SFT) followed by reinforcement learning (RL). It is designed as an agentic coding model that uses tools inside a sandboxed environment to solve software engineering problems. It has a 40,960-token context length and was optimized for agentic behavior through on-policy RL on a 5,000-task set.
OpenThinkerAgent-8B-RL: An Agentic Coding Model
OpenThinkerAgent-8B-RL is an 8-billion-parameter model developed by OpenThoughts and the final, RL-trained checkpoint in their SFT→RL recipe for agentic models. Built on the Qwen3-8B architecture, it was first trained with supervised fine-tuning (SFT) on the OpenThoughts-Agent-SFT-ColdStartForRL-10K dataset, then further trained with on-policy reinforcement learning (RL) on the OpenThoughts-Agent-RL-5K task set; this checkpoint corresponds to RL step 45.
Key Capabilities
- Agentic Coding: Designed to operate as a tool-using agent, capable of issuing shell commands and edits, and reasoning over terminal output to solve software engineering tasks.
- Qwen3 Architecture: Inherits general language capabilities from its Qwen3-8B base, featuring 36 layers, a hidden size of 4096, and a 40,960-token context length.
- RL-Optimized: Optimized for agentic behavior through on-policy RL training that uses RLOO-n advantage estimation and PPO clipping.
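The two RL ingredients named above can be illustrated with a small, dependency-free sketch. It assumes the common leave-one-out formulation of RLOO, in which each of the n rollouts sampled for a task is baselined against the mean reward of the other n-1, together with the standard PPO clipped surrogate. The function names, the `eps=0.2` clip range, and the sample rewards are illustrative assumptions; the exact RLOO-n variant and hyperparameters used to train this checkpoint are not specified here.

```python
import math


def rloo_advantages(rewards):
    """Leave-one-out advantages for n rollouts of the same task."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per task")
    total = sum(rewards)
    # Baseline for rollout i is the mean reward of the other n-1 rollouts.
    return [r - (total - r) / (n - 1) for r in rewards]


def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] in the direction the advantage favors.
    return min(ratio * advantage, clipped * advantage)


# Four rollouts of one task with made-up pass/fail rewards.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

With this baseline, successful rollouts get positive advantages and failed ones negative, and the advantages for a task sum to zero.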
Good For
- Software Engineering Tasks: Ideal for applications requiring an AI agent to interact with development environments, execute code, and debug.
- Tool-Using Agents: Suitable for integration into systems where the model needs to leverage external tools and interpret their outputs.
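A tool-using integration of the kind described above typically alternates between a command proposed by the model and the output returned by the environment. The following is a minimal sketch of that loop, not the harness used by OpenThoughts: `next_command`, `run_in_sandbox`, the `"DONE"` stop signal, and the character cap are all hypothetical names and choices, and the model and sandbox are replaced by stubs so the example runs on its own.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (command, output)


def run_agent(
    task: str,
    next_command: Callable[[List[Step]], str],
    run_in_sandbox: Callable[[str], str],
    max_steps: int = 8,
    max_output_chars: int = 2000,
) -> List[Step]:
    """Alternate between model-proposed commands and sandbox output."""
    history: List[Step] = [("task", task)]
    for _ in range(max_steps):
        command = next_command(history).strip()
        if command == "DONE":  # hypothetical stop signal
            break
        output = run_in_sandbox(command)
        # Keep only the tail of long output; a character cap is a crude
        # stand-in for a real token budget against the context window.
        history.append((command, output[-max_output_chars:]))
    return history


# Stubs standing in for the model and the sandbox.
def scripted_policy(history: List[Step]) -> str:
    script = ["ls", "cat README.md", "DONE"]
    return script[len(history) - 1]


def fake_sandbox(command: str) -> str:
    return f"(output of {command})"


transcript = run_agent("inspect the repo", scripted_policy, fake_sandbox)
```

In a real deployment, `next_command` would wrap a call to the model and `run_in_sandbox` would execute inside an isolated container or VM; both are injected here so the loop itself stays independent of any particular serving stack.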
Although the model is designed for agentic coding, its outputs (including shell commands) may require review and should be executed in sandboxed environments. Evaluation results for this 8B RL checkpoint are pending.
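One way to approach the review step is a conservative pre-execution check that routes anything unusual to a human. The sketch below is an assumption-laden heuristic rather than a security boundary: the allowlist contents and the `needs_review` name are hypothetical, and sandbox isolation is still needed for commands that pass.

```python
import shlex

# Hypothetical allowlist; adjust to the tools your environment exposes.
ALLOWED_PROGRAMS = {"ls", "cat", "grep", "pytest", "git"}
# Characters that enable chaining, redirection, or substitution.
SHELL_METACHARACTERS = ";&|<>`$\n"


def needs_review(command: str) -> bool:
    """Return True when a command should be held for human review."""
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return True
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return True
    return not tokens or tokens[0] not in ALLOWED_PROGRAMS
```

For example, `needs_review("ls -la")` returns `False`, while `needs_review("rm -rf /")` and `needs_review("cat a.txt | sh")` both return `True`.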