# MUA-RL-8B: Multi-Turn Agentic Tool Use Model
zzwkk/MUA-RL-8B is an 8-billion-parameter model engineered specifically for multi-turn user-interacting agent reinforcement learning (RL). Its core innovation is managing complex, multi-turn conversations while proficiently using external tools to achieve user goals. The model is distinguished as the first framework to incorporate LLM-simulated users directly into its RL training loop, allowing it to autonomously learn efficient communication strategies and tool utilization.
## Key Capabilities & Features
- Multi-Turn Context Management: Designed to maintain conversational context over extended interactions.
- Agentic Tool Use: Excels at integrating and utilizing various tools to solve practical problems.
- Autonomous Learning: Leverages LLM-simulated users (specifically GPT-4o-2024-11-20) within its RL training loop, optimized with Group Relative Policy Optimization (GRPO), to continuously improve its interaction and tool-use capabilities.
- Competitive Performance: Despite its 8B parameter size, MUA-RL-8B shows competitive performance on multi-turn tool-using benchmarks (e.g., TAU2, BFCL-V3, ACEBench Agent) when compared to larger open-source models like DeepSeek-V3-0324 and Qwen3-32B in non-thinking settings.
- 32K Context Length: Supports a substantial context window for processing longer interactions.
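To make the multi-turn tool-use pattern above concrete, here is a minimal sketch of an agent loop that dispatches tool calls until the model produces a final reply. The `call_model` function, the tool registry, and the message format are illustrative assumptions standing in for the real model API, not MUA-RL-8B's actual interface.

```python
# Minimal sketch of a multi-turn agent loop with tool dispatch.
# `call_model`, TOOLS, and the message schema are hypothetical
# stand-ins; a real implementation would query MUA-RL-8B.

TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}

def call_model(messages):
    # Stub: pretend the model requests a tool on the first pass,
    # then answers the user once a tool result is in context.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "lookup_order", "args": {"order_id": "A123"}}
    return {"content": "Your order A123 has shipped."}

def run_turn(messages):
    """Run one agent step, dispatching tool calls until a final reply."""
    while True:
        reply = call_model(messages)
        if "tool" in reply:
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": str(result)})
        else:
            messages.append({"role": "assistant", "content": reply["content"]})
            return reply["content"]

history = [{"role": "user", "content": "Where is my order A123?"}]
answer = run_turn(history)
```

The point of the sketch is the loop structure: the full message history (user turns, tool results, assistant replies) is carried across iterations, which is exactly what the 32K context window is sized for.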
## Good For
- Developing sophisticated conversational agents that require memory and tool-use capabilities.
- Applications needing autonomous problem-solving in dynamic, interactive environments.
- Research into reinforcement learning for agentic systems and user simulation in training.
- Building agents that can handle complex, multi-step tasks requiring external information or actions.