MUA-RL-14B: Multi-Turn Agentic Tool Use Model
MUA-RL-14B is a 14-billion-parameter model with a 32K context length, developed by zzwkk and designed for advanced agentic tool use in multi-turn conversational settings. It is notable for being the first framework to integrate an LLM-simulated user (GPT-4o-2024-11-20) directly into its reinforcement learning (RL) loop. Trained with Group Relative Policy Optimization (GRPO), the model autonomously learns to communicate efficiently with users and to use tools effectively in complex, dynamic interactions.
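To make the GRPO idea concrete, here is a minimal sketch of its core step, computing group-relative advantages. This is an illustrative simplification, not MUA-RL's actual training code: a group of rollouts is sampled for the same task, each rollout receives a scalar reward (e.g. from the simulated-user interaction), and each advantage is the reward standardized within its group, so no learned value critic is needed.

```python
# Illustrative sketch of GRPO-style group-relative advantages.
# Reward values and function names are hypothetical, not the MUA-RL API.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its group's mean/std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts of one tool-use task, scored by task success.
# Successful rollouts get positive advantages, failed ones negative.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

These advantages then weight the policy-gradient update for each rollout's tokens, which is how the model is pushed toward interaction strategies the simulated user rewards.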
Key Capabilities
- Multi-Turn Conversation Management: Maintains context and facilitates sustained interaction across multiple turns.
- Agentic Tool Use: Seamlessly integrates and utilizes various tools to solve practical problems.
- Autonomous Learning: Learns to communicate and use tools efficiently through a novel RL framework with simulated users.
- Competitive Performance: Achieves strong results on multi-turn tool-using benchmarks such as TAU2 Retail, TAU2 Airline, TAU2 Telecom, BFCL-V3 Multi Turn, and ACEBench Agent. The 14B model demonstrates performance competitive with or superior to larger open-source models like DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.
Good For
- Developing AI agents that require persistent context and tool interaction over extended conversations.
- Applications demanding robust performance in complex, multi-step problem-solving scenarios.
- Research into reinforcement learning for agentic systems and user-simulated training environments.
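The interaction pattern these use cases share (an agent alternating between user turns and tool calls while keeping a running history) can be sketched as a simple episode loop. All names below are hypothetical stand-ins for illustration, not the MUA-RL API; in the actual framework the user role is played by an LLM simulator rather than a stub.

```python
# Hypothetical sketch of a multi-turn agent/user/tool episode loop.
# Function and tool names are illustrative, not from MUA-RL.

def run_episode(agent, user, tools, max_turns=8):
    """Alternate user and agent turns, executing tool calls the agent emits."""
    history = []
    for _ in range(max_turns):
        user_msg = user(history)               # simulated-user turn
        history.append(("user", user_msg))
        if user_msg == "<done>":               # user signals task completion
            break
        action = agent(history)                # agent replies or calls a tool
        if action.startswith("call:"):         # e.g. "call:lookup_order"
            result = tools[action[len("call:"):]]()
            history.append(("tool", result))
        else:
            history.append(("assistant", action))
    return history

# Toy stubs: the user asks once, the agent calls a tool, the user finishes.
def user(history):
    return "where is my order?" if not history else "<done>"

def agent(history):
    return "call:lookup_order"

tools = {"lookup_order": lambda: "status: shipped"}
history = run_episode(agent, user, tools)
```

During RL training, the final `history` would be scored (e.g. by task success) to produce the rewards that drive the GRPO update.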
For more details, refer to the research paper: MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for Agentic Tool Use