# MUA-RL-32B: Multi-Turn Agentic Tool Use Model
MUA-RL-32B is a 32-billion-parameter model engineered for agentic tool use in complex, multi-turn conversational scenarios. Developed by zzwkk, it introduces a framework that integrates LLM-simulated users directly into the reinforcement learning (RL) loop, so the model autonomously learns to communicate effectively with users and to invoke tools to solve their problems.
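The training setup described above can be sketched as a rollout loop in which a simulated user, the agent policy, and a tool environment take turns, with a terminal reward feeding the GRPO update. Everything below (function names, the canned user script, the reward rule) is an illustrative stand-in, not the authors' actual training code:

```python
# Illustrative sketch of a multi-turn RL rollout with an LLM-simulated user.
# All names (simulate_user, agent_policy, run_tool) are hypothetical stand-ins.

def simulate_user(history):
    # In MUA-RL this role is played by an LLM (e.g. GPT-4o); a canned
    # script keeps the sketch runnable without any model.
    script = ["I need to return order #123.", "Yes, please process it.", "DONE"]
    turn = sum(1 for m in history if m["role"] == "user")
    return script[min(turn, len(script) - 1)]

def agent_policy(history):
    # Stand-in for the policy model: either call a tool or reply in text.
    last = history[-1]["content"]
    if "return order" in last:
        return {"role": "assistant", "content": "",
                "tool_call": {"name": "lookup_order", "args": {"id": "123"}}}
    return {"role": "assistant", "content": "Your return has been processed."}

def run_tool(call):
    # Stand-in tool environment returning a JSON-like observation.
    return {"role": "tool", "name": call["name"], "content": '{"status": "eligible"}'}

def rollout(max_turns=6):
    history = [{"role": "user", "content": simulate_user([])}]
    while len(history) < max_turns:
        action = agent_policy(history)
        history.append(action)
        if "tool_call" in action:
            history.append(run_tool(action["tool_call"]))
        else:
            user_msg = simulate_user(history)
            if user_msg == "DONE":
                break
            history.append({"role": "user", "content": user_msg})
    # A terminal task-completion reward like this would drive GRPO's
    # group-relative advantage estimate across sampled rollouts.
    reward = 1.0 if any(m["role"] == "tool" for m in history) else 0.0
    return history, reward
```

Because the simulated user is inside the loop, the policy is rewarded for the whole conversation, not for single-turn tool calls in isolation.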
## Key Capabilities & Features
- Multi-Turn Interaction: Designed to maintain context and effectively utilize tools across extended conversations.
- Autonomous Learning: Employs Group Relative Policy Optimization (GRPO) with LLM-simulated users (e.g., GPT-4o) for self-improvement in tool-using tasks.
- Agentic Tool Use: Seamlessly handles tool calling and response processing to complete complex tasks.
- Competitive Performance: Achieves strong results on benchmarks like TAU2 Retail, TAU2 Airline, BFCL-V3 Multi Turn, and ACEBench Agent, often matching or exceeding the performance of larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.
- 32K Context Length: Supports extensive conversational history and complex task instructions.
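In practice, driving a tool-using model like this follows the standard agentic loop: send the conversation plus tool schemas, execute any tool the model requests, append the result as a `tool` message, and repeat until it answers in plain text. A minimal sketch with the model call stubbed out (a real deployment would send `call_model`'s request to an inference endpoint serving MUA-RL-32B; the tool and helper names are assumptions):

```python
import json

# OpenAI-style tool schema passed to the model alongside the conversation.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def get_order_status(order_id: str) -> str:
    # Stub backend; a real agent would query an order system here.
    return json.dumps({"order_id": order_id, "status": "shipped"})

def call_model(messages, tools):
    # Stub standing in for a chat-completions request to a server hosting
    # MUA-RL-32B: it emits a tool call first, then answers once the tool
    # result is in context.
    if messages[-1]["role"] == "tool":
        status = json.loads(messages[-1]["content"])["status"]
        return {"role": "assistant", "content": f"Your order has {status}."}
    return {"role": "assistant", "content": "", "tool_calls": [{
        "id": "call_1",
        "function": {"name": "get_order_status",
                     "arguments": json.dumps({"order_id": "A-42"})},
    }]}

def run_turn(messages):
    # One agentic step: let the model act, executing tools until it answers.
    while True:
        reply = call_model(messages, TOOLS)
        messages.append(reply)
        if "tool_calls" not in reply:
            return messages
        for call in reply["tool_calls"]:
            args = json.loads(call["function"]["arguments"])
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": get_order_status(**args)})

messages = [{"role": "user", "content": "Where is my order A-42?"}]
run_turn(messages)
```

Across a multi-turn session, the same loop runs once per user message while `messages` accumulates, which is where the 32K context length matters.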
## Ideal Use Cases
- Customer Service Agents: Automating complex, multi-step customer interactions requiring tool access.
- Technical Support Bots: Resolving issues by interacting with various systems and maintaining conversation flow.
- Interactive Problem Solving: Applications where an agent needs to dynamically use tools based on user input over time.
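For any of these use cases, a model like this is typically served behind an OpenAI-compatible endpoint. A sketch of the request body such an agent would send (the model identifier, endpoint path, and `search_flights` tool are assumptions for illustration; the request is only constructed here, not sent):

```python
import json

# Hypothetical chat-completions request to an OpenAI-compatible server
# (e.g. vLLM) hosting the model; the model name is an assumed identifier.
payload = {
    "model": "MUA-RL-32B",
    "messages": [
        {"role": "system", "content": "You are an airline customer-service agent."},
        {"role": "user", "content": "I want to change my flight."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "search_flights",
            "description": "Search alternative flights for a booking.",
            "parameters": {
                "type": "object",
                "properties": {"booking_id": {"type": "string"},
                               "date": {"type": "string"}},
                "required": ["booking_id"],
            },
        },
    }],
    "max_tokens": 1024,
}

# Serialize for an HTTP client, e.g. POST <server>/v1/chat/completions.
body = json.dumps(payload)
```

The server's reply would then be fed into the tool-execution loop shown earlier, one turn per user message.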