praveenkrovvidi/rl-cas-trl-agent
The praveenkrovvidi/rl-cas-trl-agent is a 3.1-billion-parameter Qwen2.5-3B-Instruct model, fine-tuned with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). It is designed to select appropriate enterprise tools to resolve customer service queries within a fixed action space. The model outputs JSON containing chain-of-thought reasoning and an action ID, making it suitable for automated customer service environments that require tool invocation.
RL-CAS TRL Agent Overview
This model, developed by Praveen Krovvidi, is a fine-tuned Qwen2.5-3B-Instruct model with 3.1 billion parameters and a 32,768-token context length. It specializes in selecting enterprise tools to resolve customer service queries, operating within a predefined action space rather than generating free-form responses.
Key Capabilities & Training
The agent's training pipeline involves a multi-stage approach:
- SAC-Discrete: An initial MLP policy trained for 3,000 episodes in a dense reward environment with 13 actions (7 API tools, 5 terminal actions, 1 meta action).
- Supervised Fine-Tuning (SFT): QLoRA fine-tuning on expert trajectories from the SAC agent, specifically those with high rewards.
- Group Relative Policy Optimization (GRPO): Further RL training using live environment rewards.
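The fixed 13-action space underlying all three training stages can be sketched as follows. Only invoke_order_service, invoke_refund_service, and escalate_to_human_agent are named in this card; every other action name below is an illustrative placeholder, not the model's real action table.

```python
from enum import IntEnum

class ActionType(IntEnum):
    API_TOOL = 0
    TERMINAL = 1
    META = 2

# 7 API tools + 5 terminal actions + 1 meta action = 13 actions.
# Names other than the three documented ones are hypothetical placeholders.
ACTIONS = {
    0: ("invoke_order_service", ActionType.API_TOOL),
    1: ("invoke_refund_service", ActionType.API_TOOL),
    # ids 2-6: the remaining API tools (names not documented here)
    **{i: (f"api_tool_{i}", ActionType.API_TOOL) for i in range(2, 7)},
    7: ("escalate_to_human_agent", ActionType.TERMINAL),
    # ids 8-11: the remaining terminal actions (names not documented here)
    **{i: (f"terminal_action_{i}", ActionType.TERMINAL) for i in range(8, 12)},
    12: ("meta_action", ActionType.META),
}

def is_valid_action(action_id: int) -> bool:
    """The agent may only choose ids inside the fixed 13-action space."""
    return action_id in ACTIONS
```

Enumerating the space up front makes it easy to reject out-of-range ids from the model before any tool is invoked.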
The model's output is a JSON object containing chain-of-thought reasoning and a specific `action_id` used to invoke tools such as `invoke_order_service`, `invoke_refund_service`, or `escalate_to_human_agent`.
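A minimal sketch of consuming that output is shown below. The exact key names ("reasoning", "action_id") are assumptions for illustration; the card only states that the JSON contains chain-of-thought reasoning and an action id.

```python
import json

def parse_agent_output(raw: str) -> tuple[str, int]:
    """Parse the agent's JSON output into (reasoning, action_id).

    Key names are assumed; adjust to the model's actual schema.
    """
    obj = json.loads(raw)
    reasoning = obj["reasoning"]
    action_id = int(obj["action_id"])
    # Reject anything outside the fixed 13-action space before dispatch.
    if not 0 <= action_id < 13:
        raise ValueError(f"action_id {action_id} is outside the action space")
    return reasoning, action_id

# Hypothetical model output for a refund query:
raw = '{"reasoning": "Customer requests a refund for order 123.", "action_id": 1}'
reasoning, action_id = parse_agent_output(raw)
```

Validating the id before dispatch keeps a malformed generation from reaching the RL-CAS mock backend.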
Performance and Intended Use
In a benchmark of 100 customer service queries across 8 categories, this TRL model achieved a 97% resolution rate. Its average reward (15.58) was lower than both GPT-4o-mini (19.43) and the SAC MLP policy (19.36), but it demonstrates robust performance in its specialized domain. It is intended for use within the RL-CAS customer service environment and requires the RL-CAS mock backend for tool execution. The model was trained on synthetic customer service data.
Limitations
- Struggles with multi-tool sequences involving auth and refund actions.
- Requires the RL-CAS mock backend for tool execution.
- Trained on synthetic data, not real customer interactions.