praveenkrovvidi/rl-cas-trl-agent

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 3.1B · Quant: BF16 · Ctx Length: 32k · Published: Apr 25, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

The praveenkrovvidi/rl-cas-trl-agent is a 3.1-billion-parameter Qwen2.5-3B-Instruct model, fine-tuned using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). It is specifically designed to select the appropriate enterprise tool to resolve customer service queries within a fixed action space. The model outputs JSON containing chain-of-thought reasoning and an action ID, making it suitable for automated customer service environments that require tool invocation.


RL-CAS TRL Agent Overview

This model, developed by Praveen Krovvidi, is a fine-tuned Qwen2.5-3B-Instruct model with 3.1 billion parameters and a 32,768-token context length. It specializes in selecting enterprise tools to resolve customer service queries, operating within a predefined action space rather than generating free-form responses.

Key Capabilities & Training

The agent's training pipeline involves a multi-stage approach:

  • SAC-Discrete: An initial MLP policy trained for 3,000 episodes in a dense reward environment with 13 actions (7 API tools, 5 terminal actions, 1 meta action).
  • Supervised Fine-Tuning (SFT): QLoRA fine-tuning on expert trajectories from the SAC agent, specifically those with high rewards.
  • Group Relative Policy Optimization (GRPO): Further RL training using live environment rewards.
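The fixed 13-action space from the SAC-Discrete stage (7 API tools, 5 terminal actions, 1 meta action) can be sketched as a simple enumeration. Only invoke_order_service, invoke_refund_service, and escalate_to_human_agent are named in the model card; every other action name below is a hypothetical placeholder.

```python
# Illustrative sketch of the 13-action fixed space: 7 API tools,
# 5 terminal actions, and 1 meta action. Names marked "hypothetical"
# do not come from the model card.
API_TOOLS = [
    "invoke_order_service",
    "invoke_refund_service",
    "invoke_auth_service",        # hypothetical
    "invoke_shipping_service",    # hypothetical
    "invoke_billing_service",     # hypothetical
    "invoke_account_service",     # hypothetical
    "invoke_faq_service",         # hypothetical
]
TERMINAL_ACTIONS = [
    "escalate_to_human_agent",
    "resolve_and_close",          # hypothetical
    "ask_clarifying_question",    # hypothetical
    "decline_request",            # hypothetical
    "end_conversation",           # hypothetical
]
META_ACTIONS = ["noop"]           # hypothetical

ACTION_SPACE = API_TOOLS + TERMINAL_ACTIONS + META_ACTIONS
ACTION_ID = {name: i for i, name in enumerate(ACTION_SPACE)}
```

A fixed enumeration like this lets both the dense-reward SAC environment and the generative agent agree on a stable action indexing.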

The model's output is a JSON object containing chain-of-thought reasoning and a specific action_id to invoke tools like invoke_order_service, invoke_refund_service, or escalate_to_human_agent.
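Because the model emits structured JSON rather than free-form text, the host application should validate each completion before acting on it. A minimal sketch follows; the card specifies chain-of-thought reasoning plus an action_id, but the exact key name for the reasoning field ("reasoning" here) is an assumption.

```python
import json

# Actions named in the model card; the full 13-action space would be
# listed here in a real deployment.
VALID_ACTIONS = {
    "invoke_order_service",
    "invoke_refund_service",
    "escalate_to_human_agent",
}

def parse_agent_output(raw: str) -> dict:
    """Parse one model completion and sanity-check its action_id."""
    obj = json.loads(raw)
    if "action_id" not in obj:
        raise ValueError("missing action_id")
    if obj["action_id"] not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {obj['action_id']}")
    return obj

sample = (
    '{"reasoning": "Customer wants a refund for order 123.",'
    ' "action_id": "invoke_refund_service"}'
)
result = parse_agent_output(sample)
```

Rejecting unknown action IDs at this boundary keeps a malformed completion from ever reaching the tool layer.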

Performance and Intended Use

In a benchmark of 100 customer service queries across 8 categories, this TRL model achieved a 97% resolution rate. While its average reward (15.58) was lower than that of GPT-4o-mini (19.43) and the SAC MLP policy (19.36), it demonstrates robust performance in its specialized domain. It is intended solely for use within the RL-CAS customer service environment and requires the RL-CAS mock backend for tool execution. The model was trained on synthetic customer service data.
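The two figures quoted above are straightforward aggregates over per-query outcomes. An illustrative computation on hypothetical records (the actual 100-query benchmark data is not published in this card):

```python
# Toy per-query records standing in for the real benchmark data;
# "resolved" and "reward" are the two fields the metrics need.
queries = [
    {"resolved": True,  "reward": 18.0},
    {"resolved": True,  "reward": 16.5},
    {"resolved": False, "reward": -2.0},
    {"resolved": True,  "reward": 20.0},
]

# Resolution rate: fraction of queries the agent closed successfully.
resolution_rate = sum(q["resolved"] for q in queries) / len(queries)

# Average reward: mean environment reward across all queries.
average_reward = sum(q["reward"] for q in queries) / len(queries)
```

Note that the two metrics can diverge, as they do in the benchmark: an agent can resolve nearly every query (high resolution rate) while taking longer or costlier action sequences (lower average reward).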

Limitations

  • Struggles with multi-tool sequences involving auth and refund actions.
  • Requires the RL-CAS mock backend for tool execution.
  • Trained on synthetic data, not real customer interactions.
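Since the model only selects actions and the RL-CAS mock backend executes them, the host application needs a thin dispatch layer between the two. A hedged sketch with hypothetical stub implementations (none of these function bodies come from the card):

```python
# Hypothetical stand-ins for the RL-CAS mock backend tools. In a real
# deployment these would call the backend's actual services.
def invoke_order_service(query: str) -> dict:
    return {"status": "ok", "tool": "order"}

def invoke_refund_service(query: str) -> dict:
    return {"status": "ok", "tool": "refund"}

def escalate_to_human_agent(query: str) -> dict:
    return {"status": "escalated"}

DISPATCH = {
    "invoke_order_service": invoke_order_service,
    "invoke_refund_service": invoke_refund_service,
    "escalate_to_human_agent": escalate_to_human_agent,
}

def execute(action_id: str, query: str) -> dict:
    # Unknown or unsupported actions fall back to human escalation, a
    # conservative default given the model's difficulty with multi-tool
    # auth/refund sequences.
    handler = DISPATCH.get(action_id, escalate_to_human_agent)
    return handler(query)
```

Escalation as the default path is one way to contain the multi-tool limitation: when the agent's choice falls outside the supported set, a human takes over rather than the backend guessing.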