praveenkrovvidi/rl-cas-trl-agent

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 3.1B · Quant: BF16 · Ctx Length: 32k · Published: Apr 25, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

The praveenkrovvidi/rl-cas-trl-agent is a 3.1-billion-parameter Qwen2.5-3B-Instruct model, fine-tuned using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). It is specifically designed to select the appropriate enterprise tool to resolve customer service queries within a fixed action space. The model outputs JSON containing chain-of-thought reasoning and an action ID, making it suitable for automated customer service environments that require tool invocation.


RL-CAS TRL Agent Overview

This model, developed by Praveen Krovvidi, is a fine-tuned Qwen2.5-3B-Instruct model with 3.1 billion parameters and a 32,768-token context length. It specializes in selecting enterprise tools to resolve customer service queries, operating within a predefined action space rather than generating free-form responses.

Key Capabilities & Training

The agent's training pipeline involves a multi-stage approach:

  • SAC-Discrete: An initial MLP policy trained for 3,000 episodes in a dense reward environment with 13 actions (7 API tools, 5 terminal actions, 1 meta action).
  • Supervised Fine-Tuning (SFT): QLoRA fine-tuning on expert trajectories from the SAC agent, specifically those with high rewards.
  • Group Relative Policy Optimization (GRPO): Further RL training using live environment rewards.
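The fixed 13-action space from the SAC-Discrete stage (7 API tools, 5 terminal actions, 1 meta action) can be sketched as a simple enumeration. Only invoke_order_service, invoke_refund_service, and escalate_to_human_agent are named in the model card; every other action name below is a hypothetical placeholder.

```python
# Illustrative sketch of the 13-action fixed space: 7 API tools,
# 5 terminal actions, and 1 meta action. Names marked "hypothetical"
# do not come from the model card.
API_TOOLS = [
    "invoke_order_service",
    "invoke_refund_service",
    "invoke_auth_service",        # hypothetical
    "invoke_shipping_service",    # hypothetical
    "invoke_billing_service",     # hypothetical
    "invoke_account_service",     # hypothetical
    "invoke_faq_service",         # hypothetical
]
TERMINAL_ACTIONS = [
    "escalate_to_human_agent",
    "resolve_and_close",          # hypothetical
    "ask_clarifying_question",    # hypothetical
    "decline_request",            # hypothetical
    "end_conversation",           # hypothetical
]
META_ACTIONS = ["noop"]           # hypothetical

ACTION_SPACE = API_TOOLS + TERMINAL_ACTIONS + META_ACTIONS
ACTION_ID = {name: i for i, name in enumerate(ACTION_SPACE)}
```

A fixed enumeration like this lets both the dense-reward SAC environment and the generative agent agree on a stable action indexing.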

The model's output is a JSON object containing chain-of-thought reasoning and a specific action_id to invoke tools like invoke_order_service, invoke_refund_service, or escalate_to_human_agent.
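Because the model emits structured JSON rather than free-form text, the host application should validate each completion before acting on it. A minimal sketch follows; the card specifies chain-of-thought reasoning plus an action_id, but the exact key name for the reasoning field ("reasoning" here) is an assumption.

```python
import json

# Actions named in the model card; the full 13-action space would be
# listed here in a real deployment.
VALID_ACTIONS = {
    "invoke_order_service",
    "invoke_refund_service",
    "escalate_to_human_agent",
}

def parse_agent_output(raw: str) -> dict:
    """Parse one model completion and sanity-check its action_id."""
    obj = json.loads(raw)
    if "action_id" not in obj:
        raise ValueError("missing action_id")
    if obj["action_id"] not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {obj['action_id']}")
    return obj

sample = (
    '{"reasoning": "Customer wants a refund for order 123.",'
    ' "action_id": "invoke_refund_service"}'
)
result = parse_agent_output(sample)
```

Rejecting unknown action IDs at this boundary keeps a malformed completion from ever reaching the tool layer.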

Performance and Intended Use

In a benchmark of 100 customer service queries across 8 categories, this TRL model achieved a 97% resolution rate. While its average reward (15.58) was lower than that of GPT-4o-mini (19.43) and the SAC MLP policy (19.36), it demonstrates robust performance in its specialized domain. It is intended solely for use within the RL-CAS customer service environment and requires the RL-CAS mock backend for tool execution. The model was trained on synthetic customer service data.
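The two figures quoted above are straightforward aggregates over per-query outcomes. An illustrative computation on hypothetical records (the actual 100-query benchmark data is not published in this card):

```python
# Toy per-query records standing in for the real benchmark data;
# "resolved" and "reward" are the two fields the metrics need.
queries = [
    {"resolved": True,  "reward": 18.0},
    {"resolved": True,  "reward": 16.5},
    {"resolved": False, "reward": -2.0},
    {"resolved": True,  "reward": 20.0},
]

# Resolution rate: fraction of queries the agent closed successfully.
resolution_rate = sum(q["resolved"] for q in queries) / len(queries)

# Average reward: mean environment reward across all queries.
average_reward = sum(q["reward"] for q in queries) / len(queries)
```

Note that the two metrics can diverge, as they do in the benchmark: an agent can resolve nearly every query (high resolution rate) while taking longer or costlier action sequences (lower average reward).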

Limitations

  • Struggles with multi-tool sequences involving auth and refund actions.
  • Requires the RL-CAS mock backend for tool execution.
  • Trained on synthetic data, not real customer interactions.
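Since the model only selects actions and the RL-CAS mock backend executes them, the host application needs a thin dispatch layer between the two. A hedged sketch with hypothetical stub implementations (none of these function bodies come from the card):

```python
# Hypothetical stand-ins for the RL-CAS mock backend tools. In a real
# deployment these would call the backend's actual services.
def invoke_order_service(query: str) -> dict:
    return {"status": "ok", "tool": "order"}

def invoke_refund_service(query: str) -> dict:
    return {"status": "ok", "tool": "refund"}

def escalate_to_human_agent(query: str) -> dict:
    return {"status": "escalated"}

DISPATCH = {
    "invoke_order_service": invoke_order_service,
    "invoke_refund_service": invoke_refund_service,
    "escalate_to_human_agent": escalate_to_human_agent,
}

def execute(action_id: str, query: str) -> dict:
    # Unknown or unsupported actions fall back to human escalation, a
    # conservative default given the model's difficulty with multi-tool
    # auth/refund sequences.
    handler = DISPATCH.get(action_id, escalate_to_human_agent)
    return handler(query)
```

Escalation as the default path is one way to contain the multi-tool limitation: when the agent's choice falls outside the supported set, a human takes over rather than the backend guessing.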