anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2

Text Generation · Concurrency Cost: 1 · Model Size: 2B · Quant: BF16 · Ctx Length: 32k · Published: Apr 25, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2 is a 1.7 billion parameter Qwen3-based language model, fine-tuned with TRL's GRPO using a KL anchor (β=0.2) against its frozen base. It is optimized for agentic, multi-turn interactions: when a user request is ambiguous, the model is trained to ask clarifying questions before attempting to act. This "ask-first" policy suits scenarios such as event planning or medical intake, where guessing at missing information would otherwise lead to hallucination.


What is anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2?

This model is a 1.7 billion parameter Qwen3-based language model, fine-tuned by anurag203 using Group Relative Policy Optimization (GRPO) with a KL anchor (β=0.2). The primary goal of this training was to steer the model towards an "ask-first" policy: clarifying underspecified user requests through questions before proposing a plan.
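Because the model is Qwen3-based, requests are expected in Qwen's ChatML-style turn format. The sketch below renders messages into that format by hand so the shape is visible; the helper name and system prompt are illustrative, and in practice you should use the tokenizer's `apply_chat_template()` so the exact template is always correct.

```python
# Minimal sketch of the ChatML-style prompt format used by the Qwen family.
# The special tokens (<|im_start|>, <|im_end|>) follow the base model's
# documented template; prefer tokenizer.apply_chat_template() in real code.

def build_qwen3_prompt(messages: list[dict]) -> str:
    """Render a list of {"role", "content"} messages into one prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Open the assistant turn so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_qwen3_prompt([
    {"role": "system", "content": "Clarify ambiguous requests before acting."},
    {"role": "user", "content": "Plan an event for my team."},
])
print(prompt)
```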

Key Capabilities and Differentiators

  • Clarification-Oriented Agent: Unlike general chat assistants, this model is explicitly trained to ask clarifying questions when faced with ambiguous requests, rather than making assumptions or hallucinating.
  • GRPO with KL Anchor: The use of a KL anchor at β=0.2 was critical in preventing capability collapse observed in earlier runs, leading to a measurable improvement on held-out evaluations while preserving breadth across task families.
  • Cost-Efficient Training: The model was trained in approximately 78 minutes on a single A100 GPU, costing around $1.80, demonstrating efficient RL fine-tuning.
  • Specific Task Families: Evaluated across five task families including event_planning, medical_intake, meeting_scheduling, support_triage, and coding, with notable improvements in event_planning over the base model.
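The "ask-first" behavior matters mainly to the agent loop around the model: a clarifying question should be routed back to the user rather than executed as a plan. Below is a minimal sketch of such a loop, with a stubbed `generate` function standing in for the actual model call; the heuristic and function names are illustrative assumptions, not part of the model's API.

```python
def generate(history: list[dict]) -> str:
    """Stub for the model call; replace with real inference against the model."""
    # A clarification-tuned model should respond to an underspecified
    # request with questions before proposing a plan.
    return "How many attendees, and what is your budget?"

def is_clarifying_question(reply: str) -> bool:
    """Crude heuristic: treat a reply ending in '?' as a clarification."""
    return reply.rstrip().endswith("?")

def agent_step(history: list[dict], user_input: str):
    """One turn of an ask-first agent loop."""
    history = history + [{"role": "user", "content": user_input}]
    reply = generate(history)
    history.append({"role": "assistant", "content": reply})
    if is_clarifying_question(reply):
        action = "ask_user"   # surface the question; do not execute anything yet
    else:
        action = "execute"    # reply is a concrete plan; hand off to tools
    return action, history

action, history = agent_step([], "Plan an event for my team.")
print(action)  # → ask_user  (given the stubbed reply above)
```

A production loop would replace the ending-punctuation heuristic with something sturdier (e.g. a structured output field), but the branching structure is the point: clarifications go back to the user, plans go to tools.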

Should I use this for my use case?

Good for:

  • Research and Hackathons: Reproducing the KL-anchor ablation study on a small reasoner.
  • Demo and Education: Illustrating how a 1.7B parameter model can be guided towards an "ask-first" policy with a small RL budget.
  • Agentic, Multi-turn, Tool-using Settings: Ideal as a drop-in replacement for Qwen/Qwen3-1.7B where an agent needs to clarify ambiguous requests instead of hallucinating.
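Since the model shares the Qwen/Qwen3-1.7B architecture, swapping it in is a one-line change of the checkpoint id. A hedged sketch using the standard `transformers` auto classes (the import is deferred so the snippet can be read and checked without the dependency installed):

```python
BASE_ID = "Qwen/Qwen3-1.7B"
MODEL_ID = "anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2"  # drop-in replacement

def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and model with the standard transformers auto classes."""
    # Imported here so the sketch can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model

# tokenizer, model = load_model()  # uncomment to download weights and run
```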

Not suitable for:

  • General chat assistance or open-ended prompts, as its reward shaping is highly specific.
  • Production, safety-critical, medical, or legal applications due to lack of RLHF safety alignment.
  • Non-English tasks, as it is limited to English.