Name: williyam/redrob-qwen-grpo API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: williyam

Model Overview

redrob-qwen-grpo is a 0.8 billion parameter model, fine-tuned from Qwen/Qwen3-0.6B by Williyam using the GRPO (Group Relative Policy Optimization) algorithm. Its primary function is explainable candidate ranking: it generates structured JSON output containing a hiring decision, a score, and supporting reasons. A key differentiator is that it was trained against a rule-based reward model rather than an LLM-as-a-judge, so the training signal itself is auditable and interpretable.
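As a rough illustration of the idea behind GRPO, each completion sampled for a prompt is scored relative to the other completions sampled for that same prompt, which removes the need for a separate value network. The sketch below shows only that normalization step; the function name, the example reward values, and the use of the population standard deviation are assumptions for illustration and are not taken from this model's training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled completions.

    Hypothetical helper: each reward is centered on the group mean and
    scaled by the group standard deviation (population std is an
    assumption; implementations differ on this detail).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative rule-based rewards for four completions of one prompt
rewards = [0.9, 0.5, 0.7, 0.3]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged.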

Key Capabilities

Explainable Candidate Ranking: Produces a JSON output with decision ("shortlist"/"reject"), score (0-1), and reasons (short, grounded bullet points).
Rule-Based Reward Training: Fine-tuned against a reward model built from six interpretable components (format_valid, decision_match, score_alignment, reason_quality, length_penalty, no_hallucination), so each part of the reward can be inspected individually.
Improved Performance: Achieves a mean rule-based reward of 0.713, up from the baseline's 0.539 (a gain of 0.174), particularly in reason_quality and score_alignment.
Structured Output: Designed to consistently return a valid JSON object, making it suitable for integration into automated pipelines.
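Because the output is meant to feed automated pipelines, a consumer may still want to check each response before acting on it. The following is a minimal sketch of such a check, assuming the field names decision, score, and reasons described above and assuming that reasons arrives as a list of non-empty strings. The function name validate_ranking and the sample response are hypothetical and are not part of the model's published tooling.

```python
import json

ALLOWED_DECISIONS = {"shortlist", "reject"}

def validate_ranking(raw: str) -> dict:
    """Parse a model response and check it against the described schema.

    Hypothetical helper; raises ValueError on any violation
    (json.JSONDecodeError is itself a subclass of ValueError).
    """
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")

    decision = obj.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unexpected decision: {decision!r}")

    score = obj.get("score")
    # bool is a subclass of int in Python, so it is excluded explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 1:
        raise ValueError("score must lie in [0, 1]")

    reasons = obj.get("reasons")
    # Assumption: reasons is a list of short, non-empty strings
    if not isinstance(reasons, list) or not all(
        isinstance(r, str) and r.strip() for r in reasons
    ):
        raise ValueError("reasons must be a list of non-empty strings")

    return obj

# Illustrative response text, not real model output
sample = (
    '{"decision": "shortlist", "score": 0.82, '
    '"reasons": ["5 years of Python", "Led a team of 4"]}'
)
result = validate_ranking(sample)
```

Rejecting malformed responses at this boundary keeps downstream shortlisting code free of defensive checks.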

Good For

Educational and Research Purposes: Demonstrates GRPO's effectiveness with rule-based rewards for structured output generation in real-world tasks.
Drop-in Component: Ideal for developers needing an LLM-powered candidate ranker that provides auditable JSON responses for shortlisting pipelines.
Reference Implementation: The entire training loop, environment, and reward model are open-source, serving as a valuable resource for similar projects.
