williyam/redrob-qwen-grpo
redrob-qwen-grpo is a 0.8 billion parameter model fine-tuned from Qwen3-0.6B by Williyam using GRPO for explainable candidate ranking. It generates structured JSON outputs for hiring decisions (a decision, a score, and reasons) and was trained against a rule-based reward model rather than an LLM-as-a-judge. The model is optimized for producing auditable, interpretable candidate assessments.
Model Overview
redrob-qwen-grpo is a 0.8 billion parameter model, fine-tuned from Qwen/Qwen3-0.6B by Williyam using the GRPO (Group Relative Policy Optimization) algorithm. Its primary function is explainable candidate ranking: it generates structured JSON outputs that include a hiring decision, a score, and detailed reasons. A key differentiator is that it was trained against a rule-based reward model, with no LLM-as-a-judge in the loop, which is intended to keep decisions auditable and interpretable.
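GRPO scores each sampled completion relative to the other completions drawn for the same prompt, instead of relying on a learned value function. The sketch below illustrates that group-relative normalization only; it is not the project's training code. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    # Population standard deviation is an assumption; implementations vary.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four completions of one prompt.
rewards = [0.9, 0.6, 0.6, 0.3]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.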
Key Capabilities
- Explainable Candidate Ranking: Produces a JSON output with `decision` ("shortlist"/"reject"), `score` (0-1), and `reasons` (short, grounded bullet points).
- Rule-Based Reward Training: Fine-tuned using a reward model with six interpretable components (`format_valid`, `decision_match`, `score_alignment`, `reason_quality`, `length_penalty`, `no_hallucination`), so each output is scored against explicit, inspectable criteria.
- Improved Performance: Achieves a mean rule-based reward of 0.713, up from the baseline's 0.539 (+0.174), with the largest gains in `reason_quality` and `score_alignment`.
- Structured Output: Designed to consistently return a valid JSON object, making it suitable for integration into automated pipelines.
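Because the output contract is a small JSON object, a pipeline can check it before acting on it. The following is a minimal sketch of such a check, assuming the three fields and value ranges listed above; `parse_assessment` and `ALLOWED_DECISIONS` are hypothetical names, the sample string is made up rather than real model output, and this is not the project's `format_valid` implementation.

```python
import json

ALLOWED_DECISIONS = {"shortlist", "reject"}  # values named in the list above


def parse_assessment(raw: str) -> dict:
    """Parse one model output and check it against the documented fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unexpected decision: {decision!r}")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 1:
        raise ValueError("score must lie in [0, 1]")

    reasons = data.get("reasons")
    if not isinstance(reasons, list) or not all(isinstance(r, str) for r in reasons):
        raise ValueError("reasons must be a list of strings")

    return {"decision": decision, "score": float(score), "reasons": reasons}


# Made-up example output, for illustration only.
sample = '{"decision": "shortlist", "score": 0.82, "reasons": ["5 years of Python", "Led a data team"]}'
assessment = parse_assessment(sample)
```

Raising `ValueError` on any violation lets the caller decide whether to retry generation or route the candidate to manual review.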
Good For
- Educational and Research Purposes: Demonstrates GRPO's effectiveness with rule-based rewards for structured output generation in real-world tasks.
- Drop-in Component: Ideal for developers needing an LLM-powered candidate ranker that provides auditable JSON responses for shortlisting pipelines.
- Reference Implementation: The entire training loop, environment, and reward model are open-source, serving as a valuable resource for similar projects.
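As a rough idea of how the JSON responses might feed a shortlisting step, the sketch below keeps candidates whose `decision` is "shortlist" and orders them by `score`. The `shortlist` function, the candidate IDs, and the raw strings are hypothetical; how outputs are generated and stored is left out, and silently skipping malformed outputs is a design choice of this sketch, not a documented behavior.

```python
import json


def shortlist(outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Return (candidate_id, score) pairs for shortlisted candidates, best first."""
    ranked = []
    for candidate_id, raw in outputs.items():
        try:
            data = json.loads(raw)
            if data["decision"] == "shortlist":
                ranked.append((candidate_id, float(data["score"])))
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete output: skip it in this sketch.
            continue
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up raw outputs keyed by hypothetical candidate IDs.
outputs = {
    "cand_a": '{"decision": "shortlist", "score": 0.71, "reasons": ["Relevant backend experience"]}',
    "cand_b": '{"decision": "reject", "score": 0.2, "reasons": ["No required certification"]}',
    "cand_c": '{"decision": "shortlist", "score": 0.9, "reasons": ["Matches all listed skills"]}',
    "cand_d": "not valid json",
}
ranked = shortlist(outputs)
```

In a production pipeline, skipped outputs would more likely be logged or retried so that no candidate is dropped without a record.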