lihaoxin2020/qwen3-4B-refiner-rubric-rl-step50
lihaoxin2020/qwen3-4B-refiner-rubric-rl-step50 is a 4-billion-parameter Qwen3-based refiner model, fine-tuned using Group Relative Policy Optimization (GRPO) with per-instance rubric rewards. It builds on the lihaoxin2020/qwen3-4B-refiner-sft-step-3201 base model, and its reward configuration, which includes verification and paper-citation rewards, targets refined, high-quality responses. The model is designed for refining outputs, particularly in an "answer_only" mode, and supports a maximum context length of 32768 tokens.
Overview
This model, qwen3-4B-refiner-rubric-rl-step50, is a 4-billion-parameter Qwen3-based refiner checkpoint, taken at step 50 of fine-tuning with Group Relative Policy Optimization (GRPO), a reinforcement learning method. Its key differentiator is the use of per-instance rubric rewards during training, which encourages more precise and contextually appropriate responses. A minimal usage sketch follows.
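The checkpoint can be loaded with the standard Hugging Face transformers API. This is a minimal sketch only: the exact refinement prompt format (how a draft answer should be presented in "answer_only" mode) is not documented in this card, so the prompt below is a hypothetical placeholder.

```python
# Minimal loading/generation sketch using the standard transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lihaoxin2020/qwen3-4B-refiner-rubric-rl-step50"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Hypothetical prompt: the refiner's expected input format is not
# specified in this card, so adapt this to the actual template.
messages = [
    {"role": "user",
     "content": "Refine the following draft answer:\n<draft answer here>"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# 8192-token response cap per the card's stated limits
output = model.generate(input_ids, max_new_tokens=8192)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```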
Key Capabilities
- Refined Response Generation: Optimized for refining outputs, particularly in an "answer_only" mode, building upon the lihaoxin2020/qwen3-4B-refiner-sft-step-3201 base model.
- Reward-Driven Quality: Incorporates a reward configuration with a 10.0 verification reward and a 0.2 weight for paper-citation rewards, reflecting an emphasis on factual accuracy and well-cited responses.
- Advanced Training Methodology: Trained with GRPO under DeepSpeed Stage 3, using a constant learning rate of 5e-6 and a KL penalty of 0.001; see the configuration sketch after this list.
- High Context Handling: Accepts prompts of up to 6144 tokens and generates responses of up to 8192 tokens, within an overall maximum context length of 32768 tokens.
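As a hedged illustration, the reported hyperparameters map naturally onto TRL's GRPOConfig. The card does not state which GRPO implementation was actually used, so treat this as a sketch under that assumption; the output directory is a hypothetical placeholder, and the per-instance rubric reward functions would be project-specific.

```python
# Sketch of the reported GRPO hyperparameters expressed via TRL's GRPOConfig.
# Assumption: TRL is used here for illustration only; the card does not name
# the training framework. DeepSpeed Stage 3 would be enabled separately through
# an accelerate/DeepSpeed launcher config, not through these arguments.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen3-4B-refiner-rubric-rl",  # hypothetical path
    learning_rate=5e-6,            # constant learning rate from the card
    lr_scheduler_type="constant",
    beta=0.001,                    # KL penalty coefficient
    max_prompt_length=6144,        # max prompt tokens
    max_completion_length=8192,    # max response tokens
)
```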
Good for
- Improving Answer Quality: Ideal for applications requiring highly refined and accurate text generation, especially where responses need to adhere to specific rubrics or verification criteria.
- Research and Development: Useful for researchers exploring advanced reinforcement learning techniques like GRPO and per-instance rubric rewards in language models.
- Fact-Checking and Citation-Aware Generation: Its reward structure suggests suitability for tasks where verifying information and citing sources are important, potentially reducing hallucinations.