lihaoxin2020/qwen3-4B-refiner-rubric-rl-step50
lihaoxin2020/qwen3-4B-refiner-rubric-rl-step50 is a 4-billion-parameter Qwen3-based refiner model, fine-tuned using Group Relative Policy Optimization (GRPO) with per-instance rubric rewards. It builds on the lihaoxin2020/qwen3-4B-refiner-sft-step-3201 base model, and its reward configuration, which includes verification and paper-citation rewards, targets refined, high-quality responses. The model is designed for refining outputs, particularly in an "answer_only" mode, and supports a maximum context length of 32768 tokens.
Overview
This model, qwen3-4B-refiner-rubric-rl-step50, is a 4-billion-parameter Qwen3-based refiner checkpoint, taken at step 50 of fine-tuning with Group Relative Policy Optimization (GRPO), a reinforcement learning method. Its key differentiator is the use of per-instance rubric rewards during training, which encourages more precise and contextually appropriate responses. A minimal usage sketch follows.
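The checkpoint can be loaded with the standard Hugging Face transformers API. This is a minimal sketch only: the exact refinement prompt format (how a draft answer should be presented in "answer_only" mode) is not documented in this card, so the prompt below is a hypothetical placeholder.

```python
# Minimal loading/generation sketch using the standard transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lihaoxin2020/qwen3-4B-refiner-rubric-rl-step50"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Hypothetical prompt: the refiner's expected input format is not
# specified in this card, so adapt this to the actual template.
messages = [
    {"role": "user",
     "content": "Refine the following draft answer:\n<draft answer here>"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# 8192-token response cap per the card's stated limits
output = model.generate(input_ids, max_new_tokens=8192)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```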
Key Capabilities
- Refined Response Generation: Optimized for refining outputs, particularly in an "answer_only" mode, building upon the lihaoxin2020/qwen3-4B-refiner-sft-step-3201 base model.
- Reward-Driven Quality: Incorporates a reward configuration with a 10.0 verification reward and a 0.2 weight for paper-citation rewards, reflecting an emphasis on factual accuracy and well-cited responses.
- Advanced Training Methodology: Trained with GRPO under DeepSpeed Stage 3, using a constant learning rate of 5e-6 and a KL penalty of 0.001; see the configuration sketch after this list.
- High Context Handling: Accepts prompts of up to 6144 tokens and generates responses of up to 8192 tokens, within an overall maximum context length of 32768 tokens.
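As a hedged illustration, the reported hyperparameters map naturally onto TRL's GRPOConfig. The card does not state which GRPO implementation was actually used, so treat this as a sketch under that assumption; the output directory is a hypothetical placeholder, and the per-instance rubric reward functions would be project-specific.

```python
# Sketch of the reported GRPO hyperparameters expressed via TRL's GRPOConfig.
# Assumption: TRL is used here for illustration only; the card does not name
# the training framework. DeepSpeed Stage 3 would be enabled separately through
# an accelerate/DeepSpeed launcher config, not through these arguments.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen3-4B-refiner-rubric-rl",  # hypothetical path
    learning_rate=5e-6,            # constant learning rate from the card
    lr_scheduler_type="constant",
    beta=0.001,                    # KL penalty coefficient
    max_prompt_length=6144,        # max prompt tokens
    max_completion_length=8192,    # max response tokens
)
```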
Good for
- Improving Answer Quality: Ideal for applications requiring highly refined and accurate text generation, especially where responses need to adhere to specific rubrics or verification criteria.
- Research and Development: Useful for researchers exploring advanced reinforcement learning techniques like GRPO and per-instance rubric rewards in language models.
- Fact-Checking and Citation-Aware Generation: Its reward structure suggests suitability for tasks where verifying information and citing sources are important, potentially reducing hallucinations.