lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100
lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100 is a 4-billion-parameter, Qwen3-based refiner model fine-tuned with Group Relative Policy Optimization (GRPO), a reinforcement learning method. It is trained to refine responses in an "answer_only" mode, starting from the lihaoxin2020/qwen3-4B-instruct-refiner-sft checkpoint. Its training rewards target verification and citation, so it is aimed at improving the quality and relevance of generated answers in tasks that depend on both. Its 32768-token context length leaves room for substantial input to refine.
Model Overview
This model, lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100, is a 4-billion-parameter, Qwen3-based refiner checkpoint. It was fine-tuned with Group Relative Policy Optimization (GRPO), a reinforcement learning method, to refine model outputs in an "answer_only" mode.
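As a rough illustration of what "group relative" means in GRPO: several responses are sampled for the same prompt, and each response's reward is normalized against the others in its group. The sketch below shows only that normalization step. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration, not details taken from this model's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar reward per response
    sampled for a single prompt. `eps` (an assumed value) avoids division
    by zero when every response in the group scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled refinements of one answer, scored by some reward function.
advantages = group_relative_advantages([10.5, 0.5, 10.0, 0.0])
```

Responses that score above their group's mean receive a positive advantage and are reinforced; those below the mean receive a negative one.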
Key Training Details
- Base Model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
- Training Method: GRPO with DeepSpeed ZeRO Stage 3, focusing on refining answers.
- Dataset: Trained and evaluated on the lihaoxin2020/refiner_rldataset.
- Reward Configuration: A verification reward of 10.0 and a paper citation reward with a weight of 0.5, indicating an emphasis on factual accuracy and proper sourcing.
- Context Length: Trained with a maximum token length of 8192 and a response length of 1024 tokens, which suits refining detailed answers.
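To make the reward configuration concrete, here is a minimal sketch of how the two listed values could enter a single scalar reward. It assumes the terms are simply added, that verification is a pass/fail check, and that the citation score lies in [0, 1]. None of this is confirmed by the card, and the function and constant names are hypothetical.

```python
VERIFICATION_REWARD = 10.0  # value listed in the card
CITATION_WEIGHT = 0.5       # weight listed in the card


def combined_reward(verified: bool, citation_score: float) -> float:
    """Hypothetical combination: additive, with a binary verification term.

    `citation_score` is assumed to be a value in [0, 1] produced by some
    citation checker; the real scoring function is not described in the card.
    """
    verification_term = VERIFICATION_REWARD if verified else 0.0
    citation_term = CITATION_WEIGHT * citation_score
    return verification_term + citation_term


print(combined_reward(True, 1.0))   # verified and fully cited
print(combined_reward(False, 1.0))  # cited but not verified
```

Under these assumptions the verification term dominates: a verified answer without citations still outscores an unverified answer with perfect citations by a wide margin.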
What Makes This Model Different?
What sets this model apart is its training rather than its architecture: it is a refiner tuned with GRPO to improve answers that have already been generated. Its focus on "answer_only" refinement, combined with explicit rewards for verification and citation, suggests it is optimized for tasks that require factual integrity and well-supported responses. A larger judge model (Qwen/Qwen3.5-35B-A3B) was used during training, in line with that goal.