lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50
lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50 is a 4-billion-parameter Qwen3-based refiner model, fine-tuned with Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) method. The model refines responses in an answer_only mode, building on the lihaoxin2020/qwen3-4B-instruct-refiner-sft base. Its primary application is improving the quality and relevance of generated answers.
Model Overview
This model, lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50, is a 4-billion-parameter Qwen3-based refiner checkpoint. It was trained for 50 steps with Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) method, starting from the lihaoxin2020/qwen3-4B-instruct-refiner-sft base, and is configured for an answer_only refiner mode.
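The checkpoint should load through the standard transformers API, as Qwen3 checkpoints typically do. The snippet below is a minimal loading sketch, not a confirmed recipe from the model card; the dtype and device settings are assumptions.

```python
# Minimal loading sketch. Assumes a standard Hugging Face Qwen3 checkpoint;
# bfloat16 and device_map="auto" are illustrative choices, not documented settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 4B model within a single-GPU memory budget
    device_map="auto",
)
```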
Key Training Details
- Base Model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
- Training Method: GRPO with DeepSpeed Stage 3, utilizing an answer_only refiner mode
- Dataset: Trained and evaluated on the lihaoxin2020/refiner_rl dataset
- Context Length: Supports a maximum token length of 8192, with a max prompt length of 6144 and a max response length of 1024
- Reward Configuration: Incorporates a verification reward of 10.0 and applies a paper citation reward with a weight of 0.5 (these settings are summarized in the sketch after this list)
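For reference, the stated settings can be collected into a single configuration view. The field names below are hypothetical, chosen only for readability; the values come from the details above.

```python
# Illustrative summary of the stated training configuration.
# Field names are hypothetical; only the values are taken from the model card.
grpo_config = {
    "base_model": "lihaoxin2020/qwen3-4B-instruct-refiner-sft",
    "dataset": "lihaoxin2020/refiner_rl",
    "refiner_mode": "answer_only",
    "deepspeed_stage": 3,
    "max_token_length": 8192,      # overall context budget
    "max_prompt_length": 6144,     # prompt portion of the budget
    "max_response_length": 1024,   # budget for the generated refinement
    "verification_reward": 10.0,
    "paper_citation_reward_weight": 0.5,
    "train_steps": 50,             # this checkpoint is the step-50 snapshot
}
```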
Intended Use Cases
- Response Refinement: Ideal for applications requiring the improvement or balancing of generated answers from a base instruction-tuned model.
- Reinforcement Learning Research: Can serve as a checkpoint for further experimentation with GRPO or other RL fine-tuning techniques on Qwen3-based architectures.
- Answer-Only Generation: Suited for scenarios where the focus is solely on refining the answer portion of a model's output, rather than the entire conversational turn (a usage sketch follows this list).
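Continuing from the loading sketch above, the snippet below illustrates answer-only refinement. The exact refiner prompt template is not documented here, so the question/draft-answer framing is an assumption made for illustration.

```python
# Hedged usage sketch: the question/draft-answer prompt format is an assumption,
# not a documented template for this refiner.
question = "What is the capital of Australia?"
draft_answer = "The capital of Australia is Sydney."

messages = [{
    "role": "user",
    "content": (
        f"Question: {question}\n"
        f"Draft answer: {draft_answer}\n"
        "Refine the draft so that the final answer is correct and concise."
    ),
}]

# Build input ids with the model's chat template and generate the refinement.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024)  # matches the 1024-token response budget
refined = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(refined)
```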