lihaoxin2020/qwen3-4B-refiner-3201-rl-balanced-step100

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Apr 13, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

lihaoxin2020/qwen3-4B-refiner-3201-rl-balanced-step100 is a 4 billion parameter Qwen3-based refiner model, fine-tuned with Group Relative Policy Optimization (GRPO), a reinforcement learning method. It is trained to refine responses in answer_only mode, building on the lihaoxin2020/qwen3-4B-refiner-sft-step-3201 base, and is aimed at tasks that require precise, balanced answer generation within its 32768-token context window.


Model Overview

This checkpoint was produced by fine-tuning the supervised base model lihaoxin2020/qwen3-4B-refiner-sft-step-3201 with GRPO; this release corresponds to step 100 of that reinforcement-learning run.

Key Training Details

  • Training Method: GRPO, utilizing DeepSpeed Stage 3 for efficient training.
  • Refiner Mode: Configured for answer_only generation, indicating a focus on producing concise and direct responses.
  • Training Context: RL training used a maximum sequence length of 8192 tokens, with prompts capped at 6144 tokens and responses at 1024 tokens; this sits within the model's full 32k context window.
  • Reward Configuration: Incorporates a verification reward of 10.0 and applies a paper citation reward with a weight of 0.5, suggesting an emphasis on factual accuracy and source attribution.
  • Dataset: Trained and evaluated on the lihaoxin2020/refiner_rl dataset.
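The reward weighting listed above can be sketched as a small scoring function. This is a minimal sketch for illustration only: the signal names, the binary verification flag, and the linear combination are assumptions, not the repository's actual reward code.

```python
# Hypothetical sketch of the reward weighting described in the card.
# The linear combination and signal names are assumptions.

VERIFICATION_REWARD = 10.0   # weight on the verification signal
CITATION_WEIGHT = 0.5        # weight on the paper-citation signal

MAX_PROMPT_TOKENS = 6144
MAX_RESPONSE_TOKENS = 1024
MAX_SEQ_TOKENS = 8192

def total_reward(verified: bool, citation_score: float) -> float:
    """Combine verification and citation signals into one scalar reward."""
    return VERIFICATION_REWARD * float(verified) + CITATION_WEIGHT * citation_score

# The stated token budgets are internally consistent:
assert MAX_PROMPT_TOKENS + MAX_RESPONSE_TOKENS <= MAX_SEQ_TOKENS
```

Under this reading, a verified response with a full citation score would receive 10.0 + 0.5 = 10.5, so the verification signal dominates the citation bonus by a factor of twenty.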

Potential Use Cases

  • Response Refinement: Ideal for applications where an initial model's output needs to be improved or made more precise.
  • Fact-Checking Assistance: The reward configuration, including paper citation rewards, suggests suitability for tasks requiring verifiable information.
  • Balanced Answer Generation: The 'balanced' aspect of its training implies it aims to provide well-rounded and unbiased answers.
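A refinement request for this checkpoint might be assembled as below. This is a hedged sketch: the card does not publish the exact answer_only prompt template, so the instruction wording and message layout here are assumptions.

```python
# Hypothetical sketch of preparing an answer_only refinement request.
# The instruction wording and message layout are assumptions; the model
# card does not publish the checkpoint's actual prompt template.

def build_refinement_messages(question: str, draft_answer: str) -> list[dict]:
    """Pack a question and a draft answer into a chat-style message list,
    asking the refiner to return only the improved final answer."""
    instruction = (
        "Refine the draft answer to the question below. "
        "Respond with the final answer only."
    )
    return [
        {
            "role": "user",
            "content": (
                f"{instruction}\n\n"
                f"Question: {question}\n\n"
                f"Draft answer: {draft_answer}"
            ),
        }
    ]

messages = build_refinement_messages(
    "What is the capital of Australia?",
    "The capital of Australia is Sydney.",
)
```

The resulting `messages` list could then be passed through `tokenizer.apply_chat_template(...)` and `model.generate(...)` as with any Qwen3 chat checkpoint, keeping the prompt under the 6144-token cap used during training.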