Name: lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: lihaoxin2020

Model Overview

This model, lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100, is a 4 billion parameter Qwen3-based refiner checkpoint. It has been fine-tuned using Group Relative Policy Optimization (GRPO), a reinforcement learning method, specifically for refining model outputs in an "answer_only" mode.

Key Training Details

Base Model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
Training Method: GRPO with DeepSpeed Stage 3, focusing on refining answers.
Dataset: Trained and evaluated on the lihaoxin2020/refiner_rl dataset.
Reward Configuration: Incorporates a significant verification reward (10.0) and applies a paper citation reward with a weight of 0.5, indicating an emphasis on factual accuracy and proper sourcing.
Context Length: Supports a maximum token length of 8192, with a response length of 1024 tokens, making it suitable for refining detailed answers.

What Makes This Model Different?

This model is distinct due to its specialized GRPO-trained refiner architecture, designed to enhance the quality of generated answers through reinforcement learning. Its focus on "answer_only" refinement, coupled with specific reward mechanisms for verification and citation, suggests an optimization for tasks requiring high factual integrity and well-supported responses. The use of a powerful judge model (Qwen/Qwen3.5-35B-A3B) during training further underscores its goal of producing high-quality, refined outputs.

Overview

Model Overview

Key Training Details

What Makes This Model Different?

Full Model Card (README)