lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50
lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50 is a 4-billion-parameter Qwen3-based refiner model, fine-tuned with Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) method. The model refines responses in an answer_only mode, building on the lihaoxin2020/qwen3-4B-instruct-refiner-sft base. Its primary application is improving the quality and relevance of generated answers.
Model Overview
This model, lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50, is a 4-billion-parameter Qwen3-based refiner checkpoint. It was trained for 50 steps with Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) method, starting from the lihaoxin2020/qwen3-4B-instruct-refiner-sft base, and is configured for an answer_only refiner mode.
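The checkpoint should load through the standard transformers API, as Qwen3 checkpoints typically do. The snippet below is a minimal loading sketch, not a confirmed recipe from the model card; the dtype and device settings are assumptions.

```python
# Minimal loading sketch. Assumes a standard Hugging Face Qwen3 checkpoint;
# bfloat16 and device_map="auto" are illustrative choices, not documented settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-step50"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 4B model within a single-GPU memory budget
    device_map="auto",
)
```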
Key Training Details
- Base Model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
- Training Method: GRPO with DeepSpeed Stage 3, utilizing an answer_only refiner mode
- Dataset: Trained and evaluated on the lihaoxin2020/refiner_rl dataset
- Context Length: Supports a maximum token length of 8192, with a max prompt length of 6144 and a max response length of 1024
- Reward Configuration: Incorporates a verification reward of 10.0 and applies a paper citation reward with a weight of 0.5 (these settings are summarized in the sketch after this list)
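For reference, the stated settings can be collected into a single configuration view. The field names below are hypothetical, chosen only for readability; the values come from the details above.

```python
# Illustrative summary of the stated training configuration.
# Field names are hypothetical; only the values are taken from the model card.
grpo_config = {
    "base_model": "lihaoxin2020/qwen3-4B-instruct-refiner-sft",
    "dataset": "lihaoxin2020/refiner_rl",
    "refiner_mode": "answer_only",
    "deepspeed_stage": 3,
    "max_token_length": 8192,      # overall context budget
    "max_prompt_length": 6144,     # prompt portion of the budget
    "max_response_length": 1024,   # budget for the generated refinement
    "verification_reward": 10.0,
    "paper_citation_reward_weight": 0.5,
    "train_steps": 50,             # this checkpoint is the step-50 snapshot
}
```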
Intended Use Cases
- Response Refinement: Ideal for applications requiring the improvement or balancing of generated answers from a base instruction-tuned model.
- Reinforcement Learning Research: Can serve as a checkpoint for further experimentation with GRPO or other RL fine-tuning techniques on Qwen3-based architectures.
- Answer-Only Generation: Suited for scenarios where the focus is solely on refining the answer portion of a model's output, rather than the entire conversational turn (a usage sketch follows this list).
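Continuing from the loading sketch above, the snippet below illustrates answer-only refinement. The exact refiner prompt template is not documented here, so the question/draft-answer framing is an assumption made for illustration.

```python
# Hedged usage sketch: the question/draft-answer prompt format is an assumption,
# not a documented template for this refiner.
question = "What is the capital of Australia?"
draft_answer = "The capital of Australia is Sydney."

messages = [{
    "role": "user",
    "content": (
        f"Question: {question}\n"
        f"Draft answer: {draft_answer}\n"
        "Refine the draft so that the final answer is correct and concise."
    ),
}]

# Build input ids with the model's chat template and generate the refinement.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024)  # matches the 1024-token response budget
refined = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(refined)
```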