lihaoxin2020/qwen3-4B-refiner-3201-rl-balanced-step100

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Apr 13, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

lihaoxin2020/qwen3-4B-refiner-3201-rl-balanced-step100 is a 4 billion parameter Qwen3-based refiner model, fine-tuned with Group Relative Policy Optimization (GRPO), a reinforcement learning method. It is trained to refine responses in answer_only mode, building on the lihaoxin2020/qwen3-4B-refiner-sft-step-3201 base, and is aimed at tasks that require precise, balanced answer generation within its 32768-token context window.


Model Overview

This checkpoint was produced by fine-tuning the supervised base model lihaoxin2020/qwen3-4B-refiner-sft-step-3201 with GRPO; this release corresponds to step 100 of that reinforcement-learning run.

Key Training Details

  • Training Method: GRPO, utilizing DeepSpeed Stage 3 for efficient training.
  • Refiner Mode: Configured for answer_only generation, indicating a focus on producing concise and direct responses.
  • Training Context: RL training used a maximum sequence length of 8192 tokens, with prompts capped at 6144 tokens and responses at 1024 tokens; this sits within the model's full 32k context window.
  • Reward Configuration: Incorporates a verification reward of 10.0 and applies a paper citation reward with a weight of 0.5, suggesting an emphasis on factual accuracy and source attribution.
  • Dataset: Trained and evaluated on the lihaoxin2020/refiner_rl dataset.
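The reward weighting listed above can be sketched as a small scoring function. This is a minimal sketch for illustration only: the signal names, the binary verification flag, and the linear combination are assumptions, not the repository's actual reward code.

```python
# Hypothetical sketch of the reward weighting described in the card.
# The linear combination and signal names are assumptions.

VERIFICATION_REWARD = 10.0   # weight on the verification signal
CITATION_WEIGHT = 0.5        # weight on the paper-citation signal

MAX_PROMPT_TOKENS = 6144
MAX_RESPONSE_TOKENS = 1024
MAX_SEQ_TOKENS = 8192

def total_reward(verified: bool, citation_score: float) -> float:
    """Combine verification and citation signals into one scalar reward."""
    return VERIFICATION_REWARD * float(verified) + CITATION_WEIGHT * citation_score

# The stated token budgets are internally consistent:
assert MAX_PROMPT_TOKENS + MAX_RESPONSE_TOKENS <= MAX_SEQ_TOKENS
```

Under this reading, a verified response with a full citation score would receive 10.0 + 0.5 = 10.5, so the verification signal dominates the citation bonus by a factor of twenty.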

Potential Use Cases

  • Response Refinement: Ideal for applications where an initial model's output needs to be improved or made more precise.
  • Fact-Checking Assistance: The reward configuration, including paper citation rewards, suggests suitability for tasks requiring verifiable information.
  • Balanced Answer Generation: The 'balanced' aspect of its training implies it aims to provide well-rounded and unbiased answers.
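A refinement request for this checkpoint might be assembled as below. This is a hedged sketch: the card does not publish the exact answer_only prompt template, so the instruction wording and message layout here are assumptions.

```python
# Hypothetical sketch of preparing an answer_only refinement request.
# The instruction wording and message layout are assumptions; the model
# card does not publish the checkpoint's actual prompt template.

def build_refinement_messages(question: str, draft_answer: str) -> list[dict]:
    """Pack a question and a draft answer into a chat-style message list,
    asking the refiner to return only the improved final answer."""
    instruction = (
        "Refine the draft answer to the question below. "
        "Respond with the final answer only."
    )
    return [
        {
            "role": "user",
            "content": (
                f"{instruction}\n\n"
                f"Question: {question}\n\n"
                f"Draft answer: {draft_answer}"
            ),
        }
    ]

messages = build_refinement_messages(
    "What is the capital of Australia?",
    "The capital of Australia is Sydney.",
)
```

The resulting `messages` list could then be passed through `tokenizer.apply_chat_template(...)` and `model.generate(...)` as with any Qwen3 chat checkpoint, keeping the prompt under the 6144-token cap used during training.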