lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50
The lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50 is a 4-billion-parameter Qwen3-based model: a GRPO checkpoint (step 50) of a research-refiner policy. It is fine-tuned on GPT-5.4 rubric-grounded refiner data to produce concise, citation-grounded answers from agent reasoning and search-tool outputs. The model generates structured responses in which every factual claim is wrapped in snippet-ID tags, making it suitable for research into rubric design, citation weighting, and judge selection.
Model Overview
The lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50 is a 4-billion-parameter model based on the Qwen3 architecture, representing an intermediate checkpoint (step 50) from a GRPO (Group Relative Policy Optimization) training run. Its core function is to act as a research-refiner policy: it takes an agent's reasoning and search-tool output and generates a concise, citation-grounded refined answer.
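A minimal sketch of that input/output contract (the prompt wording, helper names, and tag-parsing details below are illustrative assumptions, not the model's actual chat template):

```python
import re

def build_refiner_prompt(reasoning: str, snippets: dict[str, str]) -> str:
    """Assemble a hypothetical refiner prompt from agent reasoning and
    search-tool snippets. The real template may differ."""
    snippet_block = "\n".join(
        f"<snippet id={sid}>{text}</snippet>" for sid, text in snippets.items()
    )
    return (
        "Refine the answer below. Wrap every factual claim in "
        "<snippet id=...>claim</snippet> tags, citing only IDs from the tool output.\n\n"
        f"Tool output:\n{snippet_block}\n\nAgent reasoning:\n{reasoning}"
    )

def cited_ids_are_valid(answer: str, snippets: dict[str, str]) -> bool:
    """Check that every ID cited in a refined answer appears in the tool output."""
    cited = {
        sid.strip()
        for m in re.finditer(r"<snippet id=([^>]+)>", answer)
        for sid in m.group(1).split(",")
    }
    return cited <= snippets.keys()
```

For example, an answer citing a snippet ID absent from the tool output, such as `<snippet id=S3>...</snippet>` when only S1 and S2 were retrieved, would fail the `cited_ids_are_valid` check.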
Key Capabilities & Training
- Refined Answer Generation: Produces answers where every factual claim is wrapped in `<snippet id=ID1,ID2,...>claim</snippet>` tags, citing only IDs present in the raw tool output.
- Rubric-Grounded Training: Fine-tuned on the `lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric` dataset, which includes per-instance positive and negative rubrics generated by GPT-5.4.
- Hybrid Reward System: Uses a reward function combining an LLM rubric judge (Qwen3.5-35B-A3B) with a paper-citation quality metric (weighted 0.2, computed as precision/recall/F1 over claim-level snippet attributions).
- Context Handling: Supports a `max_prompt_token_length` of 6144 and a `response_length` of 1024, for a total `max_token_length` of 8192.
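The hybrid reward above can be sketched as follows. The 0.2 citation weight comes from the card; how attributions are parsed and how the two terms are combined (additive here) are assumptions:

```python
import re

def extract_attributions(answer: str) -> set[tuple[str, str]]:
    """Parse claim-level (claim, snippet-ID) pairs from a refined answer."""
    pairs = set()
    for m in re.finditer(r"<snippet id=([^>]+)>(.*?)</snippet>", answer, re.DOTALL):
        for sid in m.group(1).split(","):
            pairs.add((m.group(2).strip(), sid.strip()))
    return pairs

def citation_f1(predicted: set, gold: set) -> float:
    """F1 over claim-level snippet attributions (precision vs. recall)."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def hybrid_reward(judge_score: float, predicted: set, gold: set,
                  citation_weight: float = 0.2) -> float:
    """Combine the LLM rubric-judge score with the citation metric."""
    return judge_score + citation_weight * citation_f1(predicted, gold)
```

With a judge score of 0.8 and perfect attributions, this additive sketch yields a reward of 1.0; mis-cited or missing attributions lower the citation F1 and hence the total.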
Intended Use
This model is explicitly designated as a step-50 intermediate checkpoint. It is primarily intended for research and ablation studies related to:
- Evaluating different rubric designs.
- Analyzing the impact of citation weighting.
- Experimenting with various judge models.
For production-quality refiner applications, the developers recommend using later checkpoints from the same training run, or successor runs that may exclude static rubrics. Note that this checkpoint's reward signal includes both per-instance and static V3 rubrics.