lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50

Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Apr 20, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Cold

The lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50 is a 4-billion-parameter Qwen3-based model, specifically a GRPO checkpoint (step 50) of a research-refiner policy. It is fine-tuned on GPT-5.4 rubric-grounded refiner data to produce concise, citation-grounded answers from agent reasoning and search-tool outputs. The model generates structured responses in which every factual claim is wrapped in snippet-ID tags, making it suitable for research into rubric design, citation weighting, and judge selection.


Model Overview

The lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50 is a 4-billion-parameter model based on the Qwen3 architecture, representing an intermediate checkpoint (step 50) from a GRPO (Group Relative Policy Optimization) training run. Its core function is to act as a research-refiner policy: it takes an agent's reasoning and search-tool output and generates a concise, citation-grounded refined answer.

Key Capabilities & Training

  • Refined Answer Generation: Produces answers where every factual claim is wrapped in <snippet id=ID1,ID2,...>claim</snippet> tags, citing only IDs present in the raw tool output.
  • Rubric-Grounded Training: Fine-tuned on lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric dataset, which includes per-instance positive and negative rubrics generated by GPT-5.4.
  • Hybrid Reward System: Utilizes a reward function combining an LLM rubric judge (Qwen3.5-35B-A3B) and a paper-citation quality metric (with a 0.2 weight for precision/recall/F1 over claim-level snippet attributions).
  • Context Handling: Supports a max_prompt_token_length of 6144 and response_length of 1024, with a total max_token_length of 8192.
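The citation format above can be checked mechanically. The following is a minimal sketch (not code from the model card) that extracts `<snippet id=ID1,ID2,...>claim</snippet>` spans from a refined answer and computes claim-level citation precision against the set of IDs actually present in the raw tool output; the helper names and the example IDs are illustrative assumptions.

```python
import re

# Hypothetical helpers (not from the model card) for parsing the
# <snippet id=ID1,ID2,...>claim</snippet> citation format.
SNIPPET_RE = re.compile(r"<snippet id=([^>]+)>(.*?)</snippet>", re.DOTALL)

def extract_claims(answer: str) -> list[tuple[str, set[str]]]:
    """Return (claim_text, cited_ids) pairs found in the refined answer."""
    return [
        (claim.strip(), {i.strip() for i in ids.split(",")})
        for ids, claim in SNIPPET_RE.findall(answer)
    ]

def citation_precision(answer: str, tool_output_ids: set[str]) -> float:
    """Fraction of cited IDs that actually appear in the raw tool output."""
    cited = set().union(*[ids for _, ids in extract_claims(answer)])
    if not cited:
        return 0.0
    return len(cited & tool_output_ids) / len(cited)

# Illustrative answer: two claims, one of which cites an ID (S9) that the
# tool output never returned.
answer = (
    "<snippet id=S1,S2>GRPO was used for fine-tuning.</snippet> "
    "<snippet id=S9>Unsupported claim.</snippet>"
)
claims = extract_claims(answer)                       # two (claim, ids) pairs
precision = citation_precision(answer, {"S1", "S2", "S3"})
```

A recall metric would mirror this, counting how many relevant tool-output snippets are actually cited; the card's reward uses precision, recall, and F1 over such claim-level attributions.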

Intended Use

This model is explicitly designated as a step-50 intermediate checkpoint. It is primarily intended for research and ablation studies related to:

  • Evaluating different rubric designs.
  • Analyzing the impact of citation weighting.
  • Experimenting with various judge models.

For production-quality refiner applications, the developers recommend using later checkpoints from the same training run, or successor runs that may exclude static rubrics. This checkpoint's reward signal includes both per-instance and static V3 rubrics.
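The hybrid reward described in this card (an LLM rubric-judge score combined with a citation-quality metric at 0.2 weight) can be sketched as follows. Only the 0.2 citation weight comes from the card; the complement weighting of the rubric score, the assumption that both signals lie in [0, 1], and the use of F1 as the single citation number are illustrative assumptions.

```python
# Hedged sketch of the hybrid reward. The 0.2 citation weight is from the
# model card; the exact combination rule and score ranges are assumptions.

def citation_f1(precision: float, recall: float) -> float:
    """Harmonic mean of claim-level citation precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def hybrid_reward(rubric_score: float, precision: float, recall: float,
                  citation_weight: float = 0.2) -> float:
    """Combine the rubric-judge score (assumed in [0, 1]) with citation F1.

    The complement weighting (1 - citation_weight) on the rubric score is an
    assumption; the card states only that the citation metric gets 0.2.
    """
    return ((1 - citation_weight) * rubric_score
            + citation_weight * citation_f1(precision, recall))

reward = hybrid_reward(rubric_score=0.8, precision=0.75, recall=0.6)
```

In the training run described here, the rubric score would come from the Qwen3.5-35B-A3B judge scoring the answer against the per-instance (and static V3) rubrics, and precision/recall from claim-level snippet attributions.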