lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50
The lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50 is a 4-billion-parameter Qwen3-based model: a GRPO checkpoint (step 50) of a research-refiner policy. It is fine-tuned on GPT-5.4 rubric-grounded refiner data to produce concise, citation-grounded answers from agent reasoning and search-tool outputs. The model generates structured responses in which every factual claim is wrapped in snippet-ID tags, making it suitable for research into rubric design, citation weighting, and judge selection.
Model Overview
The lihaoxin2020/qwen3-4b-refiner-gpt54-instance-rubric-gpt54-grpo-step50 is a 4-billion-parameter model based on the Qwen3 architecture, representing an intermediate checkpoint (step 50) from a GRPO (Group Relative Policy Optimization) training run. Its core function is to act as a research-refiner policy: it takes an agent's reasoning and search-tool output and generates a concise, citation-grounded refined answer.
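A minimal sketch of that input/output contract (the prompt wording, helper names, and tag-parsing details below are illustrative assumptions, not the model's actual chat template):

```python
import re

def build_refiner_prompt(reasoning: str, snippets: dict[str, str]) -> str:
    """Assemble a hypothetical refiner prompt from agent reasoning and
    search-tool snippets. The real template may differ."""
    snippet_block = "\n".join(
        f"<snippet id={sid}>{text}</snippet>" for sid, text in snippets.items()
    )
    return (
        "Refine the answer below. Wrap every factual claim in "
        "<snippet id=...>claim</snippet> tags, citing only IDs from the tool output.\n\n"
        f"Tool output:\n{snippet_block}\n\nAgent reasoning:\n{reasoning}"
    )

def cited_ids_are_valid(answer: str, snippets: dict[str, str]) -> bool:
    """Check that every ID cited in a refined answer appears in the tool output."""
    cited = {
        sid.strip()
        for m in re.finditer(r"<snippet id=([^>]+)>", answer)
        for sid in m.group(1).split(",")
    }
    return cited <= snippets.keys()
```

For example, an answer citing a snippet ID absent from the tool output, such as `<snippet id=S3>...</snippet>` when only S1 and S2 were retrieved, would fail the `cited_ids_are_valid` check.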
Key Capabilities & Training
- Refined Answer Generation: Produces answers where every factual claim is wrapped in `<snippet id=ID1,ID2,...>claim</snippet>` tags, citing only IDs present in the raw tool output.
- Rubric-Grounded Training: Fine-tuned on the `lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric` dataset, which includes per-instance positive and negative rubrics generated by GPT-5.4.
- Hybrid Reward System: Uses a reward function combining an LLM rubric judge (Qwen3.5-35B-A3B) with a paper-citation quality metric (weighted 0.2, computed as precision/recall/F1 over claim-level snippet attributions).
- Context Handling: Supports a `max_prompt_token_length` of 6144 and a `response_length` of 1024, for a total `max_token_length` of 8192.
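The hybrid reward above can be sketched as follows. The 0.2 citation weight comes from the card; how attributions are parsed and how the two terms are combined (additive here) are assumptions:

```python
import re

def extract_attributions(answer: str) -> set[tuple[str, str]]:
    """Parse claim-level (claim, snippet-ID) pairs from a refined answer."""
    pairs = set()
    for m in re.finditer(r"<snippet id=([^>]+)>(.*?)</snippet>", answer, re.DOTALL):
        for sid in m.group(1).split(","):
            pairs.add((m.group(2).strip(), sid.strip()))
    return pairs

def citation_f1(predicted: set, gold: set) -> float:
    """F1 over claim-level snippet attributions (precision vs. recall)."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def hybrid_reward(judge_score: float, predicted: set, gold: set,
                  citation_weight: float = 0.2) -> float:
    """Combine the LLM rubric-judge score with the citation metric."""
    return judge_score + citation_weight * citation_f1(predicted, gold)
```

With a judge score of 0.8 and perfect attributions, this additive sketch yields a reward of 1.0; mis-cited or missing attributions lower the citation F1 and hence the total.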
Intended Use
This model is explicitly designated as a step-50 intermediate checkpoint. It is primarily intended for research and ablation studies related to:
- Evaluating different rubric designs.
- Analyzing the impact of citation weighting.
- Experimenting with various judge models.
For production-quality refiner applications, the developers recommend using later checkpoints from the same training run, or successor runs that may exclude static rubrics. Note that this checkpoint's reward signal includes both per-instance and static V3 rubrics.