SUSTech-NLP/UniRRM-8B
UniRRM-8B by SUSTech-NLP is an 8 billion parameter unified reasoning reward model built on Qwen3-8B, supporting 103 languages and multiple evaluation paradigms (pairwise, listwise, pointwise). It employs a three-stage structured reasoning workflow for dynamic rubric generation and detailed evaluation. This model is designed for objective, multi-dimensional evaluation of AI-generated responses across diverse linguistic and task contexts.
Loading preview...
UniRRM-8B: Unified Reasoning Reward Model
UniRRM-8B, developed by SUSTech-NLP, is an 8 billion parameter model based on Qwen3-8B, designed as a unified reasoning reward model. Its core innovation lies in supporting 103 languages and multiple evaluation paradigms (pairwise, listwise, and pointwise) within a single model.
Key Capabilities
- Multilingual Evaluation: Processes and evaluates responses across 103 languages, trained on diverse multilingual data.
- Unified Evaluation Paradigms: Adapts to different comparison formats, allowing for flexible assessment of AI outputs.
- Adaptive Rubric Generation: Dynamically creates both task-generic and instruction-specific evaluation criteria, each with a 1-5 scoring scale.
- Structured Reasoning: Utilizes a three-stage process: Deep Analysis (identifying task intent and risks), Adaptive Rubric Generation, and Detailed Evaluation (applying rubrics with evidence extraction and scoring).
- Efficient Performance: Delivers strong evaluation capabilities within a compact 8B parameter size.
Training and Architecture
The model was trained using a two-stage pipeline: Supervised Fine-Tuning (SFT) on the UniRRM-SFT dataset (35,749 samples distilled from GPT-OSS-120B) to establish structured reasoning, followed by Reinforcement Learning with Group Relative Policy Optimization (GRPO) on the UniRRM-RL dataset (32,832 samples). It uses a Qwen3ForCausalLM architecture with bfloat16 precision and a maximum context length of 40960 tokens. The training incorporated a composite reward function focusing on format compliance, outcome consistency, and rubric quality.
Use Cases
UniRRM-8B is ideal for developers and researchers needing a robust, multilingual, and flexible model to objectively evaluate the quality of AI-generated text, particularly for tasks requiring detailed, criterion-based assessment across various languages and comparison formats.