SUSTech-NLP/UniRRM-8B

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:8BQuant:FP8Ctx Length:32kPublished:May 8, 2026License:apache-2.0Architecture:Transformer0.0K Open Weights Warm

UniRRM-8B by SUSTech-NLP is an 8 billion parameter unified reasoning reward model built on Qwen3-8B, supporting 103 languages and multiple evaluation paradigms (pairwise, listwise, pointwise). It employs a three-stage structured reasoning workflow for dynamic rubric generation and detailed evaluation. This model is designed for objective, multi-dimensional evaluation of AI-generated responses across diverse linguistic and task contexts.

Loading preview...

UniRRM-8B: Unified Reasoning Reward Model

UniRRM-8B, developed by SUSTech-NLP, is an 8 billion parameter model based on Qwen3-8B, designed as a unified reasoning reward model. Its core innovation lies in supporting 103 languages and multiple evaluation paradigms (pairwise, listwise, and pointwise) within a single model.

Key Capabilities

  • Multilingual Evaluation: Processes and evaluates responses across 103 languages, trained on diverse multilingual data.
  • Unified Evaluation Paradigms: Adapts to different comparison formats, allowing for flexible assessment of AI outputs.
  • Adaptive Rubric Generation: Dynamically creates both task-generic and instruction-specific evaluation criteria, each with a 1-5 scoring scale.
  • Structured Reasoning: Utilizes a three-stage process: Deep Analysis (identifying task intent and risks), Adaptive Rubric Generation, and Detailed Evaluation (applying rubrics with evidence extraction and scoring).
  • Efficient Performance: Delivers strong evaluation capabilities within a compact 8B parameter size.

Training and Architecture

The model was trained using a two-stage pipeline: Supervised Fine-Tuning (SFT) on the UniRRM-SFT dataset (35,749 samples distilled from GPT-OSS-120B) to establish structured reasoning, followed by Reinforcement Learning with Group Relative Policy Optimization (GRPO) on the UniRRM-RL dataset (32,832 samples). It uses a Qwen3ForCausalLM architecture with bfloat16 precision and a maximum context length of 40960 tokens. The training incorporated a composite reward function focusing on format compliance, outcome consistency, and rubric quality.

Use Cases

UniRRM-8B is ideal for developers and researchers needing a robust, multilingual, and flexible model to objectively evaluate the quality of AI-generated text, particularly for tasks requiring detailed, criterion-based assessment across various languages and comparison formats.