ilgee/Multiclass-Think-RM-8B

Text generation · Concurrency cost: 1 · Model size: 8B · Quantization: FP8 · Context length: 32k · Published: May 8, 2025 · License: llama3.1 · Architecture: Transformer

ilgee/Multiclass-Think-RM-8B is an 8-billion-parameter generative reward model developed by Ilgee Hong et al. Fine-tuned from Llama-3.1-8B-Instruct, it performs an internal thinking process for long-horizon reasoning, enabling more nuanced and interpretable preference judgments. The model excels at evaluating complex, reasoning-intensive tasks, outputting multiclass preference scores from -3 to 3 that indicate which of two responses is better and by how much.


Overview

ilgee/Multiclass-Think-RM-8B is an 8 billion parameter generative reward model, fine-tuned from Llama-3.1-8B-Instruct. Developed by Ilgee Hong et al., this model introduces a novel approach to reward modeling by incorporating an internal thinking process, allowing for long-horizon reasoning before generating preference judgments. This distinguishes it from traditional Bradley-Terry models or shallow chain-of-thought generative reward models.
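In practice, a generative reward model like this one is prompted with a user question and two candidate responses, then generates its reasoning followed by a verdict. A minimal sketch of building such a pairwise-comparison prompt (the template below is a hypothetical illustration for this model card, not the model's official format; consult the Think-RM paper or repository for the exact prompt the model was trained on):

```python
def build_pairwise_prompt(question: str, response_a: str, response_b: str) -> str:
    """Build a pairwise-comparison prompt for a generative reward model.

    NOTE: this template is an assumption for illustration; the actual
    Think-RM prompt format may differ.
    """
    return (
        "Compare the two assistant responses to the user question below.\n"
        "Think step by step, then output a single integer score from -3 to 3,\n"
        "where -3 means Assistant A is much better and 3 means Assistant B is much better.\n\n"
        f"[User Question]\n{question}\n\n"
        f"[Assistant A]\n{response_a}\n\n"
        f"[Assistant B]\n{response_b}\n"
    )

prompt = build_pairwise_prompt(
    "What is 2 + 2?",
    "2 + 2 = 4.",
    "The answer is 5.",
)
print(prompt)
```

The resulting string would then be fed to the model (e.g. via a standard text-generation API), which emits its internal deliberation before the final score.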

Key Capabilities

  • Long-horizon reasoning: Employs an internal deliberation mechanism for complex tasks.
  • Multiclass preference output: Provides a granular scoring system from -3 (Assistant A much better) to 3 (Assistant B much better), offering fine-grained assessment of preference strength.
  • Interpretable reasoning trajectories: The internal thinking process can lead to more understandable evaluation paths.
  • Strong performance: Designed to perform well on out-of-distribution and reasoning-heavy benchmarks, as detailed in the accompanying paper Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models.
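Downstream, the multiclass score can be collapsed into a winner plus a strength label for logging or pairwise training data. A small helper sketching that mapping, following the -3 to 3 scale described above (treating 0 as a tie and the exact label wording are our assumptions):

```python
def interpret_score(score: int) -> tuple[str, str]:
    """Map a multiclass preference score in [-3, 3] to (winner, strength).

    Per the scale above: negative scores favor Assistant A, positive
    scores favor Assistant B. Treating 0 as a tie is an assumption.
    """
    if not -3 <= score <= 3:
        raise ValueError(f"score must be in [-3, 3], got {score}")
    if score == 0:
        return ("tie", "equal")
    winner = "Assistant A" if score < 0 else "Assistant B"
    strength = {1: "slightly better", 2: "better", 3: "much better"}[abs(score)]
    return (winner, strength)

print(interpret_score(-3))  # ('Assistant A', 'much better')
print(interpret_score(2))   # ('Assistant B', 'better')
```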

Good for

  • Evaluating and comparing AI assistant responses in complex, reasoning-intensive scenarios.
  • Applications requiring nuanced and interpretable preference judgments.
  • Research into advanced reward modeling techniques and long-horizon reasoning in LLMs.