Gen-Verse/ReasonFlux-PRM-1.5B

1.5B
BF16
32768
Jun 9, 2025
License: MIT

ReasonFlux-PRM-1.5B Overview

ReasonFlux-PRM-1.5B is a 1.5-billion-parameter, trajectory-aware process reward model (PRM) developed by Gen-Verse. It scores the quality of reasoning traces using both step-level and trajectory-level supervision, which yields fine-grained reward signals. It is designed to align with structured chain-of-thought data and can be used to improve the reasoning of larger language models.
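To make the distinction between step-level and trajectory-level signals concrete, here is a minimal sketch of blending the two into one scalar reward. The `ScoredTrace` container, the `combine_rewards` helper, and the `alpha` weighting are all hypothetical and chosen for illustration; the model card does not specify how the model itself aggregates these signals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredTrace:
    """Hypothetical container for the scores of one reasoning trace."""
    step_scores: List[float]   # one score per reasoning step, assumed in [0, 1]
    trajectory_score: float    # one score for the whole trace, assumed in [0, 1]


def combine_rewards(trace: ScoredTrace, alpha: float = 0.5) -> float:
    """Blend the mean step score with the trajectory score.

    alpha is an assumed weighting, not a documented parameter of the model:
    alpha=1.0 uses only step-level scores, alpha=0.0 only the trajectory score.
    """
    if not trace.step_scores:
        raise ValueError("trace must contain at least one step score")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    step_mean = sum(trace.step_scores) / len(trace.step_scores)
    return alpha * step_mean + (1.0 - alpha) * trace.trajectory_score


# Two steps averaging 0.75, a trajectory score of 0.25, equal weighting.
reward = combine_rewards(ScoredTrace([0.5, 1.0], 0.25), alpha=0.5)
print(reward)
```

A linear blend is only one possible choice; a minimum over step scores, for example, would penalize a single bad step more heavily.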

Key Capabilities

  • Trajectory-aware Scoring: Scores the entire reasoning path, not just the final answer.
  • Online/Offline Supervision: Supplies reward signals both offline (for example, scoring existing data) and online (for example, rewards during training), so it fits several training setups.
  • Dense Process Rewards: Provides step-by-step feedback for policy optimization during reinforcement learning.
  • Lightweight and Efficient: At 1.5 billion parameters, inference is inexpensive enough for resource-constrained environments and edge deployment.
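As a rough check on the footprint claim, the weight memory can be estimated from the parameter count and the BF16 precision (2 bytes per parameter). This is a back-of-the-envelope sketch: it treats the nominal 1.5 billion figure as exact and ignores activations, the KV cache, and framework overhead. The helper name `weight_memory_bytes` is illustrative.

```python
def weight_memory_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Memory needed for the weights alone; BF16 stores each parameter in 2 bytes."""
    return num_params * bytes_per_param


# Nominal parameter count from the model name; the exact count may differ.
params = 1_500_000_000
bytes_needed = weight_memory_bytes(params)

print(f"{bytes_needed / 1e9:.1f} GB")     # 3.0 GB
print(f"{bytes_needed / 2**30:.2f} GiB")  # 2.79 GiB
```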

Good For

  • Data Selection: Identifying high-quality training data for model distillation.
  • Reinforcement Learning Training: Providing dense process-level rewards to guide policy optimization.
  • Test-Time Scaling: Enabling reward-guided scaling during inference.
  • Resource-Constrained Applications: The small parameter count suits deployments with limited compute or memory.
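Two of the uses above, test-time scaling and data selection, reduce to ranking candidates by reward. The sketch below shows best-of-N selection and top-fraction filtering against a generic scoring callable. The `Scorer` interface, both helper functions, and `toy_score` are hypothetical; `toy_score` is a word-count stand-in so the example runs without the model, and a real setup would replace it with a call to the reward model.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (prompt, response) -> scalar reward, higher is better.
Scorer = Callable[[str, str], float]


def best_of_n(prompt: str, candidates: Sequence[str], score: Scorer) -> str:
    """Test-time scaling: return the candidate response with the highest reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score(prompt, c))


def select_top_fraction(
    pairs: Sequence[Tuple[str, str]], score: Scorer, keep: float = 0.5
) -> List[Tuple[str, str]]:
    """Data selection: keep the highest-scoring fraction of (prompt, response) pairs."""
    if not 0.0 < keep <= 1.0:
        raise ValueError("keep must be in (0, 1]")
    ranked = sorted(pairs, key=lambda p: score(p[0], p[1]), reverse=True)
    return ranked[: math.ceil(len(ranked) * keep)]


def toy_score(prompt: str, response: str) -> float:
    """Stand-in scorer (word count). NOT the reward model; for demonstration only."""
    return float(len(response.split()))


print(best_of_n("q", ["a b", "a b c", "a"], toy_score))
print(select_top_fraction([("q", "a"), ("q", "a b c"), ("q", "a b")], toy_score))
```

Keeping the scorer as a plain callable means the same selection code works whether rewards come from a local model, a batch job, or cached scores.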