Gen-Verse/ReasonFlux-PRM-Qwen-2.5-7B

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kTool Calling:SupportedPublished:Jun 22, 2025License:mitArchitecture:Transformer0.0K Open Weights Warm

Gen-Verse/ReasonFlux-PRM-Qwen-2.5-7B is a 7 billion parameter end-to-end trained policy model developed by Gen-Verse, specifically designed for long Chain-of-Thought (CoT) reasoning. It excels at solving complex tasks and problems, particularly in math and science reasoning. This model leverages a trajectory-aware process reward model (PRM) for data selection and reinforcement learning, enabling fine-grained reward assignment aligned with structured reasoning traces.

Loading preview...

ReasonFlux-PRM-Qwen-2.5-7B Overview

Gen-Verse/ReasonFlux-PRM-Qwen-2.5-7B is a 7 billion parameter policy model, developed by Gen-Verse, that has undergone an end-to-end training process to enhance its long Chain-of-Thought (CoT) reasoning capabilities. This model is particularly adept at tackling complex problems in domains such as mathematics and science. Its training methodology involves initial Supervised Fine-Tuning (SFT) on 1,000 high-quality Trajectory-Response pairs, which were selected using the ReasonFlux-PRM-7B model. This SFT phase is followed by Reinforcement Learning (RL) training, integrating the ReasonFlux-PRM-7B with GRPO.

Key Capabilities

  • Long CoT Reasoning: Specialized in generating and evaluating extended, multi-step reasoning chains.
  • Complex Problem Solving: Highly effective for tasks requiring intricate logical deduction and problem-solving.
  • Trajectory-Aware Training: Benefits from a process reward model (PRM) that evaluates reasoning traces at both step and trajectory levels, ensuring high-quality training data and dense process-level rewards.

Good for

  • Math and Science Reasoning: Ideal for applications requiring robust analytical and problem-solving skills in technical fields.
  • Reinforcement Learning (RL) Integration: Can be used as a policy model within RL frameworks, leveraging its process-level reward capabilities.
  • Structured Reasoning Tasks: Suitable for scenarios where explicit, verifiable reasoning steps are crucial for task completion.