dyyyyyyyy/FAPO-GenRM-4B
Text Generation · Model Size: 4B · Quantization: BF16 · Context Length: 32K · Published: Oct 23, 2025 · License: apache-2.0 · Architecture: Transformer

The dyyyyyyyy/FAPO-GenRM-4B is a 4-billion-parameter Generative Reward Model (GenRM) developed by Yuyang Ding, Chi Zhang, and colleagues, as described in the FAPO research paper. The model is designed to provide process-level rewards for Reinforcement Learning with Verifiable Rewards (RLVR) by accurately detecting and localizing reasoning errors. It improves the reliability and efficiency of reasoning in large language models by identifying and penalizing 'flawed-positive' rollouts: trajectories that reach the correct final answer through flawed reasoning.


FAPO-GenRM-4B: A Generative Reward Model for Reliable Reasoning

The dyyyyyyyy/FAPO-GenRM-4B is a 4-billion-parameter Generative Reward Model (GenRM) introduced in the paper "FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning." Developed by Yuyang Ding, Chi Zhang, and their team, this model is central to the FAPO framework, which aims to improve the reasoning capabilities of large language models (LLMs) within the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm.
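
To make the core idea concrete, here is a minimal sketch of FAPO-style reward shaping, assuming a binary verifiable outcome reward and a fixed penalty for flawed positives. The function name, parameter names, and the penalty value are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of FAPO-style reward shaping (illustrative names and
# penalty value; not the paper's code).

def fapo_reward(answer_correct: bool, genrm_flags_flaw: bool,
                flaw_penalty: float = 0.5) -> float:
    """Combine a verifiable outcome reward with the GenRM's process check."""
    if not answer_correct:
        return 0.0                 # wrong final answer: no reward
    if genrm_flags_flaw:
        return 1.0 - flaw_penalty  # flawed positive: discounted reward
    return 1.0                     # correct answer, sound reasoning
```

In the full framework, the effect of this penalty is not static: as described below, flawed-positive rollouts remain useful shortcuts early in training and are increasingly discouraged later.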

Key Capabilities

  • Flawed-Positive Rollout Detection: Accurately identifies and localizes flawed patterns in reasoning trajectories, such as answer-guessing or jump-in-reasoning, which can otherwise lead to unreliable policy optimization (see the usage sketch after this list).
  • Process-Level Reward Generation: Provides precise, granular rewards that pinpoint specific reasoning errors, enabling more effective policy refinement.
  • Enhanced Reasoning Reliability: Contributes to a method that allows LLMs to leverage flawed-positive rollouts as shortcuts in early training stages while gradually shifting towards more reliable reasoning in later stages.
  • Training Stability: Designed to improve the stability of RLVR training by addressing the issues caused by flawed-positive signals.
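
For readers who want to try the model directly, the following is a minimal sketch of querying FAPO-GenRM-4B as a critique model with Hugging Face transformers. The judge prompt, the example problem, and the generation settings are assumptions for illustration; consult the model card and the FAPO paper for the exact prompt template the model was trained with:

```python
# Minimal sketch of querying FAPO-GenRM-4B for reasoning-flaw critiques.
# The judge prompt below is an assumption, not the official template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dyyyyyyyy/FAPO-GenRM-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

problem = "If 3x + 5 = 20, what is x?"
# Correct final answer with no derivation: an 'answer-guessing'
# flawed positive the GenRM should flag.
solution = "The answer is x = 5."

messages = [{"role": "user", "content":
             f"Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
             "Check the reasoning step by step and point out any flaws."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```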

Good for

  • Researchers and developers working on Reinforcement Learning with Verifiable Rewards (RLVR).
  • Improving the reliability and efficiency of reasoning in large language models.
  • Developing reward models that can detect and penalize specific reasoning flaws.
  • Exploring novel policy optimization techniques that adapt to different stages of training.