dyyyyyyyy/FAPO-GenRM-4B
Text Generation · Model Size: 4B · Quantization: BF16 · Context Length: 32K · Published: Oct 23, 2025 · License: apache-2.0 · Architecture: Transformer

The dyyyyyyyy/FAPO-GenRM-4B is a 4-billion-parameter Generative Reward Model (GenRM) developed by Yuyang Ding, Chi Zhang, and colleagues, as described in the FAPO research paper. The model is designed to provide process-level rewards for Reinforcement Learning with Verifiable Rewards (RLVR) by accurately detecting and localizing reasoning errors. It improves the reliability and efficiency of reasoning in large language models by identifying and penalizing 'flawed-positive' rollouts: trajectories that reach the correct final answer through flawed reasoning.


FAPO-GenRM-4B: A Generative Reward Model for Reliable Reasoning

The dyyyyyyyy/FAPO-GenRM-4B is a 4-billion-parameter Generative Reward Model (GenRM) introduced in the paper "FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning." Developed by Yuyang Ding, Chi Zhang, and their team, this model is central to the FAPO framework, which aims to improve the reasoning capabilities of large language models (LLMs) within the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm.
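
To make the core idea concrete, here is a minimal sketch of FAPO-style reward shaping, assuming a binary verifiable outcome reward and a fixed penalty for flawed positives. The function name, parameter names, and the penalty value are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of FAPO-style reward shaping (illustrative names and
# penalty value; not the paper's code).

def fapo_reward(answer_correct: bool, genrm_flags_flaw: bool,
                flaw_penalty: float = 0.5) -> float:
    """Combine a verifiable outcome reward with the GenRM's process check."""
    if not answer_correct:
        return 0.0                 # wrong final answer: no reward
    if genrm_flags_flaw:
        return 1.0 - flaw_penalty  # flawed positive: discounted reward
    return 1.0                     # correct answer, sound reasoning
```

In the full framework, the effect of this penalty is not static: as described below, flawed-positive rollouts remain useful shortcuts early in training and are increasingly discouraged later.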

Key Capabilities

  • Flawed-Positive Rollout Detection: Accurately identifies and localizes flawed patterns in reasoning trajectories, such as answer-guessing or jump-in-reasoning, which can otherwise lead to unreliable policy optimization (see the usage sketch after this list).
  • Process-Level Reward Generation: Provides precise, granular rewards that pinpoint specific reasoning errors, enabling more effective policy refinement.
  • Enhanced Reasoning Reliability: Contributes to a method that allows LLMs to leverage flawed-positive rollouts as shortcuts in early training stages while gradually shifting towards more reliable reasoning in later stages.
  • Training Stability: Designed to improve the stability of RLVR training by addressing the issues caused by flawed-positive signals.
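
For readers who want to try the model directly, the following is a minimal sketch of querying FAPO-GenRM-4B as a critique model with Hugging Face transformers. The judge prompt, the example problem, and the generation settings are assumptions for illustration; consult the model card and the FAPO paper for the exact prompt template the model was trained with:

```python
# Minimal sketch of querying FAPO-GenRM-4B for reasoning-flaw critiques.
# The judge prompt below is an assumption, not the official template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dyyyyyyyy/FAPO-GenRM-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

problem = "If 3x + 5 = 20, what is x?"
# Correct final answer with no derivation: an 'answer-guessing'
# flawed positive the GenRM should flag.
solution = "The answer is x = 5."

messages = [{"role": "user", "content":
             f"Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
             "Check the reasoning step by step and point out any flaws."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```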

Good for

  • Researchers and developers working on Reinforcement Learning with Verifiable Rewards (RLVR).
  • Improving the reliability and efficiency of reasoning in large language models.
  • Developing reward models that can detect and penalize specific reasoning flaws.
  • Exploring novel policy optimization techniques that adapt to different stages of training.