Thrillcrazyer/TACReward7B

7.6B
FP8
131072
Oct 20, 2025
Hugging Face
Overview

TACReward7B is a 7-billion-parameter, reasoning-aware proxy reward model developed by BAELAB at Pusan National University. It addresses the limitations of binarized outcome rewards in sparse-reward policy gradient methods, especially for complex reasoning tasks such as mathematical problem solving. Unlike models that estimate only overall reasoning quality, TACReward treats reasoning as a structured process and provides feedback on intermediate steps.

Key Capabilities

  • Reasoning-Aware Rewards: Generates a scalar reward between 0 and 1 by measuring stepwise structural deviations between teacher and policy reasoning.
  • Process Mining Integration: Utilizes process mining techniques to analyze and quantify the quality of reasoning steps.
  • Seamless Integration: Plugs into existing sparse-reward frameworks without additional human annotation or architectural modifications.
  • Improved Reasoning Quality: In experiments on mathematical reasoning benchmarks, integrating TACReward consistently improves both the structural quality of reasoning and the overall performance of policy models.
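
To make the idea of a structural, stepwise comparison concrete, here is a toy sketch. It is not TACReward's actual scoring method, which this card does not specify. It assumes each reasoning trace has already been reduced to a list of step labels (the labels and the function names `directly_follows` and `structural_reward` are hypothetical) and scores the overlap of directly-follows pairs, a basic construct from process mining.

```python
from typing import List, Set, Tuple


def directly_follows(steps: List[str]) -> Set[Tuple[str, str]]:
    """Return the set of (a, b) pairs where step b immediately follows step a."""
    return set(zip(steps, steps[1:]))


def structural_reward(teacher: List[str], policy: List[str]) -> float:
    """Jaccard overlap of directly-follows pairs, a scalar in [0, 1].

    Illustrative stand-in only: the real model's scoring is not described here.
    """
    t, p = directly_follows(teacher), directly_follows(policy)
    if not t and not p:
        # Neither trace has a transition; fall back to exact equality.
        return 1.0 if teacher == policy else 0.0
    return len(t & p) / len(t | p)


# Hypothetical step labels for a math problem.
teacher_trace = ["parse", "set_up_equation", "solve", "verify"]
policy_trace = ["parse", "set_up_equation", "solve", "answer"]
score = structural_reward(teacher_trace, policy_trace)
```

In this sketch the two traces share two of four distinct transitions, so the policy trace is penalized for replacing the verification step even though the earlier steps match.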

Good For

  • Fine-tuning language models for reasoning tasks using reinforcement learning.
  • Enhancing feedback in sparse-reward settings where intermediate reasoning steps are crucial.
  • Applications requiring structured and verifiable reasoning, such as mathematical problem solving.
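
As a rough illustration of how a proxy reward in the range 0 to 1 might be folded into a binary outcome reward during reinforcement learning, consider the minimal sketch below. The linear blend, the `weight` parameter, and the function name `combined_reward` are assumptions made for illustration; this card does not state how TACReward is combined with outcome rewards in practice.

```python
def combined_reward(outcome_correct: bool, proxy_reward: float, weight: float = 0.5) -> float:
    """Blend a binary outcome reward with a proxy reward in [0, 1].

    Assumed formula (not from the model card):
        (1 - weight) * outcome + weight * proxy_reward
    """
    if not 0.0 <= proxy_reward <= 1.0:
        raise ValueError("proxy_reward must lie in [0, 1]")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    outcome = 1.0 if outcome_correct else 0.0
    return (1.0 - weight) * outcome + weight * proxy_reward


# A wrong final answer with well-structured reasoning still earns partial credit.
partial = combined_reward(outcome_correct=False, proxy_reward=0.8, weight=0.25)
# A correct final answer with mediocre reasoning structure earns less than 1.0.
full = combined_reward(outcome_correct=True, proxy_reward=0.5)
```

With a blend like this, rollouts that fail the final check no longer all receive the same zero reward, which is the sparsity problem the model is meant to ease.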