TACReward7B is a 7-billion-parameter reasoning-aware proxy reward model developed by BAELAB at Pusan National University. It provides fine-grained feedback on intermediate reasoning steps of language models, particularly in tasks such as mathematical problem solving. The model uses process mining techniques to aggregate stepwise structural deviations between teacher and policy reasoning, producing a scalar reward in [0, 1]. It is primarily intended to improve the structural quality of reasoning under sparse-reward reinforcement learning.
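To illustrate the idea of aggregating stepwise structural deviations into a single scalar reward, the sketch below compares a teacher reasoning trace with a policy trace step by step and averages a per-step similarity into a score in [0, 1]. This is a toy illustration only: the function name, the use of `difflib.SequenceMatcher` as the deviation measure, and the simple positional alignment are all assumptions for demonstration, not the model's actual process-mining computation.

```python
from difflib import SequenceMatcher

def structural_reward(teacher_steps, policy_steps):
    """Toy proxy reward: average per-step similarity between teacher
    and policy reasoning traces, yielding a scalar in [0, 1].
    (Hypothetical sketch, not TACReward7B's actual computation.)"""
    if not teacher_steps and not policy_steps:
        return 1.0  # two empty traces: no structural deviation
    n = max(len(teacher_steps), len(policy_steps))
    total = 0.0
    for i in range(n):
        t = teacher_steps[i] if i < len(teacher_steps) else ""
        p = policy_steps[i] if i < len(policy_steps) else ""
        # Similarity of positionally aligned steps; a missing step
        # on either side counts as a full deviation (ratio 0.0).
        total += SequenceMatcher(None, t, p).ratio()
    return total / n
```

A real process-mining aggregation would align steps structurally rather than by position, but the interface, a pair of traces in and one scalar reward out, matches the description above.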