THU-KEG/PairJudge-RM
THU-KEG/PairJudge-RM is a 7.6 billion parameter reward model developed by THU-KEG, fine-tuned from Qwen-2.5-7B-Instruct. It specializes in pairwise judgment for mathematical reasoning tasks, utilizing chain-of-thought (CoT) to compare candidate solutions. This model is designed to enhance Best-of-N sampling by selecting the best answer through a knockout tournament strategy, offering a transparent and effective evaluation method.
THU-KEG/PairJudge-RM: A Reward Model for Mathematical Reasoning
PairJudge RM is a 7.6 billion parameter reward model developed by THU-KEG, specifically designed to improve Best-of-N sampling for mathematical reasoning tasks. Unlike traditional reward models that assign an absolute score to each solution, PairJudge RM evaluates candidate solutions in pairs and judges which of the two is more likely to be correct through a transparent, step-by-step verification process.
Key Capabilities
- Pairwise Judgment: Compares two candidate solutions simultaneously to identify the superior one.
- Chain-of-Thought (CoT) Reasoning: Employs CoT to meticulously verify each step within the candidate solutions, providing clear and interpretable evaluations.
- Enhanced Best-of-N Sampling: Facilitates a knockout tournament strategy to select the optimal solution from multiple candidates, particularly beneficial for complex mathematical problems.
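The knockout tournament idea can be illustrated with a minimal sketch. It assumes a hypothetical `judge(a, b)` callable that returns `True` when the first candidate wins; in practice this would wrap a call to PairJudge RM and parse its CoT verdict, which is not shown here. The function name `knockout_select`, the bye rule for odd-sized rounds, and the toy judge are illustrative choices, not the authors' implementation, and the pairing procedure in the paper may differ.

```python
import random
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def knockout_select(
    candidates: List[T],
    judge: Callable[[T, T], bool],
    rng: Optional[random.Random] = None,
) -> T:
    """Pick one candidate via single-elimination pairwise comparisons.

    `judge(a, b)` is a hypothetical stand-in for the reward model:
    it returns True if `a` beats `b`, False otherwise.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    if rng is not None:
        rng.shuffle(pool)  # optional random seeding of the bracket
    while len(pool) > 1:
        winners = []
        # Compare neighbours (0 vs 1, 2 vs 3, ...); the winner advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if judge(a, b) else b)
        # With an odd pool size, the last candidate advances unopposed.
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    return pool[0]


# Toy judge for demonstration only: prefers the answer closer to 42.
calls = []


def toy_judge(a: str, b: str) -> bool:
    calls.append((a, b))
    return abs(int(a) - 42) <= abs(int(b) - 42)


best = knockout_select(["41", "42", "40", "42", "39"], toy_judge)
print(best, len(calls))
```

A single-elimination bracket over N candidates needs N - 1 pairwise judgments, so the cost grows linearly with N rather than quadratically as it would for a full round-robin comparison.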
Model Architecture and Training
PairJudge RM is fine-tuned from Qwen-2.5-7B-Instruct on the PAIRJUDGE-432K dataset, using the Adam optimizer with a learning rate of 1×10⁻⁵ and a batch size of 128 for 8 epochs.
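As a rough sanity check on the training budget, the stated hyperparameters can be turned into an approximate optimizer step count. The sketch below assumes the dataset holds about 432,000 examples, inferred only from the "432K" in its name, and that each optimizer step consumes one full batch; the exact dataset size and any gradient accumulation settings are not stated here.

```python
import math

# Values stated in the model card.
learning_rate = 1e-5
batch_size = 128
epochs = 8

# ASSUMPTION: dataset size inferred from the name "PAIRJUDGE-432K".
approx_dataset_size = 432_000

steps_per_epoch = math.ceil(approx_dataset_size / batch_size)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```

Under these assumptions the run comes to roughly 27,000 optimizer steps.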
Good For
- Evaluating mathematical problem-solving: Provides a robust method for assessing the correctness of different approaches to math problems.
- Improving LLM outputs for reasoning tasks: Can be integrated into workflows to select higher-quality responses from language models.
- Research in reward modeling: Offers a novel approach to reward signal generation based on comparative CoT reasoning.
For more technical details, refer to the PairJudge RM paper and the official code repository.