test-time-scaling/J1_7B_RL

Text Generation · Model Size: 7.6B · Quant: FP8 · Context Length: 32k · Concurrency Cost: 1 · Published: May 21, 2025 · Architecture: Transformer

J1-7B-RL is a 7.6 billion parameter LLM-as-a-Judge model, based on Qwen2.5-7B-Base, developed through Supervised Fine-Tuning and Reinforcement Learning. It is specifically designed to leverage Simple Test-Time Scaling (STTS) for enhanced reflective reasoning and improved judgment performance. This model serves as an improved preference judge for evaluating the quality of other LLM outputs, demonstrating superior scaling behavior under STTS.


J1-7B-RL: An LLM-as-a-Judge with Enhanced Test-Time Scaling

J1-7B-RL is a 7.6 billion parameter model, built upon the Qwen2.5-7B-Base architecture, specifically engineered to function as an LLM-as-a-Judge. Its development involved a two-stage training process: initial Supervised Fine-Tuning (SFT) on a curated dataset (J1-SFT-53K) followed by Reinforcement Learning (RL) using the Reinforce++ algorithm on the English subset of the RISE dataset.

Key Capabilities and Features

  • Enhanced Reflective Reasoning: The model is trained to make effective use of reflective reasoning tokens via its two-stage SFT-then-RL paradigm.
  • STTS Compatibility: J1-7B-RL demonstrates superior scaling behavior when integrated with Simple Test-Time Scaling (STTS) techniques, outperforming previous LLM-as-a-Judge models in this regard.
  • Improved Judgment Performance: It achieves a 4.8% improvement in overall judgment performance and exhibits a 5.1% stronger scaling trend under STTS, making it a more effective tool for evaluating LLM outputs.
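To make the STTS mechanism concrete, the sketch below shows one common form of Simple Test-Time Scaling, sometimes called budget forcing: each time the model finishes its reasoning, a wait token is appended and generation is resumed, forcing additional reflection. The function and parameter names here (`apply_stts_sketch`, `num_waits`, `wait_token`) and the stub generator are illustrative assumptions, not the model's documented interface.

```python
# Minimal sketch of Simple Test-Time Scaling (STTS) via budget forcing.
# Assumption: `generate` is any callable that continues a text prompt;
# a real setup would wrap the J1-7B-RL model, but here we stub it so
# the sketch is self-contained.

def apply_stts_sketch(generate, prompt, num_waits=2, wait_token="Wait"):
    """Force extra reflection: each time generation stops, append the
    wait token and resume, extending the reasoning trace."""
    text = generate(prompt)
    for _ in range(num_waits):
        # Re-open the reasoning trace and ask the model to reconsider.
        text = generate(text + "\n" + wait_token + ",")
    return text

# Stub generator for illustration only.
def stub_generate(prompt):
    return prompt + " ...reasoning step..."

out = apply_stts_sketch(stub_generate, "Judge: which answer is better?",
                        num_waits=2)
# The trace now contains one initial pass plus two forced reflections.
```

Under this scheme, `num_waits` is the knob that trades extra inference compute for judgment quality, which is the scaling trend the numbers above refer to.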

Performance Highlights

Evaluated across four diverse preference datasets (RewardBench, RewardMath, Anthropic Harmless, CodePrefBench), J1-7B-RL (SFT + RL) significantly outperforms other state-of-the-art LLM-as-a-Judge models in its size class. Notably, it achieved 90.15 on RewardMath and 67.80 on CodePrefBench, contributing to an overall score of 75.98, surpassing models like Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct.

Usage

J1-7B-RL can be used either conventionally, for direct evaluation of LLM outputs, or with the apply_stts function for enhanced performance through Simple Test-Time Scaling, as demonstrated in the provided code examples.
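For the conventional usage path, a pairwise judge typically receives a prompt containing the instruction and two candidate responses, then emits a verdict after its reasoning. The sketch below shows one plausible shape for that flow; the prompt template, the "[[A]]"/"[[B]]" verdict markers, and the helper names are assumptions for illustration, not J1-7B-RL's documented format.

```python
# Hedged sketch of conventional LLM-as-a-Judge usage: build a pairwise
# comparison prompt, then parse a verdict out of the model's output.
# The template and verdict markers are assumptions, not the model's
# documented interface.

JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare the two responses to the "
    "instruction and output [[A]] or [[B]] for the better one.\n\n"
    "Instruction: {instruction}\n\n"
    "Response A: {a}\n\n"
    "Response B: {b}\n"
)

def build_judge_prompt(instruction, a, b):
    """Fill the pairwise comparison template."""
    return JUDGE_TEMPLATE.format(instruction=instruction, a=a, b=b)

def parse_verdict(judgment):
    """Return 'A', 'B', or None. Takes the LAST marker in the text so
    that reflective reasoning (which may mention both options before
    settling) does not confuse parsing."""
    pos_a, pos_b = judgment.rfind("[[A]]"), judgment.rfind("[[B]]")
    if pos_a == -1 and pos_b == -1:
        return None
    return "A" if pos_a > pos_b else "B"

prompt = build_judge_prompt("Sum 2 and 3.", "5", "Four")
# prompt would then be sent to the judge model; its output is parsed
# with parse_verdict.
```

Taking the last verdict marker rather than the first matters for a reflective judge like this one: STTS deliberately extends the reasoning trace, so earlier tentative mentions of "[[A]]" or "[[B]]" should not be mistaken for the final decision.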