test-time-scaling/J1_7B_RL

Text Generation · Model Size: 7.6B · Quant: FP8 · Context Length: 32k · Concurrency Cost: 1 · Published: May 21, 2025 · Architecture: Transformer

J1-7B-RL is a 7.6 billion parameter LLM-as-a-Judge model, based on Qwen2.5-7B-Base, developed through Supervised Fine-Tuning and Reinforcement Learning. It is specifically designed to leverage Simple Test-Time Scaling (STTS) for enhanced reflective reasoning and improved judgment performance. This model serves as an improved preference judge for evaluating the quality of other LLM outputs, demonstrating superior scaling behavior under STTS.


J1-7B-RL: An LLM-as-a-Judge with Enhanced Test-Time Scaling

J1-7B-RL is a 7.6 billion parameter model, built upon the Qwen2.5-7B-Base architecture, specifically engineered to function as an LLM-as-a-Judge. Its development involved a two-stage training process: initial Supervised Fine-Tuning (SFT) on a curated dataset (J1-SFT-53K) followed by Reinforcement Learning (RL) using the Reinforce++ algorithm on the English subset of the RISE dataset.

Key Capabilities and Features

  • Enhanced Reflective Reasoning: The model is trained to make effective use of reflective reasoning tokens via its two-stage SFT-then-RL paradigm.
  • STTS Compatibility: J1-7B-RL demonstrates superior scaling behavior when integrated with Simple Test-Time Scaling (STTS) techniques, outperforming previous LLM-as-a-Judge models in this regard.
  • Improved Judgment Performance: It achieves a 4.8% improvement in overall judgment performance and exhibits a 5.1% stronger scaling trend under STTS, making it a more effective tool for evaluating LLM outputs.
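To make the STTS mechanism concrete, the sketch below shows one common form of Simple Test-Time Scaling, sometimes called budget forcing: each time the model finishes its reasoning, a wait token is appended and generation is resumed, forcing additional reflection. The function and parameter names here (`apply_stts_sketch`, `num_waits`, `wait_token`) and the stub generator are illustrative assumptions, not the model's documented interface.

```python
# Minimal sketch of Simple Test-Time Scaling (STTS) via budget forcing.
# Assumption: `generate` is any callable that continues a text prompt;
# a real setup would wrap the J1-7B-RL model, but here we stub it so
# the sketch is self-contained.

def apply_stts_sketch(generate, prompt, num_waits=2, wait_token="Wait"):
    """Force extra reflection: each time generation stops, append the
    wait token and resume, extending the reasoning trace."""
    text = generate(prompt)
    for _ in range(num_waits):
        # Re-open the reasoning trace and ask the model to reconsider.
        text = generate(text + "\n" + wait_token + ",")
    return text

# Stub generator for illustration only.
def stub_generate(prompt):
    return prompt + " ...reasoning step..."

out = apply_stts_sketch(stub_generate, "Judge: which answer is better?",
                        num_waits=2)
# The trace now contains one initial pass plus two forced reflections.
```

Under this scheme, `num_waits` is the knob that trades extra inference compute for judgment quality, which is the scaling trend the numbers above refer to.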

Performance Highlights

Evaluated across four diverse preference datasets (RewardBench, RewardMath, Anthropic Harmless, CodePrefBench), J1-7B-RL (SFT + RL) significantly outperforms other state-of-the-art LLM-as-a-Judge models in its size class. Notably, it achieved 90.15 on RewardMath and 67.80 on CodePrefBench, contributing to an overall score of 75.98, surpassing models like Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct.

Usage

J1-7B-RL can be used either conventionally, for direct evaluation of LLM outputs, or with the apply_stts function for enhanced performance through Simple Test-Time Scaling, as demonstrated in the provided code examples.
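For the conventional usage path, a pairwise judge typically receives a prompt containing the instruction and two candidate responses, then emits a verdict after its reasoning. The sketch below shows one plausible shape for that flow; the prompt template, the "[[A]]"/"[[B]]" verdict markers, and the helper names are assumptions for illustration, not J1-7B-RL's documented format.

```python
# Hedged sketch of conventional LLM-as-a-Judge usage: build a pairwise
# comparison prompt, then parse a verdict out of the model's output.
# The template and verdict markers are assumptions, not the model's
# documented interface.

JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare the two responses to the "
    "instruction and output [[A]] or [[B]] for the better one.\n\n"
    "Instruction: {instruction}\n\n"
    "Response A: {a}\n\n"
    "Response B: {b}\n"
)

def build_judge_prompt(instruction, a, b):
    """Fill the pairwise comparison template."""
    return JUDGE_TEMPLATE.format(instruction=instruction, a=a, b=b)

def parse_verdict(judgment):
    """Return 'A', 'B', or None. Takes the LAST marker in the text so
    that reflective reasoning (which may mention both options before
    settling) does not confuse parsing."""
    pos_a, pos_b = judgment.rfind("[[A]]"), judgment.rfind("[[B]]")
    if pos_a == -1 and pos_b == -1:
        return None
    return "A" if pos_a > pos_b else "B"

prompt = build_judge_prompt("Sum 2 and 3.", "5", "Four")
# prompt would then be sent to the judge model; its output is parsed
# with parse_verdict.
```

Taking the last verdict marker rather than the first matters for a reflective judge like this one: STTS deliberately extends the reasoning trace, so earlier tentative mentions of "[[A]]" or "[[B]]" should not be mistaken for the final decision.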