Aletheia-Bench/DPO-Think-1.5B

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: Nov 8, 2025 | License: cc-by-nc-sa-4.0 | Architecture: Transformer | Open Weights | Warm

Aletheia-Bench/DPO-Think-1.5B is a 1.5-billion-parameter code verifier model developed by Aletheia-Bench, fine-tuned with a context length of 32,768 tokens. It is trained using Direct Preference Optimization (DPO) with intermediate thinking traces and negative samples, and is designed to rate and rerank code generation outputs robustly. It is intended as a plug-and-play reward function for code generation policy optimization and for automated evaluation in code-related tasks.


Aletheia-Bench/DPO-Think-1.5B: A Code Verifier Model

Aletheia-Bench/DPO-Think-1.5B is a 1.5 billion parameter code verifier model, part of the Aletheia project, which explores the effectiveness of Reinforcement Learning from Verifiable Rewards (RLVR) for code verifiers. This specific model is trained using Direct Preference Optimization (DPO), incorporating intermediate thinking traces and negative samples, but without on-policy training. It is designed to evaluate and rerank code generation outputs, particularly in scenarios where execution feedback is difficult to obtain.
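
For readers unfamiliar with DPO, the standard per-pair objective can be sketched as below. This is a minimal illustration of the general DPO loss, not the project's training code: the function name `dpo_loss`, the scalar log-probability inputs, and the `beta` value of 0.1 are all assumptions made for the example, and the model card does not state the hyperparameters actually used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    (for a verifier, e.g. thinking trace plus verdict) under either the
    policy being trained or the frozen reference model.
    beta=0.1 is an illustrative value, not one taken from the model card.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# With policy == reference the loss is log(2); it drops as the policy
# assigns relatively more probability to the chosen response.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
print(baseline, improved)
```

In this framing, the rejected response plays the role of the negative sample, and "without on-policy training" means the pairs are collected ahead of time instead of being sampled from the model during optimization.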

Key Capabilities

  • Robust Code Verification: Trained to judge the correctness of code snippets, even in challenging out-of-distribution scenarios.
  • Thinking-based Training: Leverages intermediate thinking traces to enhance verification accuracy.
  • Offline Preference Optimization: Utilizes pre-collected thinking traces for efficient training.
  • Multi-domain Robustness: Evaluated across disparate policy models and covariate shifts using the Aletheia testbed.

Good For

  • RLHF / RLAIF: Serving as a plug-and-play reward function for optimizing code generation policies.
  • Automated Evaluation: Acting as an LLM-as-a-judge for various code-related tasks.
  • Research: Studying the impact of thinking traces, negative samples, and on-policy learning in code verifier training.
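
As a rough illustration of the reranking use case, the sketch below orders candidate solutions by a verifier score and picks the best one. The model card does not specify the verifier's prompt format or how its verdict is turned into a number, so `score_fn` is a hypothetical callback that a user would implement around the model, and `toy_score` is only a stand-in so the example runs without loading any weights.

```python
from typing import Callable, List, Tuple


def rerank(problem: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float]) -> List[Tuple[float, str]]:
    """Return (score, candidate) pairs sorted from highest to lowest score.

    score_fn is a hypothetical wrapper around the verifier: it takes the
    problem statement and one candidate program, and returns a number where
    higher means "more likely correct".
    """
    scored = [(score_fn(problem, code), code) for code in candidates]
    # sorted() is stable, so candidates with equal scores keep their order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(problem: str, code: str) -> float:
    # Placeholder scorer used only for this demo; a real setup would call
    # the verifier model here and parse its verdict into a float.
    return 1.0 if "return a + b" in code else 0.0


problem = "Write add(a, b) that returns the sum of a and b."
candidates = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]
ranked = rerank(problem, candidates, toy_score)
best_score, best_code = ranked[0]
print(best_score)
print(best_code)
```

The same scores could, under similar assumptions, serve as the scalar reward in a policy optimization loop, with `score_fn` called once per sampled program.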