Name: Aletheia-Bench/DPO-Think-1.5B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Aletheia-Bench

Aletheia-Bench/DPO-Think-1.5B: A Code Verifier Model

Aletheia-Bench/DPO-Think-1.5B is a 1.5-billion-parameter code verifier from the Aletheia project, which studies how effective Reinforcement Learning from Verifiable Rewards (RLVR) is for training code verifiers. This variant is trained with Direct Preference Optimization (DPO) on intermediate thinking traces and negative samples, without on-policy training. It is designed to evaluate and rerank code generation outputs, particularly where execution feedback is difficult to obtain.
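
As a rough illustration of the reranking use case, the sketch below orders several candidate solutions by a verifier score. The `score_fn` callable and the `toy_score` placeholder are hypothetical stand-ins for a call to the model; nothing here reflects the project's actual inference interface.

```python
from typing import Callable, List, Tuple


def rerank(
    problem: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted by verifier score, highest first."""
    scored = [(code, score_fn(problem, code)) for code in candidates]
    # sorted() is stable, so tied candidates keep their generation order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_score(problem: str, code: str) -> float:
    # Hypothetical placeholder: a real setup would query the verifier model here.
    return 1.0 if "return a + b" in code else 0.0


if __name__ == "__main__":
    ranked = rerank(
        "Write add(a, b) that returns the sum.",
        ["def add(a, b):\n    return a - b", "def add(a, b):\n    return a + b"],
        toy_score,
    )
    print(ranked[0][0])  # the top-ranked candidate
```

Keeping the scorer as a plain callable means the same reranking loop works whether the score comes from a local model, a hosted API, or a cached table.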

Key Capabilities

Robust Code Verification: Trained to judge the correctness of code snippets, even in challenging out-of-distribution scenarios.
Thinking-based Training: Leverages intermediate thinking traces to enhance verification accuracy.
Offline Preference Optimization: Trains on pre-collected thinking traces rather than on-policy samples, which keeps training efficient.
Multi-domain Robustness: Evaluated across disparate policy models and covariate shifts using the Aletheia testbed.
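
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, assuming the "chosen" and "rejected" responses are verifier outputs (thinking trace plus verdict) and that their sequence log-probabilities under the policy and a frozen reference model are already available. The `beta` default and all input values are illustrative assumptions, not the project's training configuration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the model card
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp().
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, and equals log 2 when the policy and reference agree.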

Good For

RLHF / RLAIF: Serving as a plug-and-play reward function for optimizing code generation policies.
Automated Evaluation: Acting as an LLM-as-a-judge for various code-related tasks.
Research: Studying the impact of thinking traces, negative samples, and on-policy learning in code verifier training.
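
A verifier used as a reward function ultimately has to be reduced to a scalar. The sketch below shows one way this might look, under a loudly hypothetical output convention: a `<think>...</think>` trace followed by a final `YES` or `NO` verdict. The model card does not specify an output format, so the tags and verdict strings should be adapted to whatever the model actually emits.

```python
import re

# Assumed (hypothetical) delimiter for the thinking trace.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def verdict_to_reward(verifier_output: str) -> float:
    """Map a verifier response to a scalar reward of 0.0 or 1.0."""
    # Drop the thinking trace so only the final answer is parsed.
    answer = _THINK_BLOCK.sub("", verifier_output).strip().upper()
    if answer.startswith("YES"):
        return 1.0
    # Treat "NO" and anything unparseable as incorrect.
    return 0.0
```

Stripping the trace before parsing avoids rewarding a response merely because the word "yes" appears somewhere in the intermediate reasoning.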
