RLHFlow/Qwen2.5-7B-DPO

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Published: Feb 16, 2025 · Architecture: Transformer

RLHFlow/Qwen2.5-7B-DPO is a 7.6-billion-parameter language model developed by RLHFlow, fine-tuned from Qwen2.5-Math-7B-Base using an iterative Direct Preference Optimization (DPO) approach. The model is optimized for mathematical reasoning and shows substantial gains on benchmarks such as AIME 2024, MATH 500, and OlympiadBench, making it well suited to applications that demand advanced numerical and logical problem-solving.


Overview

RLHFlow/Qwen2.5-7B-DPO is a 7.6-billion-parameter model derived from Qwen2.5-Math-7B-Base, developed by RLHFlow. It is trained with an online iterative Direct Preference Optimization (DPO) method, building on the success of rule-reward reinforcement learning approaches such as DeepSeek-R1-Zero and PPO-based pipelines. The model is designed specifically to strengthen mathematical reasoning and problem-solving.

Key Capabilities & Training

  • Mathematical Proficiency: The model shows significant improvements across various mathematical benchmarks, including AIME 2024, MATH 500, AMC, Minerva Math, and OlympiadBench.
  • Iterative DPO: Trained with an online iterative DPO method: sample multiple responses per prompt, rank them with rule-based rewards, and optimize the policy on the resulting preference pairs. Online iteration mitigates the distribution shift and limited coverage inherent to fixed offline preference data.
  • Base Model: Fine-tuned from Qwen2.5-Math-7B-Base with an additional Supervised Fine-Tuning (SFT) warm-up phase.
  • Context Length: Supports a substantial context length of 131,072 tokens, beneficial for complex, multi-step mathematical problems.
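The training loop described above can be sketched in a few lines. This is a simplified illustration, not RLHFlow's actual code: `dpo_pair_loss` is the standard DPO objective on per-response log-probabilities, and `make_pairs` stands in for the rule-based ranking step (here, a binary correctness reward).

```python
import math

def dpo_pair_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given the summed log-probability of
    each response under the current policy and the frozen reference model."""
    margin = beta * ((pol_chosen - pol_rejected) - (ref_chosen - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def make_pairs(samples):
    """Rule-based ranking: pair every correct response (reward 1) with every
    incorrect one (reward 0). `samples` is a list of (response, is_correct)
    tuples produced by sampling the current policy on one prompt."""
    correct = [resp for resp, ok in samples if ok]
    wrong = [resp for resp, ok in samples if not ok]
    return [(c, w) for c in correct for w in wrong]
```

Each iteration samples fresh responses from the current policy, builds pairs with `make_pairs`, minimizes the mean of `dpo_pair_loss` over them, and repeats with the updated policy, which is what keeps the preference data on-distribution.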

Performance Highlights

Compared to its base model, RLHFlow/Qwen2.5-7B-DPO achieves notable gains:

  • AIME 2024: 30.0 (+13.3) points
  • MATH 500: 84.4 (+32.0) points
  • Minerva Math: 33.5 (+20.6) points
  • OlympiadBench: 48.4 (+32.0) points

Use Cases

This model is particularly well-suited for applications requiring advanced mathematical reasoning, such as:

  • Automated problem-solving in competitive mathematics.
  • Educational tools for explaining complex mathematical concepts.
  • Research in AI for mathematical theorem proving and problem generation.
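For these use cases, the model can be queried through the Hugging Face `transformers` API. The sketch below is a minimal example, not official usage documentation: the step-by-step system prompt follows the common Qwen2.5-Math convention and is an assumption, as is the exact chat template (check the released tokenizer).

```python
def build_messages(problem: str) -> list:
    """Wrap a math problem in a chat-style message list. The boxed-answer
    system prompt is an assumption borrowed from Qwen2.5-Math usage."""
    return [
        {"role": "system",
         "content": "Please reason step by step, and put your final answer "
                    "within \\boxed{}."},
        {"role": "user", "content": problem},
    ]

def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Generate a solution with the model. Requires `transformers` and enough
    memory for a 7.6B checkpoint (imported lazily for that reason)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model_id = "RLHFlow/Qwen2.5-7B-DPO"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")
    prompt = tokenizer.apply_chat_template(
        build_messages(problem), tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```

For competition-style problems, the final answer can then be extracted from the `\boxed{}` span of the returned text.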