RLHFlow/Qwen2.5-7B-DPO

Text Generation · Concurrency Cost: 1 · Model Size: 7.6B · Quant: FP8 · Ctx Length: 32k · Published: Feb 16, 2025 · Architecture: Transformer

RLHFlow/Qwen2.5-7B-DPO is a 7.6-billion-parameter language model developed by RLHFlow, fine-tuned from Qwen2.5-Math-7B-Base using an iterative Direct Preference Optimization (DPO) approach. The model is optimized for mathematical reasoning and shows substantial gains on benchmarks such as AIME 2024, MATH 500, and OlympiadBench, making it well suited to applications that demand advanced numerical and logical problem-solving.


Overview

RLHFlow/Qwen2.5-7B-DPO is a 7.6-billion-parameter model derived from Qwen2.5-Math-7B-Base, developed by RLHFlow. It is trained with an online iterative Direct Preference Optimization (DPO) method, building on the success of rule-reward reinforcement learning approaches such as DeepSeek-R1-Zero and PPO-based pipelines. The model is designed specifically to strengthen mathematical reasoning and problem-solving.

Key Capabilities & Training

  • Mathematical Proficiency: The model shows significant improvements across various mathematical benchmarks, including AIME 2024, MATH 500, AMC, Minerva Math, and OlympiadBench.
  • Iterative DPO: Trained with an online iterative DPO method: sample multiple responses per prompt, rank them with rule-based rewards, and optimize the policy on the resulting preference pairs. Online iteration mitigates the distribution shift and limited coverage inherent to fixed offline preference data.
  • Base Model: Fine-tuned from Qwen2.5-Math-7B-Base with an additional Supervised Fine-Tuning (SFT) warm-up phase.
  • Context Length: Supports a substantial context length of 131,072 tokens, beneficial for complex, multi-step mathematical problems.
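The training loop described above can be sketched in a few lines. This is a simplified illustration, not RLHFlow's actual code: `dpo_pair_loss` is the standard DPO objective on per-response log-probabilities, and `make_pairs` stands in for the rule-based ranking step (here, a binary correctness reward).

```python
import math

def dpo_pair_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given the summed log-probability of
    each response under the current policy and the frozen reference model."""
    margin = beta * ((pol_chosen - pol_rejected) - (ref_chosen - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def make_pairs(samples):
    """Rule-based ranking: pair every correct response (reward 1) with every
    incorrect one (reward 0). `samples` is a list of (response, is_correct)
    tuples produced by sampling the current policy on one prompt."""
    correct = [resp for resp, ok in samples if ok]
    wrong = [resp for resp, ok in samples if not ok]
    return [(c, w) for c in correct for w in wrong]
```

Each iteration samples fresh responses from the current policy, builds pairs with `make_pairs`, minimizes the mean of `dpo_pair_loss` over them, and repeats with the updated policy, which is what keeps the preference data on-distribution.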

Performance Highlights

Compared to its base model, RLHFlow/Qwen2.5-7B-DPO achieves notable gains:

  • AIME 2024: 30.0 (+13.3) points
  • MATH 500: 84.4 (+32.0) points
  • Minerva Math: 33.5 (+20.6) points
  • OlympiadBench: 48.4 (+32.0) points

Use Cases

This model is particularly well-suited for applications requiring advanced mathematical reasoning, such as:

  • Automated problem-solving in competitive mathematics.
  • Educational tools for explaining complex mathematical concepts.
  • Research in AI for mathematical theorem proving and problem generation.
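For these use cases, the model can be queried through the Hugging Face `transformers` API. The sketch below is a minimal example, not official usage documentation: the step-by-step system prompt follows the common Qwen2.5-Math convention and is an assumption, as is the exact chat template (check the released tokenizer).

```python
def build_messages(problem: str) -> list:
    """Wrap a math problem in a chat-style message list. The boxed-answer
    system prompt is an assumption borrowed from Qwen2.5-Math usage."""
    return [
        {"role": "system",
         "content": "Please reason step by step, and put your final answer "
                    "within \\boxed{}."},
        {"role": "user", "content": problem},
    ]

def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Generate a solution with the model. Requires `transformers` and enough
    memory for a 7.6B checkpoint (imported lazily for that reason)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model_id = "RLHFlow/Qwen2.5-7B-DPO"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")
    prompt = tokenizer.apply_chat_template(
        build_messages(problem), tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```

For competition-style problems, the final answer can then be extracted from the `\boxed{}` span of the returned text.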