Overview
RLHFlow/Qwen2.5-7B-SFT is a 7.6-billion-parameter model that serves as an unofficial Supervised Fine-Tuning (SFT) checkpoint within the RLHFlow project. It is built on the Qwen2.5-MATH-7B-base architecture and is the precursor to models trained with PPO, iterative DPO, and rejection sampling (RAFT) for enhanced mathematical reasoning. The SFT stage was trained on the MATH training set and Numina Math datasets.
Key Capabilities
- Mathematical Reasoning: Demonstrates strong performance on complex mathematical problem-solving, as evidenced by its use as the base for models that achieve significant improvements on benchmarks such as AIME 2024, MATH 500, AMC, Minerva Math, and OlympiadBench.
- Foundation for RLHF: This SFT model is the initial step in a pipeline designed to produce highly capable math-focused models through advanced reinforcement learning techniques such as DPO and RAFT.
Training Details
The model was initially SFT-tuned on the RLHFlow/qwq_gen_sft_15k dataset, which includes the MATH training set. The resulting SFT checkpoint then serves as the starting point for further optimization with methods such as iterative DPO and RAFT, which have been shown to match or exceed PPO-based approaches on mathematical tasks.
Good For
- Developers looking for a strong base model for mathematical reasoning tasks.
- Researchers interested in replicating or extending reinforcement learning from human feedback (RLHF) methods, particularly DPO and RAFT, for mathematical problem-solving.
- Applications requiring robust performance on competitive math benchmarks.
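For the use cases above, a minimal inference sketch with the Hugging Face transformers library may help. This is an illustrative sketch, not part of the model card: it assumes the checkpoint ships a standard Qwen2.5 chat template, and the step-by-step system prompt and generation settings are assumptions you should adapt.

```python
def build_messages(problem: str) -> list[dict]:
    """Build a chat-style prompt for a math problem.

    The boxed-answer system prompt is an assumption borrowed from common
    Qwen2.5-Math usage; adjust it for your own setup.
    """
    return [
        {
            "role": "system",
            "content": "Please reason step by step, and put your final "
                       "answer within \\boxed{}.",
        },
        {"role": "user", "content": problem},
    ]


def solve(problem: str, max_new_tokens: int = 512) -> str:
    """Generate a solution with the SFT checkpoint (downloads ~15 GB of weights)."""
    # Heavy dependency imported lazily so prompt-building stays lightweight.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "RLHFlow/Qwen2.5-7B-SFT"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # Render the chat messages with the model's own template.
    prompt = tokenizer.apply_chat_template(
        build_messages(problem), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Strip the prompt tokens and decode only the newly generated answer.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call like `solve("Find the sum of the first 100 positive integers.")` should return a step-by-step derivation ending in a boxed answer, though exact output depends on decoding settings.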