fenyo/Qwen2.5-7B-base2instruct
fenyo/Qwen2.5-7B-base2instruct is a 7.6 billion parameter instruction-tuned causal language model derived from the Qwen2.5-7B base model. Developed by fenyo, this model specializes in instruction following, outperforming the official Qwen2.5-7B-Instruct on IFEval and MMLU benchmarks. It was created through a unique SFT → DPO → RLVR pipeline, focusing on verifiable rewards for instruction adherence.
Loading preview...
Model Overview
fenyo/Qwen2.5-7B-base2instruct is a 7.6 billion parameter instruction-tuned model built upon the Qwen2.5-7B base. Its development focused on reproducing the base-to-instruct transformation using a unique SFT → DPO → RLVR pipeline on a single H100 GPU, aiming to derive reusable recipes for instruction tuning.
Key Capabilities & Performance
This model demonstrates strong instruction-following abilities, achieving a score of 75.0 on IFEval, surpassing the official Qwen2.5-7B-Instruct (71.9). It also performs well on general knowledge tasks, scoring 70.2 on MMLU (vs. 68.8 for the official instruct model). While excelling in instruction adherence, it shows slightly lower performance in mathematical reasoning (79.7 on GSM8K vs. 84.7 for the official instruct model).
Training Methodology Highlights
The training process involved three key stages:
- Supervised Fine-Tuning (SFT): Used 300k examples from
allenai/tulu-3-sft-mixtureto teach ChatML format and assistant behavior, significantly improving IFEval from 27.4 to 51.2. - Direct Preference Optimization (DPO): Crucially, a targeted DPO using
allenai/tulu-3-pref-personas-instruction-following(focused on instruction adherence) boosted IFEval from 51.2 to 68.9, highlighting the importance of task-specific data. - Reinforcement Learning with Verifiable Rewards (RLVR): Employed the GRPO algorithm with graduated rewards (multi-constraint prompts) to amplify instruction-following, further increasing IFEval to 75.0. This stage leveraged verifiable rewards for maths (GSM8K) and custom instruction constraints.
Use Cases
This model is particularly well-suited for applications requiring precise instruction following and general knowledge tasks. Its specialized training makes it a strong candidate for scenarios where adherence to specific directives is paramount, such as automated content generation with strict guidelines or complex query resolution.