fenyo/Qwen2.5-7B-base2instruct

TEXT GENERATIONConcurrency Cost:1Model Size:7.6BQuant:FP8Ctx Length:32kTool Calling:SupportedPublished:Jun 4, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

fenyo/Qwen2.5-7B-base2instruct is a 7.6 billion parameter instruction-tuned causal language model derived from the Qwen2.5-7B base model. Developed by fenyo, this model specializes in instruction following, outperforming the official Qwen2.5-7B-Instruct on IFEval and MMLU benchmarks. It was created through a unique SFT → DPO → RLVR pipeline, focusing on verifiable rewards for instruction adherence.

Loading preview...

Model Overview

fenyo/Qwen2.5-7B-base2instruct is a 7.6 billion parameter instruction-tuned model built upon the Qwen2.5-7B base. Its development focused on reproducing the base-to-instruct transformation using a unique SFT → DPO → RLVR pipeline on a single H100 GPU, aiming to derive reusable recipes for instruction tuning.

Key Capabilities & Performance

This model demonstrates strong instruction-following abilities, achieving a score of 75.0 on IFEval, surpassing the official Qwen2.5-7B-Instruct (71.9). It also performs well on general knowledge tasks, scoring 70.2 on MMLU (vs. 68.8 for the official instruct model). While excelling in instruction adherence, it shows slightly lower performance in mathematical reasoning (79.7 on GSM8K vs. 84.7 for the official instruct model).

Training Methodology Highlights

The training process involved three key stages:

  • Supervised Fine-Tuning (SFT): Used 300k examples from allenai/tulu-3-sft-mixture to teach ChatML format and assistant behavior, significantly improving IFEval from 27.4 to 51.2.
  • Direct Preference Optimization (DPO): Crucially, a targeted DPO using allenai/tulu-3-pref-personas-instruction-following (focused on instruction adherence) boosted IFEval from 51.2 to 68.9, highlighting the importance of task-specific data.
  • Reinforcement Learning with Verifiable Rewards (RLVR): Employed the GRPO algorithm with graduated rewards (multi-constraint prompts) to amplify instruction-following, further increasing IFEval to 75.0. This stage leveraged verifiable rewards for maths (GSM8K) and custom instruction constraints.

Use Cases

This model is particularly well-suited for applications requiring precise instruction following and general knowledge tasks. Its specialized training makes it a strong candidate for scenarios where adherence to specific directives is paramount, such as automated content generation with strict guidelines or complex query resolution.