QpiEImitation/gkd_math500_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct

TEXT GENERATIONConcurrency Cost:1Model Size:3.1BQuant:BF16Ctx Length:32kPublished:Apr 20, 2026Architecture:Transformer Cold

QpiEImitation/gkd_math500_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct is a 3.09 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-3B-Instruct. This model was trained using the GKD (On-Policy Distillation of Language Models) method, which focuses on learning from self-generated mistakes. It is designed for general text generation tasks, leveraging its specialized training procedure to enhance performance.

Loading preview...

Overview

This model, gkd_math500_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct, is a 3.09 billion parameter instruction-tuned language model. It is a fine-tuned variant of the Qwen/Qwen2.5-3B-Instruct base model, developed by Qwen. The fine-tuning process utilized the TRL library and incorporated a specific training methodology known as GKD.

Key Capabilities

  • Instruction Following: As an instruction-tuned model, it is designed to generate responses based on given prompts and instructions.
  • GKD Training: The model's unique characteristic is its training with GKD (On-Policy Distillation of Language Models), a method that enables the model to learn effectively from its own generated errors. This approach aims to improve the model's robustness and performance by iteratively refining its understanding.

Good For

  • General Text Generation: Suitable for a wide range of text generation tasks where instruction following is important.
  • Research into Distillation Methods: Provides a practical example of a model trained with the GKD distillation technique, which could be valuable for researchers exploring advanced training methodologies.

Training Details

The model was trained using TRL version 1.0.0.dev0, Transformers 5.3.0, Pytorch 2.6.0+cu124, Datasets 4.8.2, and Tokenizers 0.22.2. Further details on the training run can be visualized via Weights & Biases.