QpiEImitation/gkd_gsm8k_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct

TEXT GENERATIONConcurrency Cost:1Model Size:3.1BQuant:BF16Ctx Length:32kPublished:Apr 20, 2026Architecture:Transformer Cold

QpiEImitation/gkd_gsm8k_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct is a 3.1 billion parameter instruction-tuned causal language model based on the Qwen2.5-3B-Instruct architecture. This model has been fine-tuned using the GKD (On-Policy Distillation of Language Models) method, which focuses on learning from self-generated mistakes. It is optimized for tasks requiring improved reasoning and problem-solving capabilities, particularly in areas like mathematical reasoning as suggested by its GSM8K context.

Loading preview...

Overview

This model, gkd_gsm8k_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct, is a fine-tuned version of the Qwen2.5-3B-Instruct base model, developed by QpiEImitation. It leverages the Qwen2.5-3B-Instruct architecture, featuring 3.1 billion parameters and a context length of 32768 tokens. The fine-tuning process utilized the TRL library and implemented the GKD (On-Policy Distillation of Language Models) method.

Key Capabilities

  • Enhanced Reasoning: The GKD training method, detailed in the paper "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes," aims to improve the model's ability to learn from its own errors, potentially leading to more robust reasoning. This is particularly relevant for tasks like mathematical problem-solving, as indicated by the gsm8k in its name.
  • Instruction Following: As an instruction-tuned model, it is designed to understand and execute user prompts effectively.

Training Methodology

The model was trained using GKD, a distillation technique where a smaller student model learns from a larger teacher model's self-generated mistakes. This approach is designed to enhance the student model's performance by focusing on challenging examples.

Good For

  • Applications requiring a compact yet capable model for reasoning tasks.
  • Research and experimentation with on-policy distillation methods like GKD.
  • Tasks that benefit from improved instruction following and problem-solving, potentially including mathematical word problems (GSM8K).

Limitations

As a 3.1 billion parameter model, it may not match the performance of much larger models on highly complex or open-ended generative tasks.