QpiEImitation/gkd_gsm8k_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct
QpiEImitation/gkd_gsm8k_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct is a 3.1 billion parameter instruction-tuned causal language model based on the Qwen2.5-3B-Instruct architecture. This model has been fine-tuned using the GKD (On-Policy Distillation of Language Models) method, which focuses on learning from self-generated mistakes. It is optimized for tasks requiring improved reasoning and problem-solving capabilities, particularly in areas like mathematical reasoning as suggested by its GSM8K context.
Loading preview...
Overview
This model, gkd_gsm8k_S-Qwen2.5-3B-Instruct_T-Qwen2-7B-Instruct, is a fine-tuned version of the Qwen2.5-3B-Instruct base model, developed by QpiEImitation. It leverages the Qwen2.5-3B-Instruct architecture, featuring 3.1 billion parameters and a context length of 32768 tokens. The fine-tuning process utilized the TRL library and implemented the GKD (On-Policy Distillation of Language Models) method.
Key Capabilities
- Enhanced Reasoning: The GKD training method, detailed in the paper "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes," aims to improve the model's ability to learn from its own errors, potentially leading to more robust reasoning. This is particularly relevant for tasks like mathematical problem-solving, as indicated by the
gsm8kin its name. - Instruction Following: As an instruction-tuned model, it is designed to understand and execute user prompts effectively.
Training Methodology
The model was trained using GKD, a distillation technique where a smaller student model learns from a larger teacher model's self-generated mistakes. This approach is designed to enhance the student model's performance by focusing on challenging examples.
Good For
- Applications requiring a compact yet capable model for reasoning tasks.
- Research and experimentation with on-policy distillation methods like GKD.
- Tasks that benefit from improved instruction following and problem-solving, potentially including mathematical word problems (GSM8K).
Limitations
As a 3.1 billion parameter model, it may not match the performance of much larger models on highly complex or open-ended generative tasks.