QpiEImitation/opd_math500_S-Qwen2-1.5B-Instruct_T-Qwen2-7B-Instruct

Text Generation · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Architecture: Transformer · Published: Apr 20, 2026

QpiEImitation/opd_math500_S-Qwen2-1.5B-Instruct_T-Qwen2-7B-Instruct is a 1.5-billion-parameter instruction-tuned language model fine-tuned from Qwen/Qwen2-1.5B-Instruct. It was trained with On-Policy Distillation (GKD), using Qwen/Qwen2-7B-Instruct as the teacher (the S-/T- pair in the model name), so that the student learns from its own self-generated mistakes; the method was introduced in an ICLR 2024 paper. It is suited to tasks where iterative self-correction is beneficial, offering a distinct approach to model refinement.


Model Overview

This model, opd_math500_S-Qwen2-1.5B-Instruct_T-Qwen2-7B-Instruct, is a 1.5-billion-parameter instruction-tuned variant of Qwen2-1.5B-Instruct. It was fine-tuned with the TRL framework using Generalized Knowledge Distillation (GKD), an on-policy distillation procedure in which a larger teacher model, here Qwen2-7B-Instruct, supervises the student's own generations.

Key Capabilities & Training

The primary differentiator of this model is its training methodology. GKD, introduced in "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes" (Agarwal et al., ICLR 2024), addresses the train-inference mismatch of standard distillation: rather than training only on a fixed dataset of teacher outputs, the student is also trained on sequences it samples itself, with the teacher's token-level distributions providing the correction signal. Learning from its own generated errors in this way amounts to iterative self-correction during the training phase; a sketch of such a setup with TRL follows.
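For illustration, here is a minimal sketch of how a model like this could be trained with TRL's GKDTrainer. The toy dataset, hyperparameter values (lmbda, beta), and output path are assumptions for the example, not the configuration actually used for this model.

```python
# Minimal GKD training sketch with TRL. Hyperparameters and the toy
# dataset below are illustrative assumptions, not this model's recipe.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

student_id = "Qwen/Qwen2-1.5B-Instruct"  # student (this model's base)
teacher_id = "Qwen/Qwen2-7B-Instruct"    # teacher (per the model name)

tokenizer = AutoTokenizer.from_pretrained(student_id)
model = AutoModelForCausalLM.from_pretrained(student_id)
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_id)

# Toy conversational dataset; the actual run presumably used MATH-500-style
# prompts, given "math500" in the model name.
train_dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "What is 12 * 7?"},
            {"role": "assistant", "content": "12 * 7 = 84."},
        ]
    ] * 64
})

training_args = GKDConfig(
    output_dir="gkd-qwen2-1.5b",  # hypothetical output path
    lmbda=0.5,  # fraction of batches built from student-generated (on-policy) sequences
    beta=0.5,   # interpolation coefficient of the generalized Jensen-Shannon divergence
    per_device_train_batch_size=1,
)

trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=training_args,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    train_dataset=train_dataset,
)
trainer.train()
```

With lmbda=0.5, roughly half of the training batches are sampled from the student itself and scored against the teacher's token distributions; beta interpolates the generalized Jensen-Shannon loss between forward KL (beta=0) and reverse KL (beta=1).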

When to Use This Model

  • Research into On-Policy Distillation: Ideal for researchers exploring advanced distillation techniques and self-correction mechanisms in language models.
  • Applications requiring robust instruction following: The GKD training method is designed to refine the model's ability to follow instructions by learning from its mistakes.
  • Experimentation with smaller, specialized models: As a 1.5B parameter model, it offers a more efficient option for certain tasks compared to larger models, while benefiting from a sophisticated fine-tuning approach.
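
Quick Start

Since the checkpoint is a standard causal language model, it should load with the transformers library in the usual way. The minimal sketch below assumes the repository id from this card's title and uses an illustrative math prompt.

```python
# Inference sketch; the prompt and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "QpiEImitation/opd_math500_S-Qwen2-1.5B-Instruct_T-Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # BF16, per the card metadata
)

messages = [{"role": "user", "content": "Solve for x: 3x + 5 = 20."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```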