QpiEImitation/opd_math500_S-Qwen2-1.5B-Instruct_T-Qwen2-7B-Instruct

Text Generation · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Architecture: Transformer · Published: Apr 20, 2026

QpiEImitation/opd_math500_S-Qwen2-1.5B-Instruct_T-Qwen2-7B-Instruct is a 1.5-billion-parameter instruction-tuned language model fine-tuned from Qwen/Qwen2-1.5B-Instruct. It was trained with On-Policy Distillation (GKD), using Qwen/Qwen2-7B-Instruct as the teacher (the S-/T- pair in the model name), so that the student learns from its own self-generated mistakes; the method was introduced in an ICLR 2024 paper. It is suited to tasks where iterative self-correction is beneficial, offering a distinct approach to model refinement.


Model Overview

This model, opd_math500_S-Qwen2-1.5B-Instruct_T-Qwen2-7B-Instruct, is a 1.5-billion-parameter instruction-tuned variant of Qwen2-1.5B-Instruct. It was fine-tuned with the TRL framework using Generalized Knowledge Distillation (GKD), an on-policy distillation procedure in which a larger teacher model, here Qwen2-7B-Instruct, supervises the student's own generations.

Key Capabilities & Training

The primary differentiator of this model is its training methodology. GKD, introduced in "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes" (Agarwal et al., ICLR 2024), addresses the train-inference mismatch of standard distillation: rather than training only on a fixed dataset of teacher outputs, the student is also trained on sequences it samples itself, with the teacher's token-level distributions providing the correction signal. Learning from its own generated errors in this way amounts to iterative self-correction during the training phase; a sketch of such a setup with TRL follows.
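For illustration, here is a minimal sketch of how a model like this could be trained with TRL's GKDTrainer. The toy dataset, hyperparameter values (lmbda, beta), and output path are assumptions for the example, not the configuration actually used for this model.

```python
# Minimal GKD training sketch with TRL. Hyperparameters and the toy
# dataset below are illustrative assumptions, not this model's recipe.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

student_id = "Qwen/Qwen2-1.5B-Instruct"  # student (this model's base)
teacher_id = "Qwen/Qwen2-7B-Instruct"    # teacher (per the model name)

tokenizer = AutoTokenizer.from_pretrained(student_id)
model = AutoModelForCausalLM.from_pretrained(student_id)
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_id)

# Toy conversational dataset; the actual run presumably used MATH-500-style
# prompts, given "math500" in the model name.
train_dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "What is 12 * 7?"},
            {"role": "assistant", "content": "12 * 7 = 84."},
        ]
    ] * 64
})

training_args = GKDConfig(
    output_dir="gkd-qwen2-1.5b",  # hypothetical output path
    lmbda=0.5,  # fraction of batches built from student-generated (on-policy) sequences
    beta=0.5,   # interpolation coefficient of the generalized Jensen-Shannon divergence
    per_device_train_batch_size=1,
)

trainer = GKDTrainer(
    model=model,
    teacher_model=teacher_model,
    args=training_args,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    train_dataset=train_dataset,
)
trainer.train()
```

With lmbda=0.5, roughly half of the training batches are sampled from the student itself and scored against the teacher's token distributions; beta interpolates the generalized Jensen-Shannon loss between forward KL (beta=0) and reverse KL (beta=1).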

When to Use This Model

  • Research into On-Policy Distillation: Ideal for researchers exploring advanced distillation techniques and self-correction mechanisms in language models.
  • Applications requiring robust instruction following: The GKD training method is designed to refine the model's ability to follow instructions by learning from its mistakes.
  • Experimentation with smaller, specialized models: As a 1.5B parameter model, it offers a more efficient option for certain tasks compared to larger models, while benefiting from a sophisticated fine-tuning approach.
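
Quick Start

Since the checkpoint is a standard causal language model, it should load with the transformers library in the usual way. The minimal sketch below assumes the repository id from this card's title and uses an illustrative math prompt.

```python
# Inference sketch; the prompt and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "QpiEImitation/opd_math500_S-Qwen2-1.5B-Instruct_T-Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # BF16, per the card metadata
)

messages = [{"role": "user", "content": "Solve for x: 3x + 5 = 20."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```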