QpiEImitation/opd_math500_S-Qwen2-0.5B-Instruct_T-Qwen2-7B-Instruct
QpiEImitation/opd_math500_S-Qwen2-0.5B-Instruct_T-Qwen2-7B-Instruct is a 0.5-billion-parameter instruction-tuned causal language model based on the Qwen2 architecture. It is a fine-tuned version of Qwen/Qwen2-0.5B-Instruct trained with GKD (Generalized Knowledge Distillation), an on-policy distillation method in which the student learns from its own generated mistakes. The repository name indicates the student (S) is Qwen2-0.5B-Instruct, the teacher (T) is Qwen2-7B-Instruct, and training targeted MATH-500-style math prompts (math500).
Overview
This model, opd_math500_S-Qwen2-0.5B-Instruct_T-Qwen2-7B-Instruct, was fine-tuned from Qwen/Qwen2-0.5B-Instruct using GKD, the method introduced in "On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes" (Agarwal et al., ICLR 2024). Rather than imitating a fixed dataset alone, the student is trained in part on sequences it samples itself, with the larger teacher model providing token-level supervision, so it learns to analyze and correct its own characteristic errors.
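The card does not include a usage snippet; the following is a minimal inference sketch using the standard transformers chat-template flow. The math prompt and generation settings are illustrative, not taken from this card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "QpiEImitation/opd_math500_S-Qwen2-0.5B-Instruct_T-Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Qwen2 instruct checkpoints ship a chat template; build the prompt from messages.
messages = [{"role": "user", "content": "What is the sum of the first 50 positive integers?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```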
Key Capabilities
- Instruction Following: Inherited from the Qwen2-0.5B-Instruct base and further refined by distillation fine-tuning on instruction-style prompts.
- Distillation Learning: Trained with GKD, so the teacher supervises the student on the student's own sampled outputs rather than only on fixed reference text; see the loss sketch after this list.
- Qwen2 Architecture: Benefits from the foundational capabilities of the Qwen2 model family.
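To make the distillation objective concrete, here is a minimal sketch of the generalized Jensen-Shannon divergence that GKD minimizes between teacher and student next-token distributions. The function name and the beta default are illustrative, not taken from this card; in on-policy training this loss is evaluated on sequences the student sampled itself:

```python
import math
import torch
import torch.nn.functional as F

def generalized_jsd(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    beta: float = 0.5) -> torch.Tensor:
    """Generalized JSD between teacher (P) and student (Q) token distributions.

    JSD_beta(P || Q) = beta * KL(P || M) + (1 - beta) * KL(Q || M),
    with mixture M = beta * P + (1 - beta) * Q and beta in (0, 1).
    Small beta approaches forward KL; large beta approaches reverse KL.
    """
    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs
    log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs
    # log M, computed stably in log space.
    log_m = torch.logsumexp(
        torch.stack([log_p + math.log(beta), log_q + math.log(1.0 - beta)]), dim=0
    )
    # Per-token KL terms, averaged over batch and sequence positions.
    kl_pm = (log_p.exp() * (log_p - log_m)).sum(-1).mean()
    kl_qm = (log_q.exp() * (log_q - log_m)).sum(-1).mean()
    return beta * kl_pm + (1.0 - beta) * kl_qm
```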
Training Details
The model was trained using the TRL (Transformer Reinforcement Learning) library. GKD targets the train-inference mismatch of conventional distillation: a student trained only on fixed teacher or reference text never sees the error patterns it produces at inference time. During GKD training, a fraction of the training sequences is instead sampled from the student itself, and the teacher's per-token distributions supervise those self-generated sequences, so the mistakes the student actually makes appear in its training data and get corrected. This is especially valuable for multi-step tasks such as math problem solving, where early errors compound over the rest of the generation.
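The exact training configuration is not published in this card. As a hedged sketch, a comparable GKD run with TRL's GKDTrainer might look like the following; the dataset, hyperparameters, and output directory are placeholders, and the argument names assume a recent TRL release:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GKDConfig, GKDTrainer

student_id = "Qwen/Qwen2-0.5B-Instruct"
teacher_id = "Qwen/Qwen2-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(student_id)
student = AutoModelForCausalLM.from_pretrained(student_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id)

# Placeholder conversational data; GKDTrainer expects a "messages" column.
# The actual math500 data used for this checkpoint is not specified beyond
# the repository name.
train_dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "Compute 12 * 7."},
            {"role": "assistant", "content": "12 * 7 = 84."},
        ]
    ] * 8
})

# lmbda = fraction of on-policy (student-generated) sequences per step;
# beta interpolates the generalized JSD between forward and reverse KL.
args = GKDConfig(
    output_dir="opd-math500-sketch",  # hypothetical output path
    lmbda=0.5,
    beta=0.5,
    per_device_train_batch_size=1,
)

trainer = GKDTrainer(
    model=student,
    teacher_model=teacher,
    args=args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```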
When to Use This Model
This model suits use cases that call for a compact instruction-tuned model: at 0.5 billion parameters it is efficient to deploy in resource-constrained environments. Given the math500 tag in the repository name, the benefits of its distillation training are most likely to show on mathematical reasoning prompts, though it remains a general-purpose instruction follower.