MiniLLM/MiniPLM-Qwen-500M

0.6B parameters · BF16 · 32768 context length · Oct 17, 2024 · License: apache-2.0

MiniPLM-Qwen-500M Overview

MiniPLM-Qwen-500M is a 0.6B-parameter language model built on the Qwen architecture and pre-trained from scratch with the MiniPLM knowledge distillation (KD) framework. MiniPLM uses the larger Qwen1.5-1.8B as a teacher model and is designed to make training smaller student LMs efficient and flexible.
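
For quick experimentation, the snippet below is a minimal generation sketch; it assumes the checkpoint loads through the standard transformers AutoTokenizer/AutoModelForCausalLM interface and uses illustrative generation settings rather than officially recommended ones.

```python
# Minimal generation sketch (assumes standard transformers causal-LM usage).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniLLM/MiniPLM-Qwen-500M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Knowledge distillation is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```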

Key Capabilities & Features

  • Knowledge Distillation: Employs the MiniPLM framework for pre-training, which refines the training data distribution using knowledge from a larger teacher model.
  • Efficiency: Achieves KD through offline teacher LM inference, significantly reducing computational costs during student model training and enabling KD across different model families.
  • Enhanced Performance: Demonstrates improved performance on 9 widely used downstream tasks and better language modeling capabilities compared to conventional pre-training methods.
  • Scalability: The MiniPLM approach scales effectively across various model sizes, showing consistent benefits in performance relative to computational resources.
  • Data Refinement: Leverages differences between large and small LMs to increase the difficulty and diversity of the training data, helping student LMs acquire versatile knowledge (a rough sketch of this idea follows this list).
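
As a rough illustration of the data-refinement step, the sketch below scores each candidate document by the log-likelihood gap between the large teacher LM and a small reference LM, computed offline, and keeps the highest-scoring documents. The reference-model choice, helper names, and keep ratio are illustrative assumptions, not the exact procedure from the MiniPLM paper.

```python
# Hypothetical difference-sampling sketch: score(x) = log p_teacher(x) - log p_ref(x),
# computed offline, then keep the top-scoring documents for student pre-training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_log_prob(model, tokenizer, text):
    """Total log-probability a causal LM assigns to a document."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    n_predicted = inputs["input_ids"].shape[1] - 1  # loss is averaged over predicted tokens
    return -out.loss.item() * n_predicted

teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B")
reference = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B")  # small reference LM (assumption)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

corpus = ["document one ...", "document two ...", "document three ..."]
scores = [
    sequence_log_prob(teacher, tokenizer, doc) - sequence_log_prob(reference, tokenizer, doc)
    for doc in corpus
]

# Keep the top half of the corpus by score (the ratio is an illustrative choice).
keep = max(1, len(corpus) // 2)
refined = [doc for _, doc in sorted(zip(scores, corpus), reverse=True)[:keep]]
print(f"Kept {len(refined)} of {len(corpus)} documents")
```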

When to Use This Model

  • Resource-Constrained Environments: Ideal for scenarios where computational resources are limited but high performance is still desired, thanks to its efficient KD pre-training.
  • Building Smaller, Capable LMs: Suitable for developers looking to create compact language models that retain significant capabilities derived from larger, more powerful teachers.
  • Research in Knowledge Distillation: Provides a practical example and open-source resources (paper, code, pre-training corpus) for researchers exploring advanced KD techniques for pre-training LMs.