MiniLLM/VanillaKD-Pretrain-Qwen-500M

Source: Hugging Face

  • Task: Text Generation
  • Model Size: 0.6B
  • Quantization: BF16
  • Context Length: 32k
  • Published: Oct 21, 2024
  • License: apache-2.0
  • Architecture: Transformer (open weights)

VanillaKD-Pretrain-Qwen-500M is a 0.6 billion parameter language model from the MiniLLM project, built on the Qwen architecture. It was pre-trained with vanilla token-level knowledge distillation on 50 billion tokens from the Pile dataset, with Qwen1.5-1.8B serving as the teacher model. It serves as the baseline for MiniPLM-Qwen-500M and shows what standard distillation achieves at this computational scale.
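The checkpoint can be loaded through the standard transformers interface. The snippet below is a minimal sketch, assuming the repository is compatible with AutoModelForCausalLM (as is typical for Qwen-architecture releases); the prompt is illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id taken from this card; BF16 matches the published weights.
model_id = "MiniLLM/VanillaKD-Pretrain-Qwen-500M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Greedy completion of a short prompt.
inputs = tokenizer("The Pile is a large-scale", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```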


Model Overview

VanillaKD-Pretrain-Qwen-500M is a 0.6 billion parameter model built on the Qwen architecture, developed within the MiniLLM project. Its defining feature is its pre-training methodology: rather than learning only from raw text, the student was trained on 50 billion tokens from the Pile dataset to match the per-token output distribution of the larger Qwen1.5-1.8B teacher model.
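Concretely, vanilla token-level knowledge distillation minimizes the KL divergence between the teacher's and student's next-token distributions at every sequence position. The sketch below illustrates that loss; the temperature, the reduction over positions, and any mixing with a standard cross-entropy term are assumptions here, since the card does not publish the training hyperparameters.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Per-token KL(teacher || student), averaged over batch and positions.

    Both logit tensors have shape (batch, seq_len, vocab_size) and come
    from forward passes of the two models on the same input ids.
    """
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    t_prob = t_logp.exp()
    # Forward KL per token, then mean; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures (standard KD scaling).
    kl = (t_prob * (t_logp - s_logp)).sum(dim=-1)
    return kl.mean() * temperature ** 2
```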

Key Characteristics

  • Knowledge Distillation: Utilizes a token-level knowledge distillation approach during pre-training to transfer knowledge from a larger teacher model (Qwen1.5-1.8B) to a smaller student model.
  • Qwen Architecture: Based on the efficient Qwen model architecture.
  • Pre-training Corpus: Trained on a substantial 50 billion tokens from the Pile dataset, with a tokenized version also open-sourced for reproducibility.
  • Baseline Model: Serves as the vanilla-KD baseline for MiniPLM-Qwen-500M, providing a reference point for how much plain token-level distillation improves a small pre-trained model.

Performance

Models from the MiniPLM project, including this baseline, are reported to achieve better downstream performance at equal training compute and to scale well across model sizes, as illustrated in the evaluation charts on the original model card.

Use Cases

This model is primarily intended for:

  • Research and Development: Ideal for researchers exploring knowledge distillation techniques and efficient pre-training methods for smaller language models.
  • Baseline Comparisons: Useful as a strong baseline for evaluating new distillation strategies or architectural modifications in the 0.6B parameter class (a minimal perplexity sketch follows below).
  • Resource-Constrained Environments: Its strong performance relative to its size makes it suitable for applications where computational resources are limited.
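For baseline comparisons, a simple starting point is sample-level perplexity. The sketch below uses the standard transformers API; the input text and the single-sample setup are illustrative only, not the evaluation protocol used by the MiniPLM authors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniLLM/VanillaKD-Pretrain-Qwen-500M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Language models compress text by predicting the next token."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # next-token cross-entropy; exp(loss) is this sample's perplexity.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {torch.exp(loss).item():.2f}")
```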