VanillaKD-Pretrain-Qwen-500M is a 500M-parameter language model with the Qwen architecture, released by MiniLLM. It was pre-trained from scratch on 50B tokens of the Pile using vanilla token-level knowledge distillation, with Qwen1.5-1.8B as the teacher model. It serves as a baseline for MiniPLM-Qwen-500M and demonstrates strong performance for its compute budget.
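
As a rough illustration, the sketch below shows what vanilla token-level knowledge distillation looks like in PyTorch with the `transformers` library: at every token position, the student's next-token distribution is pulled toward the teacher's via a KL term added to the standard language-modeling loss. This is a minimal sketch under stated assumptions, not the released training recipe; the toy input and the equal loss weighting are illustrative choices, and the hub identifiers are assumed to match the published checkpoints.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Teacher and student share the Qwen tokenizer, so their vocabularies align.
teacher = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B").eval()
student = AutoModelForCausalLM.from_pretrained("MiniLLM/VanillaKD-Pretrain-Qwen-500M")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

batch = tokenizer("The Pile is a large, diverse text corpus.", return_tensors="pt")
labels = batch["input_ids"].clone()

# The teacher provides a full next-token distribution at every position.
with torch.no_grad():
    teacher_logits = teacher(**batch).logits          # [batch, seq, vocab]

out = student(**batch, labels=labels)                  # also computes the LM loss
student_logits = out.logits

vocab = student_logits.size(-1)
# Token-level KD: KL(teacher || student) averaged over all token positions.
kd_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1).reshape(-1, vocab),
    F.softmax(teacher_logits, dim=-1).reshape(-1, vocab),
    reduction="batchmean",
)
loss = out.loss + kd_loss  # equal weighting is an illustrative assumption
loss.backward()
```

Because the KL term is computed per token rather than per sequence, the student receives a dense training signal from the teacher's full output distribution at every position, which is what distinguishes token-level distillation from sequence-level variants.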