Harvard-DCML/boomerang-qwen3-2.3B
Harvard-DCML/boomerang-qwen3-2.3B is a 2.3 billion parameter student model distilled from the 4 billion parameter Qwen3-4B-Base, developed by Harvard-DCML. It is built for Boomerang distillation, which creates intermediate-sized models without additional training by reincorporating layers from the teacher model into the student. This lets developers pick a model size between student and teacher to match their compute and latency budget.
Model Overview
The Harvard-DCML/boomerang-qwen3-2.3B model is a 2.3 billion parameter student model derived from Qwen3-4B-Base, its 4 billion parameter teacher. It is intended for Boomerang distillation, a technique that builds intermediate-sized models by selectively reincorporating teacher layers into the student, with no further training required. The method is described in the paper "Boomerang Distillation Enables Zero-Shot Model Size Interpolation" (arXiv:2510.05064).
Key Capabilities
- Efficient Distillation: The model was initialized by copying every other layer and the last two layers from Qwen3-4B-Base.
- Activation Matching: Distilled on 2.1 billion tokens of The Pile, using a combination of cross-entropy, KL divergence, and cosine distance losses to match the teacher model's activations.
- Model Size Interpolation: Designed to be used with the build_intermediate_model function from the dcml-lab/boomerang-distillation GitHub repository to create custom-sized models.
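The initialization rule in the first bullet can be illustrated with a small index calculation. This is a minimal sketch, not the repository's code: the helper name student_init_layers is hypothetical, and it assumes that "every other layer" starts at layer 0 and that the last two layers are added on top of that selection.

```python
def student_init_layers(num_teacher_layers: int, stride: int = 2, keep_last: int = 2) -> list[int]:
    """Return the teacher layer indices that would be copied into the student.

    Hypothetical helper. Assumes the strided selection starts at layer 0 and
    that the final `keep_last` layers are always kept in addition.
    """
    strided = set(range(0, num_teacher_layers, stride))
    tail = set(range(max(0, num_teacher_layers - keep_last), num_teacher_layers))
    return sorted(strided | tail)


# Toy 8-layer teacher: even layers plus the last two layers.
print(student_init_layers(8))  # [0, 2, 4, 6, 7]
```

Under these assumptions an 8-layer teacher yields a 5-layer student. The exact indices used for Qwen3-4B-Base depend on how the authors counted layers, so treat the output as illustrative.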
Training Details
Distillation ran for 500 training steps with a maximum sequence length of 2048 tokens and an effective batch size of 2048 sequences. Training used the AdamW optimizer with a learning rate of 3e-4 and a cosine scheduler, and applied separate weights to the KL divergence and cosine distance loss terms.
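The figures above are consistent with the 2.1 billion token budget, and the weighted loss can be written down compactly. The following is a rough sketch under stated assumptions: every sequence is taken to be filled to the maximum length, the KL term is taken as KL(teacher || student), and the default kl_weight and cos_weight values are placeholders, since the actual weights are not given here. All function names are hypothetical.

```python
import math


def tokens_seen(steps: int, batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens processed, assuming full-length sequences."""
    return steps * batch_size * seq_len


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(teacher_logits, student_logits):
    # Assumed direction: KL(teacher || student).
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distillation_loss(student_logits, teacher_logits, target_index,
                      student_hidden, teacher_hidden,
                      kl_weight=1.0, cos_weight=1.0):
    """Weighted sum of the three terms; the weights are placeholder values."""
    return (cross_entropy(student_logits, target_index)
            + kl_weight * kl_divergence(teacher_logits, student_logits)
            + cos_weight * cosine_distance(student_hidden, teacher_hidden))


print(tokens_seen(steps=500, batch_size=2048, seq_len=2048))  # 2097152000
```

When the student reproduces the teacher's logits and hidden states exactly, the KL and cosine terms vanish and only the cross-entropy term remains.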
Use Cases
This model is particularly useful when deployment targets differ in available compute or latency budget and a single fixed model size does not fit all of them. Developers can adjust the model's capacity, without further training, by interpolating between the student and teacher models.
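To make the interpolation idea concrete, the sketch below plans which layers an intermediate model would contain. It is not the repository's build_intermediate_model function and does not touch any weights. The function names are hypothetical, and it assumes that each student layer stands in for the run of teacher layers starting at the index it was copied from, and that blocks are restored starting from the end of the network.

```python
def teacher_blocks(kept, num_teacher_layers):
    """Group teacher layers into the block each student layer stands in for.

    Assumes `kept` is sorted and starts at 0, so the blocks cover every layer.
    """
    bounds = list(kept) + [num_teacher_layers]
    return [list(range(bounds[j], bounds[j + 1])) for j in range(len(kept))]


def plan_intermediate_model(kept, num_teacher_layers, num_blocks_to_patch):
    """Return a list of ("student", j) / ("teacher", t) entries in layer order."""
    blocks = teacher_blocks(kept, num_teacher_layers)
    if not 0 <= num_blocks_to_patch <= len(blocks):
        raise ValueError("num_blocks_to_patch must be between 0 and the number of student layers")
    first_patched = len(blocks) - num_blocks_to_patch  # assumed: patch from the end
    plan = []
    for j, block in enumerate(blocks):
        if j >= first_patched:
            plan.extend(("teacher", t) for t in block)  # restore the teacher block
        else:
            plan.append(("student", j))  # keep the distilled layer
    return plan


# Toy example: 8-layer teacher, student initialized from layers 0, 2, 4, 6, 7.
kept = [0, 2, 4, 6, 7]
for k in (0, 3, 5):
    print(k, len(plan_intermediate_model(kept, 8, k)))
```

In this toy setup, patching 0, 3, and 5 blocks gives models with 5, 6, and 8 layers respectively, so the layer count grows from the student's toward the teacher's as more blocks are restored.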