Name: Harvard-DCML/boomerang-qwen3-2.3B API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Harvard-DCML

Model Overview

The Harvard-DCML/boomerang-qwen3-2.3B model is a 2.3 billion parameter student model distilled from the 4 billion parameter Qwen3-4B-Base. It is built for Boomerang distillation, a technique that creates intermediate-sized models by selectively reincorporating layers from the larger teacher model into the student, without requiring further training. The method is detailed in the paper "Boomerang Distillation Enables Zero-Shot Model Size Interpolation" (arXiv:2510.05064).

Key Capabilities

Efficient Distillation: The model was initialized by copying every other layer and the last two layers from Qwen3-4B-Base.
Activation Matching: Distilled on 2.1 billion tokens of The Pile with a combination of cross-entropy, KL-divergence, and cosine-distance losses, so that the student tracks the teacher model's outputs and activations.
Model Size Interpolation: Designed to be used with the build_intermediate_model function from the dcml-lab/boomerang-distillation GitHub repository to create custom-sized models.
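The initialization rule described above (every other layer plus the last two) can be sketched as a small index computation. This is an illustration only, not code from the dcml-lab/boomerang-distillation repository: the function name student_layer_indices is hypothetical, the choice to start the stride at layer 0 is an assumption, and the teacher depth of 36 layers should be checked against the Qwen3-4B-Base configuration.

```python
def student_layer_indices(num_teacher_layers: int,
                          stride: int = 2,
                          keep_last: int = 2) -> list[int]:
    """Teacher layer indices copied into the student at initialization.

    Hypothetical helper: keeps every `stride`-th layer (starting at 0,
    an assumption) and always keeps the final `keep_last` layers.
    """
    strided = set(range(0, num_teacher_layers, stride))
    tail = set(range(num_teacher_layers - keep_last, num_teacher_layers))
    return sorted(strided | tail)


# Assumed teacher depth; verify against the teacher's config file.
indices = student_layer_indices(36)
print(len(indices), indices)
```

Under these assumptions the student keeps roughly half of the teacher's transformer layers, which is consistent with the drop from 4B to 2.3B parameters once embeddings are counted.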

Training Details

The distillation process ran for 500 training steps with a maximum sequence length of 2048 and an effective batch size of 2048. It used the AdamW optimizer with a learning rate of 3e-4 on a cosine scheduler, and applied separate weights to the KLDiv and cosine distance loss terms.
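As a quick consistency check, the step count, sequence length, and batch size above account for the 2.1 billion token budget, provided every sequence is filled to the maximum length (an assumption). The sketch also shows one common form of cosine learning-rate decay; the absence of warmup and the final learning rate of 0 are assumptions, since the listing does not state them.

```python
import math

STEPS = 500         # training steps
SEQ_LEN = 2048      # maximum sequence length, in tokens
BATCH_SIZE = 2048   # effective batch size, in sequences
PEAK_LR = 3e-4      # peak learning rate

# Token budget, assuming every sequence is packed to SEQ_LEN.
tokens_per_step = SEQ_LEN * BATCH_SIZE    # 4,194,304
total_tokens = STEPS * tokens_per_step    # 2,097,152,000, about 2.1 billion


def cosine_lr(step: int, total_steps: int = STEPS,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr (no warmup; an assumption)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"{total_tokens:,} tokens")
print(cosine_lr(0), cosine_lr(250), cosine_lr(500))
```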

Use Cases

This model is particularly useful for flexible deployment, where compute budgets or latency constraints call for adjustable model sizes. Developers can set the model's capacity by interpolating between the student and teacher models, with no additional training.
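The interpolation idea can be pictured as a layer plan in which some student layers are swapped back for the block of teacher layers they stand in for. The sketch below is conceptual and does not reproduce the build_intermediate_model function from the dcml-lab/boomerang-distillation repository: interpolate_layers is a hypothetical name, it manipulates labels rather than real weights, and swapping from the end of the network first is an assumption.

```python
def interpolate_layers(student_from: list[int],
                       num_teacher_layers: int,
                       num_swapped: int) -> list[str]:
    """Build a layer plan mixing student and teacher layers.

    student_from[j] is the teacher layer that student layer j was copied
    from. Student layer j stands in for the teacher layers from
    student_from[j] up to (not including) student_from[j + 1].
    The last `num_swapped` student layers are replaced by their teacher
    blocks (hypothetical ordering). 'S<j>' marks a student layer and
    'T<i>' a teacher layer.
    """
    if not 0 <= num_swapped <= len(student_from):
        raise ValueError("num_swapped must be between 0 and the student depth")
    boundaries = student_from + [num_teacher_layers]
    first_swapped = len(student_from) - num_swapped
    plan: list[str] = []
    for j in range(len(student_from)):
        if j >= first_swapped:
            plan.extend(f"T{i}" for i in range(boundaries[j], boundaries[j + 1]))
        else:
            plan.append(f"S{j}")
    return plan


# Toy example: a 6-layer teacher and a 4-layer student copied from layers 0, 2, 4, 5.
for k in range(5):
    print(k, interpolate_layers([0, 2, 4, 5], 6, k))
```

With num_swapped set to 0 the plan is the pure student, and with every layer swapped it is the full teacher; the values in between give the intermediate model sizes.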
