Harvard-DCML/boomerang-llama-3.2-1.9B

Text Generation | Concurrency Cost: 1 | Model Size: 3.2B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Oct 10, 2025 | License: llama3.2 | Architecture: Transformer | Cold

Harvard-DCML/boomerang-llama-3.2-1.9B is a 1.9-billion-parameter Llama student model derived from Llama-3.2-3B through a process called "boomerang distillation". It is designed for zero-shot model size interpolation: intermediate-sized models can be created without additional training. It was distilled on 2.1 billion tokens of The Pile, matching the teacher model's activations with a combination of cross-entropy, KL, and cosine losses. Its primary use is to be combined with its teacher model to build custom-sized LLMs.

Model Overview

Harvard-DCML/boomerang-llama-3.2-1.9B is a 1.9-billion-parameter student model in the Llama family, developed by Harvard-DCML. It is a product of "boomerang distillation," a technique that creates intermediate-sized models by reincorporating layers from the teacher model (Llama-3.2-3B) into the student model without further training. This enables zero-shot model size interpolation, which gives flexibility in choosing a model size at deployment time.

Training Details

This model was initialized from Llama-3.2-3B by copying specific layers from it, and was then distilled on 2.1 billion tokens from The Pile. Distillation matched the activations of the Llama-3.2-3B teacher using a combination of cross-entropy, KL, and cosine losses. Key training hyperparameters were a learning rate of 3e-4, a cosine learning rate scheduler, and an effective batch size of 2048 over 500 training steps.
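The three loss terms named above can be sketched in plain Python. This is a minimal illustration, not the authors' training code: the equal weights, the KL direction (teacher relative to student), and the choice to apply the cosine term to a single pair of hidden-state vectors are all assumptions, and every function name here is hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(student_logits, target_index):
    """Negative log-likelihood of the ground-truth token under the student."""
    return -log_softmax(student_logits)[target_index]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two output distributions."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def cosine_distance(student_hidden, teacher_hidden):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def distillation_loss(student_logits, teacher_logits, target_index,
                      student_hidden, teacher_hidden,
                      w_ce=1.0, w_kl=1.0, w_cos=1.0):
    """Weighted sum of the three terms; the weights are assumed, not sourced."""
    return (w_ce * cross_entropy(student_logits, target_index)
            + w_kl * kl_divergence(teacher_logits, student_logits)
            + w_cos * cosine_distance(student_hidden, teacher_hidden))


# Toy example for one token position with a 3-token vocabulary.
loss = distillation_loss(
    student_logits=[0.5, 1.5, 2.0],
    teacher_logits=[0.0, 1.0, 3.0],
    target_index=2,
    student_hidden=[0.2, -0.1, 0.7],
    teacher_hidden=[0.3, -0.2, 0.6],
)
print(f"combined loss: {loss:.4f}")
```

When the student reproduces the teacher exactly, the KL and cosine terms vanish and only the cross-entropy term remains, which is a quick sanity check for an implementation of this kind.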

Key Capabilities and Use Cases

The primary use of this model is its role in the boomerang distillation framework. Developers can use it together with its teacher model, Llama-3.2-3B, to construct custom-sized language models. The provided build_intermediate_model function does this, and the number of patched layers controls the size and performance of the resulting model. The approach is useful for fitting a model to specific compute constraints or performance requirements without retraining.
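The layer patching idea can be illustrated with a toy sketch. This is not the provided build_intermediate_model function, whose signature is not given here. The helper patch_layers below is hypothetical, layers are represented as plain strings, and both the mapping of each student layer to a block of teacher layers and the patch direction are assumptions made for illustration.

```python
def patch_layers(student_layers, teacher_blocks, num_layers_to_patch,
                 patch_from_end=True):
    """Return a layer stack in which some student layers are swapped back
    for the teacher layers they stand in for.

    student_layers: list of M student layers.
    teacher_blocks: list of M lists; teacher_blocks[i] holds the teacher
        layers that student layer i replaces (an assumed layout).
    num_layers_to_patch: how many student layers to swap back (0..M).
    patch_from_end: patch the last layers first if True, else the first.
    """
    m = len(student_layers)
    if len(teacher_blocks) != m:
        raise ValueError("need one teacher block per student layer")
    if not 0 <= num_layers_to_patch <= m:
        raise ValueError("num_layers_to_patch must be between 0 and M")

    if patch_from_end:
        patched = set(range(m - num_layers_to_patch, m))
    else:
        patched = set(range(num_layers_to_patch))

    result = []
    for i, layer in enumerate(student_layers):
        if i in patched:
            result.extend(teacher_blocks[i])  # restore the teacher block
        else:
            result.append(layer)              # keep the distilled layer
    return result


# Toy setup: a 6-layer teacher and a 3-layer student.
student = ["S0", "S1", "S2"]
blocks = [["T0", "T1"], ["T2", "T3"], ["T4", "T5"]]

print(patch_layers(student, blocks, 0))  # ['S0', 'S1', 'S2']
print(patch_layers(student, blocks, 1))  # ['S0', 'S1', 'T4', 'T5']
print(patch_layers(student, blocks, 2, patch_from_end=False))
# ['T0', 'T1', 'T2', 'T3', 'S2']
print(patch_layers(student, blocks, 3))
# ['T0', 'T1', 'T2', 'T3', 'T4', 'T5']
```

Patching zero layers gives back the student and patching all of them gives back the teacher, so each value in between yields an intermediate-sized stack with no extra training step.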