Overview
The continuum-ai/qwen2.5-coder-7b-compacted is a 7.6 billion parameter model derived from the Qwen2.5-Coder-7B base, developed by continuum-ai. It was compacted by pruning 12% of attention heads, selected by activation magnitude, and then compensating for the lost capacity with KL-distillation using LoRA adapters. The goal is to reduce model size while recovering performance to within calibration tolerance of the unmodified base model.
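Activation-magnitude head pruning can be sketched as follows. This is an illustrative implementation, not the exact procedure used for this model: it assumes heads are scored by mean absolute output activation over a calibration set and the lowest-scoring 12% are dropped.

```python
import numpy as np

def head_pruning_mask(head_activations, prune_frac=0.12):
    """Rank attention heads by mean absolute activation magnitude and
    mark the lowest `prune_frac` fraction for removal.

    head_activations: array of shape (num_heads, num_samples), e.g. head
    output norms collected over a calibration set.
    Returns a boolean mask of shape (num_heads,), True = keep.
    """
    scores = np.abs(head_activations).mean(axis=1)   # one score per head
    n_prune = int(round(len(scores) * prune_frac))
    pruned = np.argsort(scores)[:n_prune]            # lowest-magnitude heads
    mask = np.ones(len(scores), dtype=bool)
    mask[pruned] = False
    return mask

# Toy example: 8 heads, 16 calibration samples each, with head 0
# deliberately scaled to have the smallest activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 16)) * np.linspace(0.1, 2.0, 8)[:, None]
mask = head_pruning_mask(acts, prune_frac=0.12)
print(mask.sum(), "of", len(mask), "heads kept")
```

With 8 heads, a 12% budget rounds to one pruned head, and the lowest-magnitude head (index 0 in the toy data) is the one removed.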
Key Capabilities & Performance
- Code Generation: Optimized for English-language Python code completion.
- Benchmarking: Achieves a HumanEval score of 61.0 (compared to the base's 62.2) and a HumanEval+ score of 53.0 (base 53.7), demonstrating effective performance recovery post-pruning.
- Provenance: Features cryptographic provenance via the ForgeAlloy chain of custody, ensuring verifiable claims and results.
- Compaction Methodology: Follows a prune → lora → eval pipeline; full details are available in the methodology paper.
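The distillation step of the pipeline minimizes the KL divergence between the pruned student's and the original teacher's output distributions, with gradients flowing only through low-rank LoRA adapters. A minimal sketch of the temperature-scaled KL objective (the standard distillation loss, not necessarily the exact hyperparameters used here):

```python
import numpy as np

def kl_distill_loss(student_logits, teacher_logits, T=1.0):
    """KL(teacher || student) over the vocabulary, temperature-scaled and
    averaged over positions -- the usual distillation objective."""
    def log_softmax(x):
        x = x / T
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    t_p = np.exp(t_logp)
    # Multiply by T^2 so gradient magnitudes stay comparable across temperatures
    return float((t_p * (t_logp - s_logp)).sum(axis=-1).mean() * T * T)

# Identical distributions give zero loss; a mismatched student gives a
# positive loss that the LoRA parameters would be trained to drive down.
teacher = np.array([[1.0, 2.0, 0.5]])
zero = kl_distill_loss(teacher, teacher)
mismatch = kl_distill_loss(np.zeros((1, 3)), teacher)
print(zero, mismatch)
```

In the actual pipeline the student logits come from the pruned model plus trainable LoRA adapters, while the teacher is the frozen unmodified base.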
Limitations & Use Cases
This model is a methodology demonstration for model compaction rather than a production-ready, Pareto-optimal artifact for all hardware tiers. It is validated specifically for English-language Python code completion; performance on other programming languages, code paradigms, and code-adjacent domains has not been measured. It is currently text-only, with no vision modality integrated. For production code workloads on smaller hardware, the unmodified Qwen2.5-Coder-7B at standard quantization may be a better fit, pending future larger Qwen3.5+ forges that fully leverage the pruning dimension.