Harsha901/qwen2.5-coder-3b-distilled-from-14b-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:3.1BQuant:BF16Ctx Length:32kPublished:May 10, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Harsha901/qwen2.5-coder-3b-distilled-from-14b-merged is a 3.1 billion parameter Qwen2.5-Coder model, created by Harsha901, that has been enhanced through Generalized Knowledge Distillation (GKD) from the larger Qwen2.5-Coder-14B-Instruct teacher model. This merged model, with a 32K context length, demonstrates improved performance on Python code generation tasks, specifically achieving an 83.5% pass@1 on HumanEval. It is optimized for efficient code generation and instruction following, particularly in Python.

Loading preview...

Model Overview

This model, Harsha901/qwen2.5-coder-3b-distilled-from-14b-merged, is a 3.1 billion parameter Qwen2.5-Coder-3B-Instruct variant. It has been significantly improved through Generalized Knowledge Distillation (GKD) from the Qwen2.5-Coder-14B-Instruct teacher model. The distillation process involved training a LoRA adapter using TRL's DistillationTrainer on a Python code instruction dataset, which was then merged back into the base weights.

Key Capabilities & Performance

  • Enhanced Code Generation: Achieves an 83.5% pass@1 on the HumanEval benchmark for Python programming tasks, representing a +2.44 percentage point improvement over the base Qwen2.5-Coder-3B-Instruct model (81.1%).
  • Reasoning Improvement: Shows gains in reasoning-heavy tasks such as string manipulation, cipher encoding, and date parsing.
  • Efficient Deployment: As a merged model, it loads exactly like the original 3B model without requiring adapter plumbing, making it straightforward to use.
  • Memory Footprint: Requires approximately 6.5 GB VRAM in bf16 precision or 2.5 GB in 4-bit NF4 quantization, making it suitable for consumer GPUs.

Training Details

The model was trained for 300 steps on the iamtarun/python_code_instructions_18k_alpaca dataset using a single NVIDIA A100 80 GB GPU. The GKD loss function incorporated both on-policy (25%) and teacher-forced (75%) sequences with a symmetric Jensen–Shannon divergence. The student model was Qwen/Qwen2.5-Coder-3B-Instruct with LoRA (r=16, α=32), and the teacher was Qwen/Qwen2.5-Coder-14B-Instruct loaded in 4-bit NF4.

Limitations & Future Work

  • Limited Training Steps: Only 300 steps were completed; longer training is expected to yield further improvements.
  • Arithmetic Edge Cases: Experienced slight regressions on precise numeric/sequence edge cases.
  • Python-only Training: Generalization to other programming languages is untested.
  • Distillation Dataset: The training dataset focuses on general Python instruction-following, not competitive algorithmic problems like HumanEval.