Name: Harsha901/qwen2.5-coder-3b-distilled-from-14b-merged API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Harsha901

Model Overview

Harsha901/qwen2.5-coder-3b-distilled-from-14b-merged is a 3.1-billion-parameter variant of Qwen2.5-Coder-3B-Instruct, improved through Generalized Knowledge Distillation (GKD) from the Qwen2.5-Coder-14B-Instruct teacher model. The distillation trained a LoRA adapter with TRL's DistillationTrainer on a Python code instruction dataset; the adapter was then merged back into the base weights.
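To illustrate what merging an adapter into the base weights means, here is a minimal sketch using the standard LoRA update W' = W + (α/r)·B·A on toy matrices. The matrix values and sizes are hypothetical and chosen only for readability; the scale α/r = 2 matches the r=16, α=32 setting listed under Training Details. This is not the author's merge script, only the underlying arithmetic.

```python
# Toy LoRA merge with plain Python lists (illustrative values, not real weights).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, r, alpha):
    """Return W + (alpha / r) * B @ A, i.e. the merged weight matrix."""
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical 2x3 base weight with a rank-1 adapter; alpha / r = 2.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, -1.0, 0.0]]      # shape (r, in_features)
B = [[1.0], [2.0]]          # shape (out_features, r)
x = [[1.0], [2.0], [3.0]]   # input as a column vector

merged = merge_lora(W, A, B, r=1, alpha=2)

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
with_adapter = [[bo[0] + 2.0 * ao[0]] for bo, ao in zip(base_out, adapter_out)]

# Merged path: a single matrix multiply, no adapter at inference time.
merged_out = matmul(merged, x)
```

Both paths give the same output, which is why a merged checkpoint can be loaded like an ordinary model of the base architecture.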

Key Capabilities & Performance

Enhanced Code Generation: Achieves 83.5% pass@1 on the HumanEval benchmark for Python programming tasks, a 2.44 percentage point improvement over the base Qwen2.5-Coder-3B-Instruct model (81.1%).
Reasoning Improvement: Shows gains in reasoning-heavy tasks such as string manipulation, cipher encoding, and date parsing.
Efficient Deployment: As a merged model, it loads exactly like the original 3B model without requiring adapter plumbing, making it straightforward to use.
Memory Footprint: Requires approximately 6.5 GB VRAM in bf16 precision or 2.5 GB in 4-bit NF4 quantization, making it suitable for consumer GPUs.
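The figures above can be sanity-checked with a little arithmetic. The sketch below assumes the scores are computed over HumanEval's 164 problems; the solved counts (137 and 133) are not stated in the source and are simply the counts that would reproduce the quoted percentages. The memory estimate covers weights only, so it comes in below the quoted VRAM figures, which presumably also include runtime overhead.

```python
HUMANEVAL_PROBLEMS = 164  # number of problems in the HumanEval benchmark

def pass_at_1(solved, total=HUMANEVAL_PROBLEMS):
    """pass@1 as a percentage, given the number of problems solved."""
    return 100.0 * solved / total

# Assumed solved counts (inferred from the quoted percentages, not from the source).
distilled = pass_at_1(137)
base = pass_at_1(133)
gain = distilled - base  # difference in percentage points

def weight_gb(params, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

bf16_gb = weight_gb(3.1e9, 16)  # 16 bits per parameter
nf4_gb = weight_gb(3.1e9, 4)    # 4 bits per parameter, ignoring quantization metadata

print(f"distilled {distilled:.2f}%, base {base:.2f}%, gain {gain:.2f} pp")
print(f"weights only: bf16 {bf16_gb:.2f} GB, 4-bit {nf4_gb:.2f} GB")
```

Under these assumptions the gain corresponds to four additional HumanEval problems solved.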

Training Details

The model was trained for 300 steps on the iamtarun/python_code_instructions_18k_alpaca dataset using a single NVIDIA A100 80 GB GPU. The GKD loss function incorporated both on-policy (25%) and teacher-forced (75%) sequences with a symmetric Jensen–Shannon divergence. The student model was Qwen/Qwen2.5-Coder-3B-Instruct with LoRA (r=16, α=32), and the teacher was Qwen/Qwen2.5-Coder-14B-Instruct loaded in 4-bit NF4.
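The loss described above can be sketched in plain Python. This is a minimal illustration of a symmetric Jensen–Shannon divergence between student and teacher next-token distributions, plus a sampler that picks on-policy sequences 25% of the time. The function names, the toy logits, and the per-sequence sampling scheme are assumptions for illustration; the source does not say how the 25%/75% split is applied, and this is not the trainer's actual code.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Symmetric Jensen-Shannon divergence: mean KL to the 50/50 mixture."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sequence_loss(student_logits, teacher_logits):
    """Mean per-token JSD over one sequence of logit vectors."""
    losses = [jsd(softmax(s), softmax(t))
              for s, t in zip(student_logits, teacher_logits)]
    return sum(losses) / len(losses)

def pick_source(rng, on_policy_fraction=0.25):
    """Choose whether a training sequence is student-generated or from the dataset."""
    return "student_generated" if rng.random() < on_policy_fraction else "dataset"

# Hypothetical logits: two token positions, vocabulary of three tokens.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[1.5, 1.0, -2.0], [0.0, 0.0, 1.0]]
loss = sequence_loss(student, teacher)

rng = random.Random(0)
sources = [pick_source(rng) for _ in range(10_000)]
on_policy_share = sources.count("student_generated") / len(sources)
```

The symmetric form is bounded above by ln 2 and is zero only when the two distributions agree, so the student is penalized for mismatches in either direction.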

Limitations & Future Work

Limited Training Steps: Only 300 steps were completed; longer training is expected to yield further improvements.
Arithmetic Edge Cases: The model shows slight regressions on precise numeric and sequence edge cases.
Python-only Training: Generalization to other programming languages is untested.
Distillation Dataset: The training dataset focuses on general Python instruction-following, not competitive algorithmic problems like HumanEval.
