joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3
`joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3` is a 1.5-billion-parameter Qwen2.5-Coder model fine-tuned by joshuasundance with Direct Preference Optimization (DPO). It specializes in generating Python code with comprehensive type annotations, achieving high `mypy --strict` pass rates and high annotation slot coverage. The model is optimized for in-domain type-hinting preferences rather than general code-generation performance.
Detailed Summary for mypo-qwen2.5-coder-1.5b-dpo-v3
This model is a 1.5-billion-parameter Qwen2.5-Coder variant fine-tuned by joshuasundance using Direct Preference Optimization (DPO) to prioritize Python code generation with full type annotations. It builds on the Qwen2.5-Coder-1.5B-Instruct base model, which was first instruction-tuned (SFT) and then further optimized with DPO.
Key Capabilities & Differentiators
- Strong Type-Hinting Preference: Achieves a 92.0% `mypy --strict` pass rate and 96.3% annotation slot coverage in batched evaluations, and a 73.3% `mypy --strict` pass rate with 97.6% annotation slot coverage in single-prompt inference, significantly outperforming its base model (0% in both cases).
- Preference-Tuned: Demonstrates a 52.7% preference win rate against gold-standard type-hinted code, indicating strong alignment with the desired output style.
- Fully Merged Model: Shipped as a standalone model, eliminating the need for PEFT dependencies for deployment.
- Reproducible Training: All training scripts, data, and evaluation artifacts are publicly available, ensuring full transparency and reproducibility.
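The model card does not spell out how "annotation slot coverage" is computed, but the idea is straightforward: count every annotatable slot (parameters and return types) and measure what fraction actually carries an annotation. A minimal sketch of such a metric using the standard library `ast` module, assuming this definition, might look like:

```python
import ast


def annotation_slot_coverage(source: str) -> float:
    """Fraction of annotation slots (parameters + returns) that are filled.

    A rough illustration of the metric's likely definition; the model card
    does not publish the exact implementation used for evaluation.
    """
    filled = 0
    total = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            for param in params:
                if param.arg in ("self", "cls"):
                    continue  # conventionally left unannotated
                total += 1
                filled += param.annotation is not None
            total += 1  # the return-type slot
            filled += node.returns is not None
    return filled / total if total else 1.0


fully_annotated = "def add(a: int, b: int) -> int:\n    return a + b\n"
print(annotation_slot_coverage(fully_annotated))  # -> 1.0
```

Running the same function over an unannotated `def f(x): ...` would return 0.0, so the reported 96-97% coverage means the model leaves almost no slot blank.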
Use Cases & Limitations
Good for:
- Generating Python code where explicit and comprehensive type annotations are a primary requirement.
- Developers seeking a smaller, specialized model for type-hinted code generation tasks.
- Integrating into workflows that benefit from highly annotated Python code for improved readability, maintainability, and static analysis.
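To illustrate the output style the model is tuned toward, here is a hand-written (not model-generated) example of the kind of fully annotated Python this card describes: every parameter, return value, and local variable slot carries an explicit type, which should satisfy `mypy --strict` on recent Python versions.

```python
from collections import Counter


def word_frequencies(text: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Return the top_n most frequent words in text, most common first."""
    counts: Counter[str] = Counter(text.lower().split())
    return counts.most_common(top_n)


print(word_frequencies("the cat and the hat", top_n=2))
# -> [('the', 2), ('cat', 1)]
```

Code in this style is what the 92.0% `mypy --strict` pass rate and 96.3% slot coverage figures above are measuring.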
Not ideal for:
- General-purpose code generation where type hinting is not a priority, as its HumanEval+ performance (58.5% pass@1 on base tests) is lower than that of the original Qwen base model.
- Tasks requiring support for languages other than English and Python.
- Scenarios where strict adherence to `ruff` style guidelines is paramount, as it shows a slight regression in `ruff` pass rates compared to its SFT predecessor.