joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3

Text Generation · Concurrency Cost: 1 · Model Size: 1.5B · Quant: BF16 · Ctx Length: 32k · Published: Apr 22, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3 is a 1.5-billion-parameter Qwen2.5-Coder model fine-tuned by joshuasundance with Direct Preference Optimization (DPO). It specializes in generating Python code with comprehensive type annotations, achieving high `mypy --strict` pass rates and annotation slot coverage. The model is optimized for in-domain type-hinting preferences rather than general code-generation performance.
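To illustrate the target output style, here is a hand-written example of the kind of fully annotated code that satisfies `mypy --strict` (an illustration only, not actual model output):

```python
from collections import Counter


def word_frequencies(text: str) -> dict[str, int]:
    """Count occurrences of each whitespace-separated word."""
    counts: Counter[str] = Counter(text.lower().split())
    return dict(counts)


def top_words(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Return the n most common words with their counts."""
    freqs: dict[str, int] = word_frequencies(text)
    return sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Every parameter, return type, and local variable slot carries an annotation, which is precisely what the model's preference tuning rewards.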


Detailed Summary for mypo-qwen2.5-coder-1.5b-dpo-v3

This model is a 1.5-billion-parameter Qwen2.5-Coder variant fine-tuned by joshuasundance using Direct Preference Optimization (DPO) to prioritize Python code generation with full type annotations. It starts from the Qwen2.5-Coder-1.5B-Instruct base, which was first supervised fine-tuned (SFT) and then further optimized with DPO.

Key Capabilities & Differentiators

  • Strong Type-Hinting Preference: Achieves a 92.0% `mypy --strict` pass rate and 96.3% annotation slot coverage in batched evaluation, and a 73.3% `mypy --strict` pass rate with 97.6% annotation slot coverage in single-prompt inference, significantly outperforming its base model (0% in both cases).
  • Preference-Tuned: Demonstrates a 52.7% preference win-rate against gold-standard type-hinted code, indicating a strong alignment with desired output style.
  • Fully Merged Model: Shipped as a standalone model, eliminating the need for PEFT dependencies for deployment.
  • Reproducible Training: All training scripts, data, and evaluation artifacts are publicly available, ensuring full transparency and reproducibility.
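The "annotation slot coverage" metric above counts how many annotatable positions (function parameters and return types) actually carry annotations. A minimal sketch of one plausible way to compute it with the standard-library `ast` module; the exact definition used in this model's evaluation is an assumption here:

```python
import ast


def annotation_coverage(source: str) -> float:
    """Fraction of annotatable slots (parameters + return types) that are annotated."""
    tree = ast.parse(source)
    total = 0
    annotated = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            # Every positional-only, regular, and keyword-only parameter is one slot.
            for arg in args.posonlyargs + args.args + args.kwonlyargs:
                total += 1
                annotated += arg.annotation is not None
            # The return type is one additional slot per function.
            total += 1
            annotated += node.returns is not None
    return annotated / total if total else 1.0
```

A real evaluator might additionally exempt `self`/`cls` or count variable annotations; this sketch keeps only the unambiguous cases.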

Use Cases & Limitations

Good for:

  • Generating Python code where explicit and comprehensive type annotations are a primary requirement.
  • Developers seeking a smaller, specialized model for type-hinted code generation tasks.
  • Integrating into workflows that benefit from highly annotated Python code for improved readability, maintainability, and static analysis.
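Since the model ships fully merged, it can be used through the standard Hugging Face `transformers` text-generation pipeline. A minimal sketch; the prompt wording and generation settings are assumptions, not documented defaults of this model:

```python
def build_prompt(task: str) -> str:
    """Wrap a task description in an instruction requesting fully annotated code."""
    return (
        "Write a Python function with complete type annotations "
        f"that passes `mypy --strict`.\nTask: {task}\n"
    )


def generate(
    task: str,
    model_id: str = "joshuasundance/mypo-qwen2.5-coder-1.5b-dpo-v3",
) -> str:
    """Generate type-annotated Python code for the given task."""
    from transformers import pipeline  # deferred import: heavyweight dependency

    pipe = pipeline("text-generation", model=model_id)
    out = pipe(build_prompt(task), max_new_tokens=256, do_sample=False)
    return out[0]["generated_text"]
```

Greedy decoding (`do_sample=False`) is a reasonable starting point for code generation, where determinism aids static checking of the output.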

Not ideal for:

  • General-purpose code generation where type-hinting is not a priority, as its HumanEval+ performance (58.5% pass@1 base tests) is lower than the original Qwen base model.
  • Tasks requiring support for languages other than English and Python.
  • Scenarios where strict adherence to ruff style guidelines is paramount, as it shows a slight regression in ruff pass rates compared to its SFT predecessor.