apple/SimpleSD-4B-instruct

Text generation · Concurrency cost: 1 · Model size: 4B · Quant: BF16 · Context length: 32k · Published: Mar 17, 2026 · License: apple-amlr · Architecture: Transformer

apple/SimpleSD-4B-instruct is a 4-billion-parameter instruction-tuned language model developed by Apple and initialized from a Qwen base model. It uses the Simple Self-Distillation (SimpleSD) method, which improves code generation by fine-tuning the model on its own sampled outputs, without rewards, verifiers, or external teacher models. The gains are largest on competitive programming benchmarks, particularly on harder problems, where it substantially outperforms its base Qwen counterpart.


Overview

apple/SimpleSD-4B-instruct is a 4-billion-parameter instruction-tuned model from Apple built with the Simple Self-Distillation (SimpleSD) method: the model is fine-tuned on outputs sampled from itself, eliminating the need for rewards, verifiers, or external teacher models. It is initialized from the Qwen3-4B-Instruct-2507 base model and targets improved performance on coding tasks.

Key Capabilities & Method

  • Improved Code Generation: SimpleSD samples solutions from the base model using non-unit temperature and top-k/top-p truncation, then fine-tunes on these samples via standard supervised learning.
  • Precision–Exploration Conflict Resolution: The method reshapes token distributions context-dependently, making a single global decoding configuration more effective at evaluation time.
  • Significant Performance Gains: On LiveCodeBench v6, this model shows a +7.5 pass@1 improvement and +15.8 pass@5 improvement over the base Qwen3-4B-Instruct-2507 model.
  • Research Checkpoint: This model serves as a research checkpoint for reproducibility of the SimpleSD method, as detailed in the paper: Embarrassingly Simple Self-Distillation Improves Code Generation.
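The sampling step described above (non-unit temperature with top-k/top-p truncation) can be sketched in plain Python. This is a minimal illustration of the decoding transform, not code from the SimpleSD release; the function name and default values are assumptions:

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample one token id: temperature-scale, keep top-k, then nucleus (top-p) filter."""
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [(i, e / z) for i, e in enumerate(exps)]
    # Top-k truncation: keep only the k most likely tokens.
    probs.sort(key=lambda t: t[1], reverse=True)
    probs = probs[:top_k]
    # Top-p (nucleus) truncation: keep the smallest prefix whose
    # cumulative probability mass reaches top_p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving tokens and draw a sample.
    z = sum(p for _, p in kept)
    r, acc = rng.random() * z, 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]
```

In SimpleSD, solutions sampled this way from the base model become the supervised fine-tuning targets; the truncation matters because it biases the training data toward plausible yet diverse completions.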

When to Use This Model

This model is particularly well-suited for:

  • Code Generation Tasks: Especially for competitive programming problems where it demonstrates strong improvements.
  • Research and Experimentation: Ideal for exploring self-distillation techniques in code generation.

Note that these are research checkpoints, not optimized Qwen releases, and they do not represent a broader open-source model strategy from Apple.
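If the checkpoint is distributed in Hugging Face format, it should be loadable with the standard `transformers` generation API. This is a hedged sketch: the chat template support and the decoding settings shown are assumptions, not values confirmed by the release.

```python
# Assumes the checkpoint is published in Hugging Face transformers format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/SimpleSD-4B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# A single global decoding configuration, in line with the SimpleSD claim
# that one setting works well after distillation (values are illustrative).
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True,
                         temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```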