unsloth/Magistral-Small-2506

TEXT GENERATION · Concurrency Cost: 2 · Model Size: 24B · Quant: FP8 · Ctx Length: 32k · Published: Jun 10, 2025 · License: apache-2.0 · Architecture: Transformer

Magistral-Small-2506 is a 24-billion-parameter language model published by unsloth, built on Mistral Small 3.1 with enhanced reasoning capabilities. It was trained with Supervised Fine-Tuning (SFT) on Magistral Medium reasoning traces, followed by Reinforcement Learning (RL). The model is designed for efficient reasoning tasks, supports dozens of languages, and can be deployed locally on hardware such as an RTX 4090 or a 32GB-RAM MacBook once quantized. It offers a 128k context window, with optimal performance recommended up to 40k tokens.


Magistral-Small-2506 Overview

Magistral-Small-2506 is a 24-billion-parameter language model published by unsloth and based on Mistral Small 3.1. Its reasoning capabilities come from Supervised Fine-Tuning (SFT) on Magistral Medium traces followed by Reinforcement Learning (RL). The model is optimized for complex reasoning tasks and generates long chains of thought before producing an answer.

Key Features

  • Enhanced Reasoning: Specifically designed to produce detailed reasoning traces before answering.
  • Multilingual Support: Supports a wide array of languages including English, French, German, Japanese, Chinese, and many others.
  • Flexible Deployment: Can be deployed locally, fitting on an RTX 4090 or a 32GB RAM MacBook after quantization.
  • Apache 2.0 License: Allows for broad commercial and non-commercial use and modification.
  • Extended Context Window: Features a 128k context window, with optimal performance recommended up to 40k tokens.
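The claim that a 24B model fits on an RTX 4090 (24 GB VRAM) or a 32GB-RAM MacBook hinges on quantization. A rough back-of-the-envelope sketch, using an assumed ~20% overhead factor for activations and KV cache (a common rule of thumb, not a figure from this card), shows why:

```python
def est_vram_gb(params_b: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """Rough memory estimate in GB: parameter bytes times an assumed
    ~20% overhead for activations and KV cache."""
    return params_b * (bits_per_param / 8) * overhead

# 24B parameters at different precisions (Q4 uses ~4.5 effective
# bits/param to account for quantization metadata):
for name, bits in [("BF16", 16), ("FP8", 8), ("Q4", 4.5)]:
    print(f"{name}: ~{est_vram_gb(24, bits):.1f} GB")
```

Under these assumptions, full BF16 weights (~58 GB) far exceed a 4090, FP8 (~29 GB) is still over budget, and a 4-bit quant (~16 GB) fits comfortably, which matches the deployment targets listed above.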

Performance Highlights

While Magistral Medium scores slightly higher, Magistral Small demonstrates strong performance across benchmarks including AIME24, AIME25, GPQA Diamond, and LiveCodeBench (v5).

Recommended Usage

For optimal results, use the recommended sampling parameters: top_p: 0.95, temperature: 0.7, and max_tokens: 40960. The model also benefits from the recommended chat template, whose system prompt encourages structured thinking: an inner monologue first, then a concise summary of the answer. vLLM is the recommended inference engine, with community-prepared quantized versions available for llama.cpp, LM Studio, Ollama, and Unsloth.
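As a minimal sketch of how those recommendations fit together, the snippet below assembles an OpenAI-style chat-completion payload (an assumed request schema, compatible with vLLM's OpenAI-compatible server). The system prompt here is a paraphrase of the thinking-style instruction described above, not the exact template shipped with the model:

```python
# Sampling parameters recommended on this model card.
RECOMMENDED_SAMPLING = {"top_p": 0.95, "temperature": 0.7, "max_tokens": 40960}

# Paraphrased thinking-style system prompt (assumption, not the official template).
THINKING_SYSTEM_PROMPT = (
    "First work through the problem as an inner monologue, "
    "then provide a concise summary of your answer."
)

def build_request(user_message: str) -> dict:
    """Assemble a chat-completion payload with the recommended settings."""
    return {
        "model": "unsloth/Magistral-Small-2506",
        "messages": [
            {"role": "system", "content": THINKING_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        **RECOMMENDED_SAMPLING,
    }
```

The resulting dict can be sent as the JSON body of a `/v1/chat/completions` request to any server exposing that interface; for production use, prefer the chat template bundled with the model's tokenizer over a hand-written system prompt.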