Devstral-Vision-Small-2507 Overview

Devstral-Vision-Small-2507 is a multimodal language model developed by Eric Hartford at Quixi AI, integrating the robust coding prowess of Devstral-Small-2507 with the visual comprehension of Mistral-Small-3.2-24B-Instruct-2506. This 24 billion parameter model features a 128k token context window and is engineered for advanced software engineering tasks that require visual context.

Key Capabilities

Vision-Augmented Coding: Analyzes screenshots, UI mockups, and designs to generate and modify code.
Debugging with Visuals: Facilitates debugging of visual rendering issues by interpreting actual screenshots.
Design-to-Code Conversion: Converts visual designs and wireframes directly into implementation code.
Superior Coding Performance: Inherits Devstral's strong performance on coding tasks, including multi-file editing and codebase exploration, achieving 53.6% on SWE-Bench Verified when used with OpenHands.
Robust Vision Understanding: Maintains Mistral-Small's capabilities in interpreting UI elements, layouts, charts, and diagrams.

Good for

Visual Software Engineering: Ideal for tasks like building UI components from screenshots or converting design mockups to code.
Code Review with Visual Context: Reviewing code changes alongside their visual output.
Debugging Visual Issues: Pinpointing and resolving rendering problems using visual feedback.
Agentic Coding Tasks: Optimized for use with frameworks like OpenHands for automated development workflows.

This model was created by surgically replacing the language model weights of Mistral-Small-3.2-24B-Instruct-2506 with those from Devstral-Small-2507, while preserving all vision components. It requires approximately 48GB of GPU memory for full precision or 24GB with 4-bit quantization.