coder3101/Cydonia-24B-v4.3-vision-heretic
Vision · Concurrency Cost: 2 · Model Size: 24B · Quant: FP8 · Ctx Length: 32k · Published: Feb 7, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights · Warm

The coder3101/Cydonia-24B-v4.3-vision-heretic model is a 24-billion-parameter multimodal large language model with a 32,768-token context length. Developed by coder3101, it integrates vision capabilities by grafting the Pixtral vision encoder and multimodal projector from Mistral-Small-3.2-24B-Instruct-2506 onto the Cydonia-24B-v4.3-heretic text model. It is designed for tasks requiring both text and image understanding, leveraging its Mistral-based architecture for robust multimodal processing.


Cydonia-24B-v4.3-Vision-Heretic Overview

This model, developed by coder3101, is a 24 billion parameter multimodal large language model built upon the Mistral3ForConditionalGeneration architecture. It extends the text-only coder3101/Cydonia-24B-v4.3-heretic by incorporating the Pixtral vision encoder and multimodal projector from mistralai/Mistral-Small-3.2-24B-Instruct-2506.
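In practice, the grafting described above amounts to copying the vision-related tensors from the donor multimodal checkpoint into the fine-tuned text checkpoint. A minimal sketch of that merge, where the `vision_tower.` and `multi_modal_projector.` key prefixes are assumptions modeled on typical Mistral3ForConditionalGeneration naming (verify against the actual checkpoints):

```python
# Sketch of weight grafting: take vision encoder and projector tensors
# verbatim from a donor multimodal checkpoint, keep everything else
# from the fine-tuned text-only checkpoint.
# Key prefixes here are assumptions, not confirmed checkpoint layout.
VISION_PREFIXES = ("vision_tower.", "multi_modal_projector.")

def graft_vision_weights(text_state: dict, donor_state: dict) -> dict:
    """Return a merged state dict: text weights plus donor vision weights."""
    merged = dict(text_state)  # start from the fine-tuned text model
    for key, tensor in donor_state.items():
        if key.startswith(VISION_PREFIXES):
            merged[key] = tensor  # copy vision weights unchanged from donor
    return merged
```

Because the vision tensors are copied unchanged, the text decoder keeps its fine-tuned behavior while the encoder and projector remain exactly as Mistral shipped them.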

Key Capabilities & Features

  • Multimodal Understanding: Combines advanced text generation with image comprehension, enabling it to process and respond to prompts containing both text and images.
  • Inherited Text Quality: The text generation capabilities are identical to the base Cydonia-24B-v4.3-heretic model, which is a fine-tuned version of Mistral-Small-3.2-24B-Instruct-2506.
  • Vision Integration: Utilizes the original, unmodified vision weights from Mistral-Small-3.2-24B-Instruct-2506, grafted onto the Cydonia text weights.
  • Dual Weight Formats: Ships with weights in both Mistral-native consolidated.safetensors (recommended for optimal performance) and HuggingFace model.safetensors formats.

Important Considerations

  • Vision Quality: Because the Cydonia text decoder was fine-tuned independently and never jointly trained with the grafted vision components, vision understanding may differ from that of the original Mistral Small 3.2 model.
  • Resource Requirements: Running this model in bf16 or fp16 precision typically requires approximately 55 GB of GPU RAM.
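The ~55 GB figure is consistent with a back-of-the-envelope estimate: 24 billion parameters at 2 bytes each (bf16/fp16) is roughly 48 GB of weights, with the remainder going to activations and KV cache. A quick sketch of that arithmetic:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate GPU memory needed just for the model weights, in GB."""
    return n_params * bytes_per_param / 1e9

# 24B parameters in bf16/fp16 (2 bytes per parameter) -> ~48 GB of weights;
# KV cache and activation overhead account for the rest of the ~55 GB figure.
weights_gb = weight_memory_gb(24e9, 2)
```

The same estimate explains why the FP8 quantization listed above roughly halves the weight footprint (1 byte per parameter, ~24 GB).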

Recommended Usage

This model is well-suited for applications requiring robust multimodal interaction, such as image description, visual question answering, and other tasks where both textual and visual context are crucial. It is recommended to use vLLM >= 0.9.1 for serving, leveraging the Mistral-native weight format for best results.
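When serving with vLLM's OpenAI-compatible API, a multimodal request interleaves text and image parts inside a single user message. A hedged sketch of building such a request body (the image URL is a placeholder; the message schema follows the OpenAI chat-completions format that vLLM implements):

```python
def build_multimodal_request(model: str, question: str, image_url: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one image input."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 256,
    }

payload = build_multimodal_request(
    "coder3101/Cydonia-24B-v4.3-vision-heretic",
    "Describe this image.",
    "https://example.com/cat.png",  # placeholder URL for illustration
)
# POST this JSON to the server's /v1/chat/completions endpoint.
```

The payload is sent to a server started with `vllm serve`; exact serving flags for the Mistral-native weight format depend on your vLLM version, so consult its documentation.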