patrickamadeus/Qwen2.5-VL-3B-Instruct-ft_lang

VISIONConcurrency Cost:1Model Size:3BQuant:BF16Ctx Length:32kPublished:May 23, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

patrickamadeus/Qwen2.5-VL-3B-Instruct-ft_lang is a 3.09 billion parameter multimodal instruction-tuned model based on the Qwen2.5-VL architecture, fine-tuned for language understanding. It supports both text-only and image-plus-text inference, making it suitable for visual question answering and image description tasks. This model is a converted checkpoint from patrickamadeus/qwen2_5vl-select-1000, offering standard Qwen2.5-VL / Transformers API compatibility.

Loading preview...

Overview

This model, patrickamadeus/Qwen2.5-VL-3B-Instruct-ft_lang, is a 3.09 billion parameter multimodal instruction-tuned model built upon the Qwen2.5-VL architecture. It is a converted checkpoint from patrickamadeus/qwen2_5vl-select-1000, designed to be fully compatible with the standard Hugging Face Transformers API for Qwen2.5-VL models.

Key Capabilities

  • Multimodal Understanding: Processes both text and image inputs, enabling tasks like visual question answering and image description.
  • Instruction Following: Fine-tuned to respond to instructions effectively, making it suitable for interactive applications.
  • Standard API Integration: Easily loadable and usable with transformers and qwen-vl-utils, simplifying development and deployment.
  • Efficient Inference: With 3.09 billion parameters, it offers a balance between performance and computational efficiency for multimodal tasks.

Good For

  • Visual Question Answering (VQA): Answering questions based on provided images.
  • Image Captioning: Generating descriptive text for images.
  • Multimodal Chatbots: Developing conversational agents that can interpret and respond to both textual and visual cues.
  • Research and Development: A base for further fine-tuning on specific multimodal datasets or applications.