patrickamadeus/Qwen2.5-VL-3B-Instruct-ft

VISIONConcurrency Cost:1Model Size:3BQuant:BF16Ctx Length:32kPublished:May 23, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

patrickamadeus/Qwen2.5-VL-3B-Instruct-ft is a 3 billion parameter Qwen2.5-VL-based multimodal instruction-tuned model, converted from patrickamadeus/qwen2_5vl-1000. This model supports both text-only and image-plus-text inference, making it suitable for visual question answering and image description tasks. It offers a 32K context length and is designed for direct use with standard Transformers API for multimodal applications.

Loading preview...

Overview

patrickamadeus/Qwen2.5-VL-3B-Instruct-ft is a 3 billion parameter multimodal instruction-tuned model built on the Qwen2.5-VL architecture. It is a converted checkpoint from patrickamadeus/qwen2_5vl-1000, designed for seamless integration with the standard Hugging Face Transformers API without requiring custom wrappers.

Key Capabilities

  • Multimodal Understanding: Processes both text and image inputs, enabling visual question answering and image description generation.
  • Instruction Following: Fine-tuned to follow instructions for various tasks, making it adaptable to different prompts.
  • Standard API Compatibility: Loadable and usable directly with transformers and qwen-vl-utils, simplifying development and deployment.
  • Efficient Inference: With 3 billion parameters, it offers a balance between performance and computational efficiency for multimodal tasks.

Good For

  • Visual Question Answering (VQA): Answering questions based on provided images.
  • Image Captioning: Generating descriptive text for images.
  • Multimodal Chatbots: Developing conversational agents that can interact with users using both text and visual information.
  • Research and Development: Experimenting with multimodal large language models in a readily accessible format.