RohitUltimate/Qwen3.5-2B_20K

VISIONConcurrency Cost:1Model Size:2.3BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jun 3, 2026License:mitArchitecture:Transformer0.0K Open Weights Cold

RohitUltimate/Qwen3.5-2B_20K is a 2.3 billion parameter vision-language model based on Qwen3.5-2B, fine-tuned for image-text-to-text tasks. It features an extended context length of 12,000 tokens and is optimized for instruction-following and multimodal understanding. This model is specifically aligned for bank statement extraction and designed for efficient deployment on GPUs with under 8GB VRAM.

Loading preview...

Model Overview

RohitUltimate/Qwen3.5-2B_20K is a 2.3 billion parameter vision-language model, building upon the Qwen3.5-2B architecture. It has been specifically fine-tuned for image-text-to-text tasks, demonstrating enhanced performance in instruction-following and multimodal understanding.

Key Capabilities

  • Vision-Language Integration: Processes both image and text inputs to generate text outputs.
  • Extended Context Window: Supports an impressive context length of 12,000 tokens, allowing for more comprehensive input processing.
  • Optimized for Bank Statement Extraction: Benefits from high-quality training data and alignment tailored for this specific application.
  • Efficient Deployment: Designed to run effectively on GPUs with less than 8GB VRAM, making it suitable for low-cost inference environments.
  • Improved Instruction Following: Shows better adherence to instructions compared to its base model.

When to Use This Model

This model is particularly well-suited for:

  • Applications requiring multimodal understanding where both visual and textual information are crucial.
  • Tasks involving the extraction of information from bank statements or similar document processing scenarios.
  • Deployments where GPU memory is limited (under 8GB VRAM) but robust vision-language capabilities are needed.
  • Use cases demanding an extended context window for processing longer or more complex inputs.