RohitUltimate/Qwen3.5-2B_20K
VISIONConcurrency Cost:1Model Size:2.3BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jun 3, 2026License:mitArchitecture:Transformer0.0K Open Weights Cold
RohitUltimate/Qwen3.5-2B_20K is a 2.3 billion parameter vision-language model based on Qwen3.5-2B, fine-tuned for image-text-to-text tasks. It features an extended context length of 12,000 tokens and is optimized for instruction-following and multimodal understanding. This model is specifically aligned for bank statement extraction and designed for efficient deployment on GPUs with under 8GB VRAM.
Loading preview...
Model Overview
RohitUltimate/Qwen3.5-2B_20K is a 2.3 billion parameter vision-language model, building upon the Qwen3.5-2B architecture. It has been specifically fine-tuned for image-text-to-text tasks, demonstrating enhanced performance in instruction-following and multimodal understanding.
Key Capabilities
- Vision-Language Integration: Processes both image and text inputs to generate text outputs.
- Extended Context Window: Supports an impressive context length of 12,000 tokens, allowing for more comprehensive input processing.
- Optimized for Bank Statement Extraction: Benefits from high-quality training data and alignment tailored for this specific application.
- Efficient Deployment: Designed to run effectively on GPUs with less than 8GB VRAM, making it suitable for low-cost inference environments.
- Improved Instruction Following: Shows better adherence to instructions compared to its base model.
When to Use This Model
This model is particularly well-suited for:
- Applications requiring multimodal understanding where both visual and textual information are crucial.
- Tasks involving the extraction of information from bank statements or similar document processing scenarios.
- Deployments where GPU memory is limited (under 8GB VRAM) but robust vision-language capabilities are needed.
- Use cases demanding an extended context window for processing longer or more complex inputs.