RohitUltimate/Qwen3.5_VL_2B_12k
RohitUltimate/Qwen3.5_VL_2B_12k is a 2.3 billion parameter vision-language model based on Qwen3.5-2B, fine-tuned for image-text-to-text tasks. It features an extended context length of 32768 tokens and is optimized for instruction-following and multimodal understanding. This model is specifically aligned for bank statement extraction and designed for efficient deployment on GPUs with under 8GB VRAM.
Loading preview...
Model Overview
RohitUltimate/Qwen3.5_VL_2B_12k is a specialized vision-language model, building upon the Qwen3.5-2B architecture. It has been meticulously fine-tuned to excel in image-text-to-text tasks, offering enhanced instruction-following and multimodal understanding capabilities.
Key Capabilities
- Vision-Language Integration: Processes both image and text inputs to generate text outputs.
- Extended Context Window: Supports a substantial context length of 32768 tokens, allowing for processing longer inputs and maintaining conversational coherence.
- Optimized for Specific Tasks: Demonstrates improved performance in instruction-following and multimodal understanding, particularly aligned for bank statement extraction.
- Efficient Deployment: Designed to operate effectively on GPUs with less than 8GB VRAM, making it suitable for cost-effective and resource-constrained environments.
Deployment
The model can be efficiently served using the vLLM inference pipeline, which is known for its high throughput and memory efficiency. This allows for robust deployment even with its extended context capabilities.
Use Cases
This model is particularly well-suited for applications requiring:
- Automated extraction and analysis of information from bank statements.
- Multimodal instruction-following where both visual and textual cues are critical.
- Applications needing a powerful yet VRAM-efficient vision-language model with a long context window.