pranavvmurthy26/Qwen3-VL-2B-Instruct-Docling-5K-30perc-11ep
pranavvmurthy26/Qwen3-VL-2B-Instruct-Docling-5K-30perc-11ep is a 2 billion parameter vision-language model, fine-tuned from Qwen/Qwen3-VL-2B-Instruct, specifically optimized for document understanding tasks. It leverages the Qwen3-VL architecture, which features advanced visual perception, reasoning, and an extended context length of 32768 tokens. This model is particularly suited for applications requiring detailed analysis and interaction with visual documents, building upon its base model's capabilities in visual agent operations, spatial perception, and enhanced multimodal reasoning.
Loading preview...
Qwen3-VL-2B-Instruct-Docling-5K-30perc-11ep Overview
This model is a fine-tuned version of the Qwen/Qwen3-VL-2B-Instruct, a 2 billion parameter vision-language model from the Qwen series. It has been specifically trained for 10 epochs on 30% of the original image size using the pranavvmurthy26/DoclingMatix_5K dataset, indicating a specialization in document understanding and visual question answering related to documents.
Key Capabilities & Enhancements
Building on the foundational Qwen3-VL architecture, this model inherits and potentially enhances:
- Superior Text Understanding & Generation: Seamless fusion of text and vision for comprehensive comprehension.
- Deeper Visual Perception & Reasoning: Advanced capabilities in interpreting visual information.
- Extended Context Length: Supports a native 256K context, expandable to 1M, allowing for processing of long documents and videos.
- Enhanced Multimodal Reasoning: Excels in complex tasks like STEM/Math, providing logical and evidence-based answers.
- Upgraded Visual Recognition: Broad and high-quality pretraining enables recognition of diverse entities.
- Expanded OCR: Supports 32 languages and is robust in challenging conditions, with improved long-document structure parsing.
- Visual Agent Capabilities: Operates PC/mobile GUIs, recognizes elements, and completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
Architectural Innovations
The base Qwen3-VL model incorporates several architectural updates:
- Interleaved-MRoPE: Enhances long-horizon video reasoning through robust positional embeddings.
- DeepStack: Fuses multi-level ViT features for fine-grained details and improved image-text alignment.
- Text-Timestamp Alignment: Provides precise, timestamp-grounded event localization for stronger video temporal modeling.
Good For
- Document Analysis: Given its fine-tuning dataset, it's well-suited for tasks involving understanding and extracting information from documents.
- Visual Question Answering: Answering questions based on visual input, especially within document contexts.
- Multimodal Applications: Scenarios requiring a strong interplay between visual and textual data.