Name: pranavvmurthy26/Qwen3-VL-2B-Instruct-Docling-5K-30perc-11ep API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: pranavvmurthy26

Qwen3-VL-2B-Instruct-Docling-5K-30perc-11ep Overview

This model is a fine-tuned version of the Qwen/Qwen3-VL-2B-Instruct, a 2 billion parameter vision-language model from the Qwen series. It has been specifically trained for 10 epochs on 30% of the original image size using the pranavvmurthy26/DoclingMatix_5K dataset, indicating a specialization in document understanding and visual question answering related to documents.

Key Capabilities & Enhancements

Building on the foundational Qwen3-VL architecture, this model inherits and potentially enhances:

Superior Text Understanding & Generation: Seamless fusion of text and vision for comprehensive comprehension.
Deeper Visual Perception & Reasoning: Advanced capabilities in interpreting visual information.
Extended Context Length: Supports a native 256K context, expandable to 1M, allowing for processing of long documents and videos.
Enhanced Multimodal Reasoning: Excels in complex tasks like STEM/Math, providing logical and evidence-based answers.
Upgraded Visual Recognition: Broad and high-quality pretraining enables recognition of diverse entities.
Expanded OCR: Supports 32 languages and is robust in challenging conditions, with improved long-document structure parsing.
Visual Agent Capabilities: Operates PC/mobile GUIs, recognizes elements, and completes tasks.
Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.

Architectural Innovations

The base Qwen3-VL model incorporates several architectural updates:

Interleaved-MRoPE: Enhances long-horizon video reasoning through robust positional embeddings.
DeepStack: Fuses multi-level ViT features for fine-grained details and improved image-text alignment.
Text-Timestamp Alignment: Provides precise, timestamp-grounded event localization for stronger video temporal modeling.

Good For

Document Analysis: Given its fine-tuning dataset, it's well-suited for tasks involving understanding and extracting information from documents.
Visual Question Answering: Answering questions based on visual input, especially within document contexts.
Multimodal Applications: Scenarios requiring a strong interplay between visual and textual data.

Overview

Qwen3-VL-2B-Instruct-Docling-5K-30perc-11ep Overview

Key Capabilities & Enhancements

Architectural Innovations

Good For

Full Model Card (README)