microsoft/Phi-3-vision-128k-instruct
The Microsoft Phi-3-Vision-128K-Instruct is a 4.1 billion parameter multimodal model from the Phi-3 family, designed for both text and vision inputs with a 128K token context length. It is built upon high-quality, reasoning-dense synthetic and filtered public datasets. This model is optimized for general image understanding, OCR, and chart/table understanding in memory/compute-constrained and latency-bound environments.
Model Overview
Microsoft's Phi-3-Vision-128K-Instruct is a 4.1 billion parameter multimodal model within the Phi-3 family, capable of processing both text and image inputs. It features a substantial 128K token context length, making it suitable for complex, long-context tasks. The model was trained on a diverse dataset including synthetic data, filtered publicly available websites, and high-quality image-text interleaved data, with a focus on reasoning-dense content. It underwent supervised fine-tuning and direct preference optimization to enhance instruction adherence and safety.
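As a quick orientation, the sketch below shows one way the model might be loaded with the Hugging Face transformers library. The `device_map` and `torch_dtype` settings are assumptions that may need adjusting for your hardware; `trust_remote_code=True` is required because the vision components ship as custom modeling code.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Model ID on the Hugging Face Hub.
model_id = "microsoft/Phi-3-vision-128k-instruct"

# Assumption: a single CUDA device is available; adjust device_map/dtype as needed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```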
Key Capabilities
- Multimodal Input: Processes text together with a single image input (a usage sketch follows this list).
- Extended Context: Supports a 128K token context length for comprehensive understanding.
- Instruction Following: Enhanced through supervised fine-tuning and direct preference optimization.
- General Image Understanding: Excels in interpreting visual information.
- OCR & Chart/Table Understanding: Specifically designed for optical character recognition and data visualization interpretation.
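Building on the loading sketch above, the following shows how a single-image prompt might be assembled and run. The `<|image_1|>` placeholder and chat-template call follow the usage pattern published for this model, but the image URL, question, and generation settings are illustrative assumptions.

```python
import requests
from PIL import Image

# Illustrative image; replace with your own input.
url = "https://example.com/chart.png"
image = Image.open(requests.get(url, stream=True).raw)

# The user turn references the attached image via a numbered placeholder.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat does this image show?"},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The processor takes the text prompt and a list of images.
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)

# Strip the prompt tokens before decoding the model's answer.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)
```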
Intended Use Cases
This model is ideal for commercial and research applications requiring visual and text input capabilities, particularly in:
- Memory or compute-constrained environments.
- Latency-bound scenarios.
- General image understanding tasks.
- OCR and extraction from images.
- Analysis of charts, tables, and diagrams (a prompt sketch follows this list).
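For the OCR and chart/table use cases above, the only change from the earlier sketches is the instruction in the user turn. The wording below is an illustrative assumption, not a prescribed prompt, and it reuses `model`, `processor`, and `image` from the previous examples.

```python
# Reusing `model`, `processor`, and `image` from the sketches above.
messages = [
    {
        "role": "user",
        "content": "<|image_1|>\nExtract all text from this image, then "
                   "reproduce any table it contains as a Markdown table.",
    },
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1000)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```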
Performance Highlights
Phi-3-Vision-128K-Instruct demonstrates competitive performance across various multimodal benchmarks, including MMMU (40.4), MMBench (80.5), ScienceQA (90.8), and MathVista (44.5), often outperforming similarly sized models such as LLaVA-1.6 Vicuna-7B and Qwen-VL-Chat. It is designed to accelerate research in efficient language and multimodal models.