microsoft/Phi-3-vision-128k-instruct

Vision · Concurrency Cost: 1 · Model Size: 4.1B · Quant: BF16 · Ctx Length: 32k · Published: May 19, 2024 · License: MIT · Architecture: Transformer · Open Weights

The Microsoft Phi-3-Vision-128K-Instruct is a 4.1 billion parameter multimodal model from the Phi-3 family, designed for both text and vision inputs with a 128K token context length. It is built upon high-quality, reasoning-dense synthetic and filtered public datasets. This model is optimized for general image understanding, OCR, and chart/table understanding in memory/compute-constrained and latency-bound environments.


Model Overview

Microsoft's Phi-3-Vision-128K-Instruct is a 4.1 billion parameter multimodal model within the Phi-3 family, capable of processing both text and image inputs. It features a substantial 128K token context length, making it suitable for complex, long-context tasks. The model was trained on a diverse dataset including synthetic data, filtered public websites, and high-quality interleaved image-text data, with a focus on reasoning-dense content. It underwent supervised fine-tuning and direct preference optimization to enhance instruction adherence and safety.

Key Capabilities

  • Multimodal Input: Processes text together with a single image input.
  • Extended Context: Supports a 128K token context length for comprehensive understanding.
  • Instruction Following: Enhanced through rigorous fine-tuning and direct preference optimization.
  • General Image Understanding: Excels in interpreting visual information.
  • OCR & Chart/Table Understanding: Specifically designed for optical character recognition and data visualization interpretation.
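For single-image chat, Phi-3-Vision expects the Phi-3 chat template with an `<|image_N|>` placeholder marking where the image is injected. A minimal sketch of that prompt format in plain Python (in practice the template should come from the model's processor; the helper name here is illustrative):

```python
def build_phi3_vision_prompt(question: str, image_index: int = 1) -> str:
    """Build a single-turn Phi-3-Vision chat prompt.

    <|image_N|> marks where the processed image embeddings are spliced
    in; <|user|>, <|assistant|>, and <|end|> are the Phi-3 chat-template
    role markers.
    """
    return (
        f"<|user|>\n<|image_{image_index}|>\n{question}<|end|>\n<|assistant|>\n"
    )

prompt = build_phi3_vision_prompt("What does this chart show?")
```

The trailing `<|assistant|>\n` leaves the template open so generation continues as the assistant's reply.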

Intended Use Cases

This model is ideal for commercial and research applications requiring visual and text input capabilities, particularly in:

  • Memory or compute-constrained environments.
  • Latency-bound scenarios.
  • General image understanding tasks.
  • OCR and text extraction from images.
  • Analysis of charts, tables, and diagrams.
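For a rough sense of what a 128K-token window means for long-document workloads, a back-of-envelope estimate (assuming ~4 characters per English token and ~500 words per dense page, both common heuristics; actual tokenization varies, and any attached images also consume part of the budget):

```python
CONTEXT_TOKENS = 128_000
CHARS_PER_TOKEN = 4      # rough heuristic for English text
WORDS_PER_PAGE = 500     # typical dense single-spaced page

approx_chars = CONTEXT_TOKENS * CHARS_PER_TOKEN  # ~512,000 characters
approx_words = approx_chars // 5                 # ~5 chars per word, incl. space
approx_pages = approx_words / WORDS_PER_PAGE

print(approx_words, round(approx_pages))  # roughly 102,400 words, ~205 pages
```

Under these assumptions the window comfortably covers book-length documents, which is what makes the model usable for multi-page OCR and report analysis.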

Performance Highlights

Phi-3-Vision-128K-Instruct demonstrates competitive performance across various multimodal benchmarks, including MMMU (40.4), MMBench (80.5), ScienceQA (90.8), and MathVista (44.5), often outperforming similarly sized models such as LLaVA-1.6 Vicuna-7B and Qwen-VL-Chat. It is designed to accelerate research in efficient language and multimodal models.