microsoft/Phi-3-vision-128k-instruct
The Microsoft Phi-3-Vision-128K-Instruct is a 4.1 billion parameter multimodal model from the Phi-3 family, designed for both text and vision inputs with a 128K token context length. It is built upon high-quality, reasoning-dense synthetic and filtered public datasets. This model is optimized for general image understanding, OCR, and chart/table understanding in memory/compute-constrained and latency-bound environments.
Model Overview
Microsoft's Phi-3-Vision-128K-Instruct is a 4.1 billion parameter multimodal model within the Phi-3 family, capable of processing both text and image inputs. It features a substantial 128K token context length, making it suitable for complex, long-context tasks. The model was trained on a diverse dataset including synthetic data, filtered publicly available websites, and high-quality image-text interleaved data, with a focus on reasoning-dense content. It underwent supervised fine-tuning and direct preference optimization to enhance instruction adherence and safety.
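As a quick orientation, the sketch below shows one way the model might be loaded with the Hugging Face transformers library. The `device_map` and `torch_dtype` settings are assumptions that may need adjusting for your hardware; `trust_remote_code=True` is required because the vision components ship as custom modeling code.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Model ID on the Hugging Face Hub.
model_id = "microsoft/Phi-3-vision-128k-instruct"

# Assumption: a single CUDA device is available; adjust device_map/dtype as needed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```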
Key Capabilities
- Multimodal Input: Processes text together with a single image input (a usage sketch follows this list).
- Extended Context: Supports a 128K token context length for comprehensive understanding.
- Instruction Following: Enhanced through supervised fine-tuning and direct preference optimization.
- General Image Understanding: Excels in interpreting visual information.
- OCR & Chart/Table Understanding: Specifically designed for optical character recognition and data visualization interpretation.
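Building on the loading sketch above, the following shows how a single-image prompt might be assembled and run. The `<|image_1|>` placeholder and chat-template call follow the usage pattern published for this model, but the image URL, question, and generation settings are illustrative assumptions.

```python
import requests
from PIL import Image

# Illustrative image; replace with your own input.
url = "https://example.com/chart.png"
image = Image.open(requests.get(url, stream=True).raw)

# The user turn references the attached image via a numbered placeholder.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat does this image show?"},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The processor takes the text prompt and a list of images.
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)

# Strip the prompt tokens before decoding the model's answer.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)
```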
Intended Use Cases
This model is ideal for commercial and research applications requiring visual and text input capabilities, particularly in:
- Memory or compute-constrained environments.
- Latency-bound scenarios.
- General image understanding tasks.
- OCR and extraction from images.
- Analysis of charts, tables, and diagrams (a prompt sketch follows this list).
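For the OCR and chart/table use cases above, the only change from the earlier sketches is the instruction in the user turn. The wording below is an illustrative assumption, not a prescribed prompt, and it reuses `model`, `processor`, and `image` from the previous examples.

```python
# Reusing `model`, `processor`, and `image` from the sketches above.
messages = [
    {
        "role": "user",
        "content": "<|image_1|>\nExtract all text from this image, then "
                   "reproduce any table it contains as a Markdown table.",
    },
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1000)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```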
Performance Highlights
Phi-3-Vision-128K-Instruct demonstrates competitive performance across various multimodal benchmarks, including MMMU (40.4), MMBench (80.5), ScienceQA (90.8), and MathVista (44.5), often outperforming similarly sized models such as LLaVA-1.6 Vicuna-7B and Qwen-VL-Chat. It is designed to accelerate research in efficient language and multimodal models.