microsoft/Phi-3.5-vision-instruct
microsoft/Phi-3.5-vision-instruct is a 4.2 billion parameter open multimodal model developed by Microsoft, featuring a 128K token context length. It is designed for general-purpose AI systems that take both visual and text input, and it targets memory/compute-constrained environments and latency-bound scenarios. The model is particularly strong in general image understanding, OCR, chart and table understanding, and multi-image reasoning, including summarization and comparison.
Overview
microsoft/Phi-3.5-vision-instruct is a 4.2 billion parameter open multimodal model from Microsoft's Phi-3 family, capable of processing both text and image inputs with a 128K token context length. It was trained on a diverse dataset combining synthetic data, filtered public websites, and high-quality, reasoning-dense data for both text and vision. The model underwent supervised fine-tuning and direct preference optimization to improve instruction adherence and safety. A recent update enhanced its multi-frame image understanding, improving performance on benchmarks such as MMMU, MMBench, and TextVQA.
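The snippet below is a minimal sketch of running the model with the Hugging Face transformers library, following the usage pattern documented on the model card; the image URL is a placeholder, and settings such as num_crops and the attention implementation may need adjusting for your hardware.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# trust_remote_code is required: the repo ships custom modeling/processing code.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",  # use "flash_attention_2" if it is installed
)
# num_crops=4 is suggested for multi-frame input, 16 for single-image tasks.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)

# Placeholder URL; substitute a real image.
url = "https://example.com/chart.png"
image = Image.open(requests.get(url, stream=True).raw)

# Images are referenced in the prompt via numbered <|image_N|> placeholders.
messages = [{"role": "user", "content": "<|image_1|>\nSummarize this chart."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
output_ids = model.generate(
    **inputs, max_new_tokens=500, eos_token_id=processor.tokenizer.eos_token_id
)
# Drop the prompt tokens so only the generated answer is decoded.
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```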
Key Capabilities
- Multimodal Input: Processes both text and multiple images.
- Extended Context: Supports a 128K token context length, enough for long documents or many images in a single prompt.
- Multi-frame Reasoning: Excels at detailed image comparison, multi-image summarization, and video summarization (see the multi-image sketch after this list).
- Document Intelligence: Strong performance in optical character recognition (OCR) and chart/table understanding.
- Optimized Performance: Designed for memory/compute constrained and latency-bound environments.
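Multi-frame input uses the same prompt pattern: pass a list of images and reference each one with its own numbered placeholder. A minimal sketch, reusing the model and processor loaded above (the slide URLs are hypothetical):

```python
# Summarize several frames/slides in one request; reuses `model` and `processor`.
urls = [f"https://example.com/slides/page_{i}.jpg" for i in range(1, 4)]  # placeholders
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

# One numbered placeholder per image, in order, followed by the instruction.
placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, len(images) + 1))
messages = [{"role": "user", "content": placeholders + "Summarize the slide deck."}]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, images, return_tensors="pt").to("cuda")
output_ids = model.generate(
    **inputs, max_new_tokens=1000, eos_token_id=processor.tokenizer.eos_token_id
)
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```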
Good For
- General image understanding and analysis.
- Applications requiring optical character recognition (OCR).
- Interpreting and extracting information from charts, tables, and diagrams (a structured-extraction sketch follows this list).
- Comparing and reasoning across multiple images or video clips.
- Developing AI systems where compute resources or response latency are tight constraints.
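For document-intelligence tasks such as OCR and table extraction, the same interface applies; a common pattern is to steer the output format through the prompt. A small illustrative sketch, again reusing the loaded model and processor (the file path is a placeholder):

```python
# Transcribe a table from a scanned page as Markdown; "doc.png" is a placeholder path.
image = Image.open("doc.png")
messages = [
    {"role": "user", "content": "<|image_1|>\nTranscribe the table in this image as Markdown."}
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
output_ids = model.generate(
    **inputs, max_new_tokens=800, eos_token_id=processor.tokenizer.eos_token_id
)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```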