microsoft/Phi-3.5-vision-instruct

Modality: Vision · Concurrency Cost: 1 · Model Size: 4.1B · Quant: BF16 · Ctx Length: 32k · Published: Aug 16, 2024 · License: MIT · Architecture: Transformer · Open Weights

microsoft/Phi-3.5-vision-instruct is a 4.1-billion-parameter open multimodal model developed by Microsoft, featuring a 128K-token context length. It is designed for general-purpose AI systems that take both visual and text input, and it excels in memory- and compute-constrained environments and latency-bound scenarios. The model is particularly strong at general image understanding, OCR, chart and table understanding, and multi-image reasoning, including summarization and comparison.


Overview

microsoft/Phi-3.5-vision-instruct is a 4.1-billion-parameter open multimodal model from Microsoft's Phi-3 family, capable of processing both text and image inputs with a 128K-token context length. It was trained on a diverse dataset including synthetic data, filtered public websites, and high-quality, reasoning-dense data for both text and vision. The model underwent supervised fine-tuning and direct preference optimization to improve instruction adherence and safety. A recent update enhanced its multi-frame image understanding, improving performance on benchmarks such as MMMU, MMBench, and TextVQA.

Key Capabilities

  • Multimodal Input: Processes both text and multiple images.
  • Extended Context: Supports a 128K token context length for comprehensive understanding.
  • Multi-frame Reasoning: Excels at detailed image comparison, multi-image summarization, and video summarization.
  • Document Intelligence: Strong performance in optical character recognition (OCR) and chart/table understanding.
  • Optimized Performance: Designed for memory/compute constrained and latency-bound environments.
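To illustrate how multi-image input is typically expressed for this model family, the sketch below builds a chat prompt with one numbered `<|image_i|>` placeholder per attached image, following the `<|user|>` / `<|end|>` / `<|assistant|>` turn markers used by Phi-3-style chat templates. The helper name is ours; the exact template the processor applies may differ slightly, so treat this as an illustration of the placeholder convention rather than a canonical implementation.

```python
def build_prompt(user_text: str, num_images: int) -> str:
    """Build a Phi-3-style chat prompt with one 1-indexed <|image_i|>
    placeholder per attached image, followed by the user's question."""
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}{user_text}<|end|>\n<|assistant|>\n"

# Two images plus a comparison question for multi-image reasoning:
prompt = build_prompt("Compare these charts.", 2)
```

In practice the same placeholder string is what you pass as the user message content; the model's processor resolves each `<|image_i|>` against the image list supplied alongside the text.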

Good For

  • General image understanding and analysis.
  • Applications requiring optical character recognition (OCR).
  • Interpreting and extracting information from charts, tables, and diagrams.
  • Comparing and reasoning across multiple images or video clips.
  • Developing AI systems where computational resources or response times are critical.
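For the use cases above, a minimal inference sketch with Hugging Face `transformers` might look like the following. It assumes the model's standard remote-code loading path; the helper name `describe_images` and parameter choices such as `num_crops=4` and `max_new_tokens=256` are illustrative, not prescribed by the model card. Running it requires `transformers`, `torch`, and `Pillow`, and downloads the ~4.1B-parameter weights on first use.

```python
MODEL_ID = "microsoft/Phi-3.5-vision-instruct"

def describe_images(image_paths, question, max_new_tokens=256):
    """One multi-image chat turn: load model + processor, attach each
    image via a numbered <|image_i|> placeholder, and decode only the
    newly generated tokens. Heavy imports are kept inside the function."""
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(
        MODEL_ID, trust_remote_code=True, num_crops=4  # num_crops is illustrative
    )

    images = [Image.open(p) for p in image_paths]
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, len(images) + 1))
    messages = [{"role": "user", "content": placeholders + question}]
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    inputs = processor(prompt, images, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = out[:, inputs["input_ids"].shape[1]:]  # strip the prompt
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

A call such as `describe_images(["q1.png", "q2.png"], "Compare these charts.")` exercises the multi-image comparison capability; for latency-bound deployments, the same function body would typically be split so the model and processor are loaded once and reused across requests.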