Step 3.7 Flash: A High-Performance Vision-Language MoE Model

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model from StepFun, combining a 196B-parameter language backbone with a 1.8B-parameter vision encoder. Designed for high-frequency production workloads, it activates about 11B parameters per token and reaches a throughput of up to 400 tokens per second. The model supports a 256k context window and offers three selectable reasoning levels (low, medium, high) to balance speed, cost, and cognitive depth.
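The headline figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the constants are the numbers quoted in this description, while the function names are hypothetical and not part of any StepFun tooling. Because 400 tokens per second is stated as an upper bound, the generation time it computes is a best-case lower bound, not a latency guarantee.

```python
# Figures quoted in the model description, in billions of parameters.
LANGUAGE_BACKBONE_B = 196.0
VISION_ENCODER_B = 1.8
ACTIVE_PER_TOKEN_B = 11.0      # "about 11B", so treat as approximate
PEAK_TOKENS_PER_SECOND = 400   # stated as an upper bound ("up to")


def total_params_b() -> float:
    """Sum of the language backbone and vision encoder, in billions."""
    return LANGUAGE_BACKBONE_B + VISION_ENCODER_B


def active_fraction() -> float:
    """Approximate share of all parameters activated for each token."""
    return ACTIVE_PER_TOKEN_B / total_params_b()


def min_generation_seconds(n_tokens: int) -> float:
    """Best-case time to emit n_tokens at the quoted peak throughput."""
    return n_tokens / PEAK_TOKENS_PER_SECOND


if __name__ == "__main__":
    print(f"total parameters: {total_params_b():.1f}B")
    print(f"active per token: {active_fraction():.1%}")
    print(f"2,000 tokens take at least {min_generation_seconds(2000):.1f} s")
```

The two components sum to 197.8B, which rounds to the quoted 198B, and roughly 5.6% of the parameters are active for any given token. That sparsity is what lets a model of this size target high-throughput workloads.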

Key Capabilities

Multimodal Perception and Verification: Delivers top-tier visual intelligence, leading SimpleVQA (Search) with 79.2 and reaching frontier parity on V* (Python) at 95.3. It accurately processes visual interfaces and can verify context when visual assets are incomplete.
Workflow Integrity and Tool Orchestration: Leads the ClawEval-1.1 benchmark with 67.1, demonstrating high resistance to adversarial traps and strict adherence to system policies. It reliably interacts with external APIs and executes long-horizon workflows without drifting from instructions.
Code Engineering: Ranks second on SWE-Bench PRO with 56.3. It can trace multi-file repositories, isolate bugs, and generate functional patches that pass automated unit tests.

Good For

Scaling agentic workflows combining perception, search, and reasoning.
Intensive tasks such as parsing massive financial reports or running multi-step search loops.
Operating concurrent coding agents in high-throughput pipelines.
Applications requiring robust visual grounding and retrieval-augmented reasoning.
