stepfun-ai/Step-3.7-Flash

VISION | Concurrency Cost: 4 | Model Size: 201.4B | Quant: FP8 | Ctx Length: 32k | Published: May 23, 2026 | License: apache-2.0 | Architecture: Transformer | 0.4K | Open Weights | Warm

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model developed by StepFun. It combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder and activates approximately 11B parameters per token, which keeps per-token compute low relative to its total size. With a 256k context window, the model is engineered for high-frequency production workloads and agentic workflows that require multimodal perception, robust tool orchestration, and code engineering capabilities. It is suited to tasks such as parsing large documents, running multi-step search loops, and operating concurrent coding agents.


Step 3.7 Flash: A High-Performance Vision-Language MoE Model

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model from StepFun, built from a 196B-parameter language backbone and a 1.8B-parameter vision encoder. Designed for high-frequency production workloads, it activates around 11B parameters per token and reaches a throughput of up to 400 tokens per second. The model supports a 256k context window and offers selectable reasoning levels (low, medium, high) to balance speed, cost, and reasoning depth.
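The parameter figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constants are copied from the description, while the function names and the 2,000-token example are assumptions introduced here, and the 400 tokens per second figure is treated as a best-case upper bound rather than a guaranteed rate.

```python
# Figures taken from the model description (in billions of parameters).
LANGUAGE_BACKBONE_B = 196.0
VISION_ENCODER_B = 1.8
ACTIVE_PER_TOKEN_B = 11.0

# Stated peak throughput ("up to"), used here as an upper bound.
PEAK_TOKENS_PER_SECOND = 400.0


def total_params_b() -> float:
    """Sum of the language backbone and the vision encoder, in billions."""
    return LANGUAGE_BACKBONE_B + VISION_ENCODER_B


def active_fraction() -> float:
    """Share of all parameters that are activated for a single token."""
    return ACTIVE_PER_TOKEN_B / total_params_b()


def min_generation_seconds(output_tokens: int) -> float:
    """Lower bound on generation time if the peak rate were sustained."""
    return output_tokens / PEAK_TOKENS_PER_SECOND


if __name__ == "__main__":
    print(f"total parameters: {total_params_b():.1f}B")
    print(f"active per token: {active_fraction():.1%}")
    print(f"2,000 tokens take at least {min_generation_seconds(2000):.1f} s")
```

The sum comes to 197.8B, which the description rounds to 198B, and roughly one parameter in eighteen is active for any given token.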

Key Capabilities

  • Multimodal Perception and Verification: Leads SimpleVQA (Search) with a score of 79.2 and reaches frontier parity on V* (Python) at 95.3. It accurately processes visual interfaces and can verify context for incomplete visual assets.
  • Workflow Integrity and Tool Orchestration: Leads the ClawEval-1.1 benchmark with 67.1, demonstrating high resistance to adversarial traps and strict adherence to system policies. It reliably interacts with external APIs and executes long-horizon workflows without drifting from instructions.
  • Code Engineering: Ranks second on SWE-Bench PRO with 56.3. It can trace multi-file repositories, isolate bugs, and generate functional patches that pass automated unit tests.

Good For

  • Scaling agentic workflows combining perception, search, and reasoning.
  • Intensive tasks such as parsing massive financial reports or running multi-step search loops.
  • Operating concurrent coding agents in high-throughput pipelines.
  • Applications requiring robust visual grounding and retrieval-augmented reasoning.
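For document-heavy workloads such as the ones listed above, it can help to check a request against the context window before sending it. This is a minimal sketch, assuming that "256k" and "32k" mean 256 × 1024 and 32 × 1024 tokens (the source does not say) and that token counts come from whatever tokenizer the caller uses. `fits_in_context` is a hypothetical helper introduced here, not part of any StepFun or hosting API.

```python
# Assumed interpretations of the stated limits; the source gives only "256k" and "32k".
MODEL_CONTEXT_TOKENS = 256 * 1024   # context window stated for the model
LISTED_CONTEXT_TOKENS = 32 * 1024   # "Ctx Length" shown in the listing metadata


def fits_in_context(prompt_tokens: int, max_new_tokens: int, context_limit: int) -> bool:
    """Return True if the prompt plus the requested output fits within the limit."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= context_limit


if __name__ == "__main__":
    # Hypothetical request: a 100,000-token report with a 4,000-token answer.
    prompt, output = 100_000, 4_000
    print(fits_in_context(prompt, output, MODEL_CONTEXT_TOKENS))
    print(fits_in_context(prompt, output, LISTED_CONTEXT_TOKENS))
```

Under these assumptions the example request fits the model's stated window but not the smaller limit shown in the listing, so it would need chunking or retrieval on a deployment capped at that limit.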