stepfun-ai/Step-3.5-Flash

Text Generation · Concurrency Cost: 4 · Model Size: 199B · Quant: FP8 · Ctx Length: 32k · Published: Feb 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

Step 3.5 Flash by stepfun-ai is a 196.81-billion-parameter sparse Mixture-of-Experts (MoE) model that activates approximately 11 billion parameters per token. It features a 256K context window and is engineered for frontier reasoning and agentic capabilities with exceptional efficiency. The model excels at coding and complex, multi-step reasoning, achieving high throughput for real-time interaction and strong results on benchmarks such as SWE-bench Verified and Terminal-Bench 2.0.


Step 3.5 Flash: Frontier Reasoning and Agentic Capabilities

Step 3.5 Flash, developed by stepfun-ai, is a powerful open-source foundation model built on a sparse Mixture-of-Experts (MoE) architecture. Although it has 196.81 billion total parameters, it activates only ~11 billion parameters per token, allowing it to deliver deep reasoning comparable to top-tier proprietary models while remaining fast to serve.
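The sparse-activation idea can be sketched in a few lines: a router scores a pool of experts per token and only the top-k are run, so only a small fraction of the expert weights is touched. The expert count and top-k below are hypothetical placeholders (Step 3.5 Flash's actual routing configuration is not stated here); they are chosen only so the arithmetic roughly mirrors the ~197B-total vs. ~11B-active figures above.

```python
# Minimal sketch of top-k MoE routing (illustrative, not the model's real config).
NUM_EXPERTS = 64   # hypothetical expert pool size per MoE layer
TOP_K = 4          # hypothetical number of experts routed per token

def route(token_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(token_scores)),
                    key=token_scores.__getitem__, reverse=True)
    return ranked[:k]

# Toy router logits for a single token:
scores = [0.02, 0.91, 0.13, 0.77]
print(route(scores, k=2))                                  # -> [1, 3]

# Only this fraction of expert weights participates per token:
print(f"{TOP_K / NUM_EXPERTS:.2%} of expert weights active")  # -> 6.25%
```

Because the dense (non-expert) layers still run for every token, the true active fraction is a bit higher than the expert ratio alone, which is roughly where the ~11B-of-197B figure comes from.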

Key Capabilities

  • Deep Reasoning at Speed: Utilizes 3-way Multi-Token Prediction (MTP-3) to achieve generation throughputs of 100–300 tok/s (peaking at 350 tok/s for coding), enabling complex, multi-step reasoning chains with immediate responsiveness.
  • Robust Engine for Coding & Agents: Purpose-built for agentic tasks with a scalable RL framework, achieving 74.4% on SWE-bench Verified and 51.0% on Terminal-Bench 2.0.
  • Efficient Long Context: Supports a cost-efficient 256K context window through a hybrid 3:1 Sliding Window Attention (SWA) ratio, reducing computational overhead.
  • Accessible Local Deployment: Optimized for secure local inference on high-end consumer hardware, ensuring data privacy without sacrificing performance.
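The throughput gain from multi-token prediction can be seen with a back-of-envelope model: each decode step drafts several tokens, and the model keeps the accepted prefix plus one verified token. The acceptance probability below is a made-up assumption for illustration, not a measured property of Step 3.5 Flash.

```python
# Expected tokens emitted per decode step with k drafted tokens and
# per-token acceptance probability p (accepted prefix + 1 verified token).
# A simple way to see why MTP-3 can multiply decoding throughput.
def expected_tokens_per_step(k, p):
    return 1 + sum(p ** i for i in range(1, k + 1))

plain = expected_tokens_per_step(0, 0.0)   # ordinary decoding: 1 token/step
mtp3  = expected_tokens_per_step(3, 0.8)   # hypothetical 80% acceptance
print(plain)              # -> 1
print(round(mtp3, 3))     # -> 2.952
```

Under this toy assumption, 3-way drafting nearly triples tokens per step, consistent in spirit with the 100-300 tok/s range quoted above, though real acceptance rates vary by domain (coding tends to be more predictable, hence the higher peak).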
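The hybrid 3:1 attention layout can be sketched as follows: three out of every four layers use sliding-window attention, which looks only at a fixed-size window of recent tokens, while every fourth layer keeps full causal attention over the entire context. The window size and layer pattern below are illustrative assumptions (real windows are in the thousands of tokens), not the model's actual configuration.

```python
# Illustrative hybrid attention layout: 3 sliding-window layers per 1 full layer.
WINDOW = 4  # toy sliding-window size; real models use thousands of tokens

def attended_positions(layer_idx, query_pos):
    """Key positions a causal query at query_pos can attend to at this layer."""
    if layer_idx % 4 == 3:                    # every 4th layer: full causal attention
        lo = 0
    else:                                     # SWA layer: only the last WINDOW tokens
        lo = max(0, query_pos - WINDOW + 1)
    return list(range(lo, query_pos + 1))

# At position 10, an SWA layer sees far fewer keys than a full layer:
print(len(attended_positions(0, 10)))   # -> 4  (window-limited)
print(len(attended_positions(3, 10)))   # -> 11 (full causal)
```

Since SWA layers cost O(window) per query instead of O(context), keeping them in the majority is what makes the 256K window cost-efficient while the interleaved full layers preserve long-range information flow.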

Good for

  • Agentic AI applications requiring fast, complex, multi-step reasoning.
  • Coding tasks and long-horizon problem-solving in development environments.
  • Real-time interactive systems where high generation throughput is critical.
  • Local deployments on consumer-grade hardware for privacy-sensitive or offline use cases.