GLM-4.7-Flash Overview
GLM-4.7-Flash, developed by the GLM Team, is a 30-billion-parameter Mixture-of-Experts (MoE) model positioned as a leading option for lightweight deployment. It aims to balance strong performance with computational efficiency, making it well suited to scenarios where resource constraints matter.
Key Capabilities and Performance
This model demonstrates strong performance across various benchmarks, particularly in areas related to agentic tasks, reasoning, and coding. Notable benchmark results include:
- AIME 25: 91.6
- GPQA: 75.2
- HLE: 14.4
- SWE-bench Verified: 59.2 (significantly outperforming Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B)
- τ²-Bench: 79.5 (also well ahead of these competitors)
- BrowseComp: 42.8 (a substantial lead over other models)
These results highlight its proficiency in complex problem-solving, particularly in coding and agentic reasoning tasks. For multi-turn agentic tasks like τ²-Bench and Terminal Bench 2, the model leverages a "Preserved Thinking mode" for enhanced performance.
Deployment and Usage
GLM-4.7-Flash supports local deployment via popular inference frameworks such as vLLM and SGLang, with step-by-step instructions in the official GitHub repository. It can also be run through the Hugging Face Transformers library. Recommended evaluation parameters for different tasks, such as temperature and max_new_tokens, are provided to tune performance for specific use cases, including agentic and coding-focused scenarios.
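As a minimal sketch of what local serving with vLLM could look like: the Hugging Face repo ID, port, and sampling values below are assumptions for illustration, not details confirmed by this overview; consult the official GitHub repository for the recommended settings.

```shell
# Serve the model with vLLM's OpenAI-compatible server.
# NOTE: the repo ID is an assumption; substitute the official one.
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 1 \
  --port 8000

# Query the endpoint. The temperature and token budget here are
# illustrative values, not the officially recommended evaluation
# parameters for agentic or coding tasks.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zai-org/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "Write a binary search in Python."}],
        "temperature": 0.6,
        "max_tokens": 1024
      }'
```

SGLang exposes a similar OpenAI-compatible endpoint via its own launch command, so the same curl request would work against either backend once the server is up.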