Kwaipilot/SRPO-Qwen-32B

Text generation · Concurrency cost: 2 · Model size: 32.8B · Quantization: FP8 · Context length: 32k · Published: Apr 21, 2025 · License: MIT · Architecture: Transformer · Open weights

Kwaipilot/SRPO-Qwen-32B is a 32.8-billion-parameter language model developed by Kwaipilot on the Qwen2.5 architecture. It is trained with SRPO (two-Staged history-Resampling Policy Optimization), a reinforcement learning framework designed to deliver strong cross-domain reasoning in both mathematical and coding tasks. The model exhibits advanced problem-solving behaviors, including self-reflection and code-based verification of its own answers, and outperforms DeepSeek-R1-Zero-32B while using significantly fewer training steps.


Overview of Kwaipilot/SRPO-Qwen-32B

Kwaipilot/SRPO-Qwen-32B is a 32.8 billion parameter model developed by Kwaipilot, implementing a novel Reinforcement Learning (RL) framework called SRPO (two-Staged history-Resampling Policy Optimization). This framework is specifically designed to tackle large-scale, multi-domain reasoning challenges, particularly in mathematics and coding.

Key Capabilities and Innovations

  • Superior Cross-Domain Reasoning: SRPO-Qwen-32B surpasses DeepSeek-R1-Zero-32B on both the AIME24 mathematics benchmark and the LiveCodeBench coding benchmark, scoring 50.0 Pass@1 on AIME24 and 41.6 Pass@1 on LiveCodeBench.
  • Efficient Training: Reaches these results with roughly one-tenth of the training steps used by DeepSeek-R1-Zero-32B.
  • Two-Stage Cross-Domain Training: Employs a unique training paradigm to balance and integrate mathematical reasoning and coding proficiency. Stage 1 focuses on mathematical data to develop Chain-of-Thought (CoT) capabilities, while Stage 2 introduces coding data to build programming skills.
  • History Resampling (HR): A technique that improves training efficiency by filtering out uninformative samples and retaining those that still provide effective gradient signals, which also encourages growth in response length (see the sketch after this list).
  • Emergent Thinking Behaviors: During RL training, the model develops advanced cognitive processes such as self-reflection, correction, and backtracking. It also demonstrates the ability to spontaneously use code to verify mathematical solutions, showcasing strong cross-domain skill integration.
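
The following is a minimal sketch of the history-resampling idea described above, not Kwaipilot's implementation: prompts whose rollouts from the previous epoch were all judged correct contribute no useful gradient signal under group-relative advantage estimation, so they are dropped, while mixed-outcome and hard samples are kept. The key name `rollout_rewards` is a hypothetical field assumed for illustration.

```python
from typing import Dict, List


def history_resample(samples: List[Dict]) -> List[Dict]:
    """Illustrative History Resampling (HR) filter.

    Each sample is assumed to carry the binary rewards of its rollouts from
    the previous epoch under the hypothetical key "rollout_rewards"
    (1 = correct rollout, 0 = incorrect).
    """
    kept = []
    for sample in samples:
        rewards = sample["rollout_rewards"]  # e.g. [1, 0, 1, 1]
        if rewards and all(r == 1 for r in rewards):
            # Uninformative: every rollout already correct, zero advantage.
            continue
        # Mixed-outcome and all-incorrect (hard) samples are retained,
        # since they still provide an effective gradient signal.
        kept.append(sample)
    return kept
```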

Ideal Use Cases

  • Complex Mathematical Problem Solving: Excels in tasks requiring detailed, step-by-step mathematical reasoning.
  • Code Generation and Problem Solving: Highly effective at generating correct Python programs from specifications and passing the associated tests.
  • Applications Requiring Integrated Reasoning: Suitable for scenarios where problems benefit from both logical mathematical deduction and programmatic verification or implementation; a minimal inference sketch follows this list.
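
Below is a minimal inference sketch, assuming the weights are published on Hugging Face under the `Kwaipilot/SRPO-Qwen-32B` identifier and load through the standard `transformers` path for Qwen2.5-based models; the prompt, dtype, and generation settings are illustrative and should be adjusted to your hardware and task.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kwaipilot/SRPO-Qwen-32B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": (
            "Prove that the sum of the first n odd numbers is n^2, "
            "then verify the result for n = 7 with a short Python snippet."
        ),
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Long reasoning chains are expected; leave room for the chain of thought.
outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```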