Overview of Kwaipilot/SRPO-Qwen-32B
Kwaipilot/SRPO-Qwen-32B is a 32.8-billion-parameter model developed by Kwaipilot, implementing a novel Reinforcement Learning (RL) framework called SRPO (two-Staged history-Resampling Policy Optimization). This framework is specifically designed to tackle large-scale, multi-domain reasoning challenges, particularly in mathematics and coding.
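As a rough intuition for the "Policy Optimization" part of the framework, methods in this family typically score each sampled rollout relative to the other rollouts for the same prompt. The sketch below shows a GRPO-style group-relative advantage; whether SRPO uses exactly this normalization is an assumption here, and the function name is illustrative.

```python
import statistics

def group_relative_advantages(rewards):
    """Advantage of each rollout relative to its group's mean reward,
    normalized by the group's standard deviation (GRPO-style; exactly
    matching SRPO's objective is NOT claimed here)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Mixed outcomes yield a useful signal ...
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [1.0, -1.0, 1.0, -1.0]
# ... while identical outcomes yield zero advantage everywhere.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0.0, 0.0, 0.0, 0.0]
```

Note that a group whose rewards are all identical produces zero advantage for every rollout, which is the intuition behind the History Resampling technique described below.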
Key Capabilities and Innovations
- Superior Cross-Domain Reasoning: SRPO-Qwen-32B surpasses DeepSeek-R1-Zero-32B on both the AIME24 mathematical benchmark and the LiveCodeBench coding benchmark, achieving 50.0 Pass@1 on AIME24 and 41.6 Pass@1 on LiveCodeBench.
- Efficient Training: Achieves these results with roughly one-tenth of the training steps required by DeepSeek-R1-Zero-32B.
- Two-Stage Cross-Domain Training: Employs a unique training paradigm to balance and integrate mathematical reasoning and coding proficiency. Stage 1 focuses on mathematical data to develop Chain-of-Thought (CoT) capabilities, while Stage 2 introduces coding data to build programming skills.
- History Resampling (HR): A technique that improves training efficiency by filtering out uninformative samples and retaining those that provide effective gradient signals, which also promotes steady growth in response length during training.
- Emergent Thinking Behaviors: During RL training, the model develops advanced cognitive processes such as self-reflection, correction, and backtracking. It also demonstrates the ability to spontaneously use code to verify mathematical solutions, showcasing strong cross-domain skill integration.
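The History Resampling idea above can be sketched in a few lines. This is a minimal illustration under one plausible reading of "uninformative": a prompt whose sampled rollouts all received the same reward contributes no gradient signal in group-relative optimization, so it is dropped from the next epoch. Function and variable names are hypothetical, not the actual SRPO implementation, and the real method's handling of all-incorrect samples may differ.

```python
def provides_gradient_signal(rewards):
    """A prompt is informative if its rollout rewards are not all equal;
    identical rewards (all correct or all wrong) give zero advantage."""
    return len(set(rewards)) > 1

def history_resample(batch):
    """Keep only prompts whose recorded rollout rewards show mixed outcomes.

    `batch` is a list of (prompt, rewards) pairs, where `rewards` holds the
    scalar reward of each sampled rollout for that prompt.
    """
    return [(p, r) for p, r in batch if provides_gradient_signal(r)]

batch = [
    ("too easy",  [1.0, 1.0, 1.0, 1.0]),   # all correct -> filtered out
    ("too hard",  [0.0, 0.0, 0.0, 0.0]),   # all wrong   -> filtered out here
    ("informative", [1.0, 0.0, 1.0, 0.0]), # mixed       -> retained
]
print(history_resample(batch))  # only the "informative" prompt survives
```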
Ideal Use Cases
- Complex Mathematical Problem Solving: Excels in tasks requiring detailed, step-by-step mathematical reasoning.
- Code Generation and Problem Solving: Highly effective at generating Python programs that satisfy a given specification and pass its tests.
- Applications Requiring Integrated Reasoning: Suitable for scenarios where problems benefit from both logical mathematical deduction and programmatic verification or implementation.
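The last use case mirrors the emergent behavior noted above: derive a result mathematically, then confirm it with a short program. The toy example below (chosen for illustration; it is not from the model's outputs) checks the closed form "the sum of the first n odd numbers is n²" against a brute-force computation.

```python
def closed_form(n):
    """Claimed closed form: sum of the first n odd numbers equals n**2."""
    return n * n

def brute_force(n):
    """Direct computation of 1 + 3 + 5 + ... + (2n - 1)."""
    return sum(2 * k + 1 for k in range(n))

# Programmatic verification of the mathematical claim over a range of n.
assert all(closed_form(n) == brute_force(n) for n in range(1, 200))
print("verified: sum of first n odd numbers == n^2")
```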