PromptCoT-2.0-SelfPlay-4B: Autonomous Reasoning at 4B Scale
This 4-billion-parameter model, developed as part of PromptCoT 2.0, advances LLM reasoning through self-play training. Rather than relying on distillation from stronger teacher models, it improves autonomously by generating and solving synthetic math and code problems, using verifiable feedback (e.g., unit-test success, correct boxed answers) to construct positive and negative training pairs.
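The verifiable-feedback signals mentioned above can be sketched in a few lines. The following is a minimal illustration, not the project's actual verifier: the function names are hypothetical, the boxed-answer check is a plain string comparison, and the unit-test runner uses an unsandboxed `exec` (real pipelines would sandbox untrusted generated code).

```python
import re

def check_boxed_answer(generation: str, reference: str) -> bool:
    """Verify a math generation by comparing its last \\boxed{...}
    answer to the reference answer (whitespace-insensitive match)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", generation)
    return bool(matches) and matches[-1].strip() == reference.strip()

def check_unit_tests(code: str, tests: str) -> bool:
    """Verify a code generation by executing it together with its unit
    tests in a shared namespace; any exception counts as failure."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate solution
        exec(tests, namespace)  # run assertions against it
        return True
    except Exception:
        return False
```

Either check yields a binary pass/fail label per attempt, which is exactly the signal needed to sort generations into positive and negative training pairs.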
Key Capabilities & Innovations
- Self-Play Training with DPO: Utilizes Direct Preference Optimization (DPO) to align model generations with automatically verified outcomes, eliminating the need for supervision from larger, more powerful models.
- Synthetic Problem Generation: Leverages PromptCoT 2.0 to create a continuous stream of diverse and challenging math and code problems.
- State-of-the-Art 4B Performance: Sets new benchmarks among 4B models, consistently outperforming strong open-source baselines such as Qwen3-4B-Thinking-2507 as well as models trained on curated datasets like OpenMathReasoning and OpenCodeReasoning.
- Exceptional Reasoning Gains: Demonstrates substantial improvements on high-difficulty tasks, including an 11.0-point gain on HMMT and an 82-point Elo gain on Codeforces over the next-best models.
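The self-play loop described above can be illustrated with a short sketch. This is an assumption-laden simplification: the pairing scheme (every passing attempt vs. every failing attempt) and function names are illustrative, and real DPO training operates on summed token-level log-probabilities of full sequences rather than the scalar placeholders used here.

```python
import math

def build_pairs(attempts):
    """Turn verified attempts [(text, passed)] on one problem into
    (chosen, rejected) preference pairs: each pass vs. each fail."""
    passed = [t for t, ok in attempts if ok]
    failed = [t for t, ok in attempts if not ok]
    return [(c, r) for c in passed for r in failed]

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair:
    -log sigmoid(beta * (policy_logratio - reference_logratio))."""
    pi_logratio = logp_chosen - logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    margin = beta * (pi_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the chosen/rejected labels come from automatic verification rather than human annotation or a teacher model, the loop can run indefinitely on freshly generated problems.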
Ideal Use Cases
- Mathematical Problem Solving: Excels in complex math reasoning, as evidenced by top scores on AIME and HMMT benchmarks.
- Code Generation & Debugging: Highly effective for coding tasks, achieving leading performance on LiveCodeBench and Codeforces.
- Autonomous AI Development: Suitable for applications requiring models that can learn and improve reasoning capabilities without constant human oversight or reliance on external teachers.