xl-zhao/PromptCoT-2.0-SelfPlay-4B
Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quantization: BF16 · Context Length: 32k · Published: Sep 25, 2025 · License: MIT · Architecture: Transformer · Open Weights

PromptCoT-2.0-SelfPlay-4B is a 4-billion-parameter model developed by Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, and Lingpeng Kong, trained via a novel self-play mechanism using Direct Preference Optimization (DPO). It leverages synthetic math and code problems from PromptCoT 2.0 to provide verifiable feedback, enabling autonomous improvement without a stronger external teacher model. The model achieves state-of-the-art performance at the 4B scale on reasoning tasks, particularly mathematics and code generation, outperforming existing open-source baselines and models trained on curated datasets across six benchmarks.
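To make the DPO objective mentioned above concrete, here is a minimal pure-Python sketch of the standard DPO loss for a single preference pair. This is an illustration of the general DPO formulation, not the authors' training code; the function name, the toy log-probabilities, and the choice of `beta=0.1` are all assumptions for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed token log-probability of a response
    under the trainable policy or the frozen reference model.
    """
    # Margin: how much more the policy (relative to the reference)
    # prefers the chosen response over the rejected one.
    margin = (policy_chosen_logp - ref_chosen_logp) \
           - (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the scaled margin; lower is better.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers (hypothetical): the policy already leans toward the
# chosen response, so the loss is below -log(0.5) ≈ 0.693.
print(round(dpo_loss(-12.0, -20.0, -14.0, -19.0), 4))
```

In the self-play setup described above, the "chosen" and "rejected" responses would be selected automatically by verifying the model's own answers against the synthetic problems' ground truth, rather than by human annotators.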
