thangvip/qwen2.5-1.5b-grpo-sgd-linear
Text generation · Concurrency cost: 1 · Model size: 1.5B · Quantization: BF16 · Context length: 32k · Published: Feb 17, 2026 · Architecture: Transformer

The thangvip/qwen2.5-1.5b-grpo-sgd-linear model is a 1.5-billion-parameter language model fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper to strengthen mathematical reasoning. With a context length of 32,768 tokens, it is suited to applications that benefit from improved reasoning and mathematical problem-solving.
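The core idea of GRPO can be illustrated in a few lines: instead of a learned value-function baseline (as in PPO), GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's own mean and standard deviation. The sketch below is a minimal illustration of that group-relative advantage computation, not the model's actual training code; the function name and reward values are hypothetical.

```python
# Minimal sketch of GRPO's group-relative advantage step
# (Group Relative Policy Optimization, DeepSeekMath).
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-completion rewards within one sampled group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one math prompt,
# scored 1.0 if the final answer is correct, else 0.0.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # correct answers get positive advantage, incorrect negative
```

These per-completion advantages then weight the policy-gradient update, so completions that beat their own group's average are reinforced without training a separate critic.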
