thangvip/qwen2.5-1.5b-grpo-no-sft-sgd-linear
Text generation · Concurrency cost: 1 · Model size: 1.5B · Quant: BF16 · Context length: 32k · Published: Feb 16, 2026 · Architecture: Transformer

thangvip/qwen2.5-1.5b-grpo-no-sft-sgd-linear is a 1.5-billion-parameter causal language model fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained with GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in DeepSeekMath, applied directly to the instruct model without an additional SFT stage. With a context length of 32,768 tokens, it is particularly suited for tasks that benefit from improved mathematical and logical reasoning, building on the base Qwen2.5 architecture.
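The key idea in GRPO is to replace a learned value-function baseline with group statistics: several completions are sampled per prompt, and each completion's advantage is its reward normalized against the group mean and standard deviation. A minimal sketch of that computation (the reward values below are hypothetical, purely for illustration; this is not code from this model's training run):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of sampled completions.

    Each reward is normalized against its own group:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    so no separate value network is needed as a baseline.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four completions sampled for one math prompt, scored by a
# rule-based reward (1.0 if the final answer is correct, else 0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)  # correct answers get positive advantage, wrong get negative
```

These advantages then weight the policy-gradient update on each completion's tokens, so the model is pushed toward completions that outperform their own group's average.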
