ahme0599/Qwen_Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4
Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: Dec 15, 2025 | Architecture: Transformer

ahme0599/Qwen_Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4 is a 1.5 billion parameter instruction-tuned causal language model fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method originally introduced for mathematical reasoning in DeepSeekMath, using the TRL library, and is optimized for instruction-following and response-generation tasks.
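
Below is a minimal inference sketch, assuming the checkpoint is hosted on the Hugging Face Hub under the repo id above and loads with the standard transformers causal-LM classes, as a Qwen2.5 fine-tune normally does. The prompt content is illustrative only.

```python
# Hedged example: assumes the repo id below resolves on the Hugging Face Hub
# and that the model uses the standard Qwen2.5 chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ahme0599/Qwen_Qwen2.5-1.5B-Instruct-GRPO-vanilla_G_4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 quantization listed above
    device_map="auto",           # requires the accelerate package
)

# Qwen2.5-Instruct models expect chat-formatted input.
messages = [
    {"role": "user", "content": "Explain gradient descent in one sentence."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```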
