Blancy/DeepSeek-R1-Distill-Qwen-0.5B-GRPO
Text generation
Concurrency cost: 1
Model size: 0.5B
Quantization: BF16
Context length: 32k
Published: Apr 16, 2025
Architecture: Transformer
Blancy/DeepSeek-R1-Distill-Qwen-0.5B-GRPO is a 0.5-billion-parameter language model, fine-tuned from an unspecified base model with the TRL library on the simplescaling/s1K-1.1 dataset. Training uses GRPO (Group Relative Policy Optimization), the reinforcement-learning method introduced in DeepSeekMath, to strengthen the model's reasoning capabilities. It is aimed at tasks that require careful logical processing and precise, structured outputs.
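A minimal usage sketch with the Hugging Face transformers library is shown below. The prompt, generation parameters, and the assumption that the tokenizer ships a chat template are illustrative and not taken from the model card.

```python
# Minimal inference sketch; generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Blancy/DeepSeek-R1-Distill-Qwen-0.5B-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Build a chat-style prompt; a chat template in the tokenizer is assumed.
messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

# Generate and print only the newly produced tokens.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```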