Name: Blancy/DeepSeek-R1-Distill-Qwen-0.5B-GRPO API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: Blancy

Model Overview

Blancy/DeepSeek-R1-Distill-Qwen-0.5B-GRPO is a compact 0.5-billion-parameter language model, fine-tuned with the TRL (Transformer Reinforcement Learning) library on the simplescaling/s1K-1.1 dataset.

Key Differentiator: GRPO Method

A core aspect of this model is its training with GRPO (Group Relative Policy Optimization). This method, introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), is a variant of PPO that drops the learned value (critic) model. Instead, it samples a group of completions for each prompt and estimates each completion's advantage by comparing its reward with the rest of the group, which also reduces memory use compared with PPO. The paper proposed GRPO to strengthen mathematical reasoning; by training with it, this model aims for better logical processing and problem solving, particularly in structured or complex domains.
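A minimal sketch of the group-relative advantage at the heart of GRPO, following the outcome-supervision form described in the DeepSeekMath paper: sample a group of completions for one prompt, score each with a reward, then normalize each reward by the group's mean and standard deviation. Every token of a completion shares that completion's advantage. The function name, the `eps` guard, the binary reward values, and the use of the population standard deviation are illustrative assumptions, not this model's actual training code; some implementations use the sample standard deviation instead.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its group (same prompt).

    Each reward is centered on the group mean and scaled by the group's
    standard deviation, so no learned value model is needed as a baseline.
    """
    mean = statistics.fmean(rewards)
    # Population std (an assumption); eps avoids division by zero when all rewards tie.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: four completions for one prompt, scored 1.0 if correct else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # correct answers get positive advantage, wrong ones negative
```

Because advantages are computed within each group, a prompt where every completion receives the same reward yields zero advantage for all of them and so contributes no reward signal through the advantage term.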

Training Details

The training run is publicly viewable on Weights & Biases. The model was trained with the following framework versions:

TRL: 0.15.2
Transformers: 4.49.0
PyTorch: 2.5.1
Datasets: 3.3.2
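To reproduce or compare against this training setup, it can help to check a local environment against the versions listed above. The sketch below uses only the standard library's `importlib.metadata`; the helper name `version_mismatches` and the choice to ignore local build suffixes such as `+cu121` are assumptions for illustration. PyTorch is installed under the distribution name `torch`.

```python
from importlib import metadata

# Framework versions listed under Training Details (PyPI distribution names).
TRAINING_VERSIONS = {
    "trl": "0.15.2",
    "transformers": "4.49.0",
    "torch": "2.5.1",
    "datasets": "3.3.2",
}


def version_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for every package that differs."""
    mismatches = {}
    for pkg, want in expected.items():
        try:
            # Drop local build tags such as "+cu121" before comparing.
            have = lookup(pkg).split("+")[0]
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            mismatches[pkg] = (want, have)
    return mismatches


if __name__ == "__main__":
    for pkg, (want, have) in version_mismatches(TRAINING_VERSIONS).items():
        print(f"{pkg}: trained with {want}, found {have or 'not installed'}")
```

The `lookup` parameter defaults to the real installed-package query but can be swapped for a stub, which keeps the comparison logic easy to test without installing anything.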

Potential Use Cases

Because it was trained with GRPO, this model is aimed at applications that benefit from step-by-step reasoning, such as:

Question Answering: Especially for questions requiring logical deduction.
Structured Data Interpretation: Analyzing and generating responses based on structured information.
Problem Solving: Tasks that involve breaking down problems and deriving solutions.
