CM/Qwen2.5-1.5B-Open-R1-Code-GRPO

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quantization: BF16 | Context Length: 32k | Published: Feb 21, 2025 | Architecture: Transformer | Status: Warm

CM/Qwen2.5-1.5B-Open-R1-Code-GRPO is a 1.5 billion parameter language model developed by CM and fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It is optimized for code generation and problem-solving using GRPO (Group Relative Policy Optimization) training. The model targets verifiable coding tasks, making it suitable for applications that require reliable code output within its 32,768-token context length.


Model Overview

CM/Qwen2.5-1.5B-Open-R1-Code-GRPO is a 1.5 billion parameter language model, fine-tuned by CM from the base Qwen/Qwen2.5-1.5B-Instruct architecture. It is designed for code-related tasks, specifically trained on the open-r1/verifiable-coding-problems-python-10k dataset.

Key Capabilities

  • Code Generation: Specialized in generating Python code for verifiable problems (see the loading sketch after this list).
  • GRPO Training: Utilizes the GRPO (Group Relative Policy Optimization) method, as introduced in the DeepSeekMath paper, to enhance its reasoning and problem-solving abilities in a coding context.
  • Context Length: Supports a substantial context window of 32768 tokens, allowing for processing longer code snippets or problem descriptions.
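As a concrete starting point, the sketch below shows how a model with this repo id could be loaded and prompted through the Hugging Face transformers library. The repo id and base-model lineage come from this card; the prompt, sampling settings, and token budget are illustrative assumptions rather than documented defaults.

```python
# Minimal inference sketch (assumed usage; prompt and sampling settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CM/Qwen2.5-1.5B-Open-R1-Code-GRPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The base model is instruction-tuned, so prompts go through the chat template.
messages = [
    {"role": "user", "content": "Write a Python function that returns the n-th Fibonacci number."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```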

Training Details

The model was trained using the TRL (Transformer Reinforcement Learning) framework. The GRPO method, a key aspect of its training, is detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models".
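The card does not publish the actual training script, so the following is only a minimal sketch of how a GRPO fine-tune of Qwen/Qwen2.5-1.5B-Instruct on open-r1/verifiable-coding-problems-python-10k might be set up with TRL's GRPOTrainer. The reward function, hyperparameters, and the assumption that the dataset exposes a plain-text prompt column are all illustrative, not the authors' configuration.

```python
# Illustrative GRPO fine-tuning sketch with TRL; not the authors' actual script.
# Reward function, hyperparameters, and dataset handling are assumptions.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("open-r1/verifiable-coding-problems-python-10k", split="train")

def reward_contains_code_block(completions, **kwargs):
    # Toy reward: favor completions containing a fenced Python code block.
    # A real setup for verifiable problems would execute the code against test cases.
    return [1.0 if "```python" in c else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="Qwen2.5-1.5B-Open-R1-Code-GRPO",
    num_generations=8,            # completions sampled per prompt for group-relative advantages
    max_completion_length=1024,
    per_device_train_batch_size=8,
    learning_rate=1e-6,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # base model named on this card
    reward_funcs=reward_contains_code_block,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```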

Good For

  • Automated Code Generation: Generating Python code solutions for defined problems.
  • Coding Assistance: Aiding developers by providing code suggestions or completing functions.
  • Educational Tools: Creating verifiable coding exercises or solutions.