yasserrmd/Coder-GRPO-3B

Parameters: 3.1B
Precision: BF16
Context length: 32,768 tokens
Updated: Feb 8, 2025
License: apache-2.0
Overview

Coder-GRPO-3B is a 3-billion-parameter model developed by yasserrmd, fine-tuned from Qwen/Qwen2.5-3B-Instruct. It specializes in code reasoning and generation, aiming to produce concise, correct code with clear explanations. The model was trained with Group Relative Policy Optimization (GRPO) using Unsloth and TRL, focusing on high-signal code tasks.
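As a Qwen2.5 derivative, the model expects a ChatML-style prompt. A minimal sketch of assembling such a prompt by hand (the role names and special tokens follow the Qwen2.5 convention; in practice, prefer the tokenizer's `apply_chat_template`):

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a ChatML-style prompt as used by the Qwen2.5 family.

    Hand-rolled for illustration only; with transformers installed,
    prefer tokenizer.apply_chat_template(messages, add_generation_prompt=True).
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"  # generation continues from here
    )

prompt = build_chatml_prompt(
    "You are a concise coding assistant.",
    "Write a Python function that reverses a string.",
)
```

The trailing `<|im_start|>assistant` turn is left open so the model's completion fills in the assistant response.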

Key Capabilities

  • Code Generation & Refactoring: Writes new code and refactors existing code.
  • Bug Fixing: Identifies and fixes bugs with minimal changes.
  • Code Explanation: Gives clear, concise explanations of code.
  • Testing & Docstrings: Writes unit tests and docstrings.
  • Lightweight Agent Use: Suitable for function calling and tool use.
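For the agent use case above, a host application typically parses the model's tool calls and dispatches them to local functions. A hypothetical dispatch loop, assuming the model emits a tool call as a JSON object (the format, tool names, and schemas here are illustrative, not a documented contract of this model):

```python
import json

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def dispatch_tool_call(model_output: str):
    """Parse a JSON tool call like {"name": ..., "arguments": {...}}
    and invoke the matching registered Python callable."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

result = dispatch_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
```

A real agent loop would add validation of the tool name and arguments before executing anything the model requests.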

Training Details

The model was fine-tuned on the glaiveai/glaive-code-assistant dataset, which contains code tasks with stepwise targets. Training emphasized short-horizon rewards for compilation success, test pass rate, code style, and helpfulness. A notable feature is the use of a <think> block as an internal scratchpad; the model is aligned never to reveal its contents, keeping outputs concise and direct. The model is also designed to avoid generating secrets, credentials, or unsafe code patterns.
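The compilation-style reward described above can be sketched as a simple scoring function. This is a toy illustration of the idea of a short-horizon reward, not the actual GRPO reward used in training (which is not published):

```python
import ast

def compile_reward(code: str) -> float:
    """Toy short-horizon reward: 1.0 if the completion parses as valid
    Python, plus a small bonus for brevity. Illustrative only; the real
    training reward (compilation, tests, style, helpfulness) is not public.
    """
    try:
        ast.parse(code)
    except SyntaxError:
        return 0.0
    score = 1.0
    if len(code.splitlines()) <= 20:  # bonus for concise solutions
        score += 0.2
    return score

good = compile_reward("def add(a, b):\n    return a + b\n")
bad = compile_reward("def add(a, b:\n    return a + b\n")
```

In a GRPO setup, a function like this would score each completion in a sampled group, and the trainer would update the policy toward completions with above-group-average reward.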