philschmid/qwen-2.5-3b-r1-countdown
The philschmid/qwen-2.5-3b-r1-countdown model is a 3.1-billion-parameter instruction-tuned language model, fine-tuned from Qwen/Qwen2.5-3B-Instruct using TRL and GRPO on a dataset derived from the Countdown game. It targets mathematical reasoning and problem-solving tasks, particularly arithmetic puzzles that combine a given set of numbers to reach a target number.
Model Overview
philschmid/qwen-2.5-3b-r1-countdown is a specialized 3.1 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-3B-Instruct base model. Its primary focus is on mathematical reasoning, specifically designed to solve problems similar to the Countdown game, where the goal is to use a given set of numbers and basic arithmetic operations to reach a target number.
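To make the task concrete, here is a minimal brute-force solver for a simplified version of the game. It only searches left-to-right chains over all of the given numbers (the full Countdown game also allows arbitrary groupings and using a subset of the numbers), and is an illustration of the task, not part of the model or its training code:

```python
from itertools import permutations, product

# Integer-safe arithmetic ops; division is only allowed when it is exact.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b if b != 0 and a % b == 0 else None,
}

def solve_countdown(numbers, target):
    """Return a parenthesized expression reaching `target`, or None.

    Searches every ordering of `numbers` combined with every operator
    sequence, applied strictly left to right.
    """
    for perm in permutations(numbers):
        for op_seq in product(OPS, repeat=len(perm) - 1):
            acc, expr = perm[0], str(perm[0])
            for op, n in zip(op_seq, perm[1:]):
                acc = OPS[op](acc, n)
                if acc is None:
                    break
                expr = f"({expr} {op} {n})"
            if acc == target:
                return expr
    return None

print(solve_countdown([3, 7, 2], 20))  # e.g. ((3 + 7) * 2)
```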
Key Capabilities
- Mathematical Reasoning: Demonstrates proficiency in solving arithmetic puzzles, showing step-by-step thought processes.
- GRPO Training: Utilizes GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper, to enhance its reasoning abilities.
- Instruction Following: Capable of following detailed instructions for problem-solving, including structured output formats with `<think>` and `<answer>` tags.
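A downstream consumer of this model typically needs to separate the reasoning trace from the final answer. A small sketch of how that parsing could look (the sample completion below is illustrative, not actual model output):

```python
import re

def parse_completion(text):
    """Split a tagged completion into (reasoning, answer).

    Returns None for either part when its tag pair is missing.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

# Hypothetical completion in the model's expected output format.
completion = (
    "<think>Try 95 - 51 = 44, which matches the target.</think>\n"
    "<answer>95 - 51</answer>"
)
reasoning, answer = parse_completion(completion)
print(answer)  # 95 - 51
```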
Training Details
This model was trained using the TRL library and the GRPO technique on a dataset derived from the Countdown game. The training procedure aims to reproduce the "aha" moment of mathematical discovery, as detailed in a blog post by the model's creator. It uses a 32,768-token context length, allowing for long reasoning chains.
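GRPO optimizes the policy against programmatically verifiable rewards. Below is a sketch of the kind of equation-correctness reward such a setup could use for Countdown: it checks that the `<answer>` expression uses exactly the given numbers and evaluates to the target. The function name and exact scoring are assumptions for illustration, not the precise reward functions from this training run:

```python
import re

def equation_reward(completion, numbers, target):
    """Return 1.0 if the <answer> equation is valid, else 0.0.

    Valid means: answer tags present, only arithmetic characters used,
    each given number used exactly once, and the result equals `target`.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Restrict to digits, basic operators, parentheses, and whitespace.
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return 0.0
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        result = eval(equation, {"__builtins__": None}, {})
    except Exception:
        return 0.0
    return 1.0 if abs(result - target) < 1e-6 else 0.0

print(equation_reward("<answer>(100 - 75) * 2</answer>", [100, 75, 2], 50))  # 1.0
```

During training, a reward like this is computed for each sampled completion in a group, and GRPO updates the policy toward completions that score above the group average.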