philschmid/qwen-2.5-3b-r1-countdown

Text Generation · Concurrency Cost: 1 · Model Size: 3.1B · Quant: BF16 · Ctx Length: 32k · Published: Jan 29, 2025 · Architecture: Transformer

The philschmid/qwen-2.5-3b-r1-countdown model is a 3.1 billion parameter instruction-tuned language model, fine-tuned from Qwen/Qwen2.5-3B-Instruct. It was specifically trained using TRL and GRPO on the Countdown game dataset. This model excels at mathematical reasoning and problem-solving tasks, particularly those involving arithmetic operations to reach a target number.


Model Overview

philschmid/qwen-2.5-3b-r1-countdown is a specialized 3.1 billion parameter language model, fine-tuned from the Qwen/Qwen2.5-3B-Instruct base model. Its primary focus is on mathematical reasoning, specifically designed to solve problems similar to the Countdown game, where the goal is to use a given set of numbers and basic arithmetic operations to reach a target number.
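To make the task concrete, here is a minimal brute-force Countdown solver. It is purely illustrative (the function name and approach are ours, not part of the model or its dataset): it tries every ordering of the numbers and every operator choice, evaluating left to right, until an expression hits the target.

```python
from itertools import permutations, product

def solve_countdown(numbers, target):
    """Brute-force search for an arithmetic expression over all the given
    numbers that reaches the target (illustrative sketch of the task)."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    for perm in permutations(numbers):
        for chosen in product(ops, repeat=len(numbers) - 1):
            value, expr, ok = perm[0], str(perm[0]), True
            for op, num in zip(chosen, perm[1:]):
                if op == "/" and num == 0:
                    ok = False
                    break
                value = ops[op](value, num)
                # Parenthesize so the string evaluates the same way.
                expr = f"({expr} {op} {num})"
            if ok and abs(value - target) < 1e-9:
                return expr
    return None
```

For example, `solve_countdown([3, 5, 7], 36)` finds an expression such as `((5 + 7) * 3)`. The model is trained to produce this kind of solution via reasoning rather than exhaustive search.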

Key Capabilities

  • Mathematical Reasoning: Demonstrates proficiency in solving arithmetic puzzles, showing step-by-step thought processes.
  • GRPO Training: Utilizes GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper, to enhance its reasoning abilities.
  • Instruction Following: Capable of following detailed instructions for problem-solving, including structured output formats like <think> and <answer> tags.
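Because the model is trained to wrap its reasoning in `<think>` tags and its final equation in `<answer>` tags, downstream code can extract the answer with a simple regex. The response string below is a hypothetical example of the trained format, not actual model output:

```python
import re

# Hypothetical response in the model's trained output format.
response = (
    "<think>We need 36 from 3, 5, 7. 5 + 7 = 12, and 12 * 3 = 36.</think>"
    "<answer>(5 + 7) * 3</answer>"
)

def extract_answer(text):
    """Pull the equation out of the <answer> tag; None if the format is off."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None

equation = extract_answer(response)  # "(5 + 7) * 3"
```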

Training Details

This model was trained with the TRL library using the GRPO technique on a dataset derived from the Countdown game. The training procedure aims to reproduce the "aha" moment of mathematical discovery, as detailed in a blog post by the model's creator. The model supports a 32,768-token (32k) context length, allowing for long reasoning chains.
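GRPO training of this kind is typically driven by rule-based reward functions rather than a learned reward model. The sketch below shows two such rewards for the Countdown setting, in the spirit of the creator's setup but simplified and not his exact code: one scores whether the completion follows the `<think>`/`<answer>` format, the other scores whether the answer equation uses exactly the given numbers and evaluates to the target.

```python
import re

def format_reward(completion):
    """1.0 if the completion matches <think>...</think><answer>...</answer>,
    else 0.0 (simplified sketch of a format reward)."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def equation_reward(completion, numbers, target):
    """1.0 if the <answer> equation uses exactly the given numbers and
    evaluates to the target, else 0.0 (simplified sketch)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Allow only digits, operators, parentheses, and whitespace before eval.
    if not re.fullmatch(r"[\d+\-*/() .]+", equation):
        return 0.0
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return 0.0
    try:
        return 1.0 if abs(eval(equation) - target) < 1e-6 else 0.0
    except ZeroDivisionError:
        return 0.0
```

During GRPO training, rewards like these are computed per sampled completion, and the policy is updated toward completions that score higher than the group average.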