ishagarg1103/countdown-qwen2.5-3b-grpo-mi300x
The ishagarg1103/countdown-qwen2.5-3b-grpo-mi300x model is a 3.1 billion parameter Qwen2.5 base model fine-tuned with GRPO (Group Relative Policy Optimization) to solve Countdown-style number puzzles. It reaches a 68.46% solve rate on 3-number tasks and a 15.08% solve rate on 4-number tasks. The model is specialized for arithmetic reasoning and for exploring candidate solution paths in constrained numerical puzzles.
Overview
This model, ishagarg1103/countdown-qwen2.5-3b-grpo-mi300x, is a 3.1 billion parameter Qwen2.5 base model trained with Group Relative Policy Optimization (GRPO). The training had two goals: to test whether GRPO can induce Countdown-solving and search behaviors in the model, and to measure how well training on 3-number problems transfers to harder 4-number problems.
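For readers unfamiliar with GRPO, its core idea is that each rollout is scored relative to the other rollouts sampled for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step. The binary reward (1.0 for a solved puzzle, 0.0 otherwise), the group of four rollouts, and the function name are illustrative assumptions; the card does not describe the actual reward or training configuration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is used here; implementations may differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts for one puzzle, two of them correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct rollouts get positive values, incorrect ones negative
```

When every rollout in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.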
Key Capabilities
- Countdown Problem Solving: The model is specifically fine-tuned to tackle numerical Countdown-style puzzles, where a target number must be reached using a set of given numbers and arithmetic operations.
- Search Behavior Elicitation: Training with GRPO successfully elicited search-like behaviors, as evidenced by the presence of "try", "too big", "combine", and "target" markers in correct rollouts.
- Performance on 3-Number Tasks: Achieved a 68.46% solve rate on held-out 3-number Countdown tasks, demonstrating strong proficiency in simpler problem instances.
- Transfer to 4-Number Tasks: Although trained on 3-number tasks, the model achieved a 15.08% solve rate on held-out 4-number tasks, which suggests limited but measurable generalization to harder problems.
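To make the solve-rate and marker figures above concrete, here is a minimal sketch of how a Countdown answer could be checked and how search markers could be counted in a rollout. It is not the evaluation code behind the reported numbers. The rule that every given number is used exactly once, the restriction to `+`, `-`, `*`, `/`, and all function names are assumptions made for illustration.

```python
import ast
from collections import Counter
from fractions import Fraction

# Binary operators allowed in a candidate expression (assumed rule set).
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}

# Marker strings the card reports seeing in correct rollouts.
MARKERS = ("try", "too big", "combine", "target")


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording each integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def is_valid_solution(expression, numbers, target):
    """Return True if the expression uses each number once and hits the target."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target


def solve_rate(results):
    """Percentage of True entries in a list of per-task outcomes."""
    return 100.0 * sum(results) / len(results)


def count_markers(rollout_text):
    """Crude substring count of each search marker in a rollout."""
    lowered = rollout_text.lower()
    return {marker: lowered.count(marker) for marker in MARKERS}


print(is_valid_solution("25 + 7 * 3", [3, 7, 25], 46))
print(solve_rate([True, False, True, True]))
print(count_markers("Try 25 + 7 = 32, too big. Try 25 - 7."))
```

Parsing with `ast` instead of calling `eval` keeps the checker from executing arbitrary model output, and `Fraction` avoids floating-point error when an expression contains division.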
Good For
- Research into Reinforcement Learning for Reasoning: A reference point for researchers exploring how GRPO can instill specific problem-solving and search strategies in language models.
- Numerical Puzzle Solving: Suitable for applications requiring a model to perform arithmetic reasoning and explore solution spaces for numerical puzzles.
- Understanding Model Generalization: Useful for studying the limits and effectiveness of transfer learning from simpler to more complex problem variations within a specific domain.