Name: ishagarg1103/countdown-qwen2.5-3b-grpo-mi300x API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: ishagarg1103

Overview

This model, ishagarg1103/countdown-qwen2.5-3b-grpo-mi300x, is a 3.1-billion-parameter Qwen2.5 base model trained with Group Relative Policy Optimization (GRPO). The training had two goals: to test whether GRPO could induce Countdown-solving and search behaviors in the model, and to measure how well training on 3-number problems transfers to harder 4-number problems.
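
For readers unfamiliar with GRPO, the core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value model is needed. The sketch below illustrates that normalization step only. It is not taken from this model's training code: the function name, the eps constant, and the binary reward (1 for a solved puzzle, 0 otherwise) are assumptions made for illustration, and real implementations differ in details such as whether they divide by the standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's rollouts against their own group.

    Hypothetical helper: rollouts that beat the group mean get a positive
    advantage, the rest get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout has the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one puzzle, with an assumed binary reward (1 = solved).
advantages = group_relative_advantages([1, 0, 0, 1])
```

When every rollout in a group receives the same reward, all advantages come out as zero, so that group contributes no learning signal.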

Key Capabilities

Countdown Problem Solving: The model is fine-tuned to solve Countdown-style numerical puzzles, in which a target number must be reached by combining a given set of numbers with arithmetic operations.
Search Behavior Elicitation: GRPO training elicited search-like behavior, as evidenced by "try", "too big", "combine", and "target" markers appearing in correct rollouts.
Performance on 3-Number Tasks: The model achieved a 68.46% solve rate on held-out 3-number Countdown tasks.
Transfer Learning to 4-Number Tasks: Although trained on 3-number tasks, the model achieved a 15.08% solve rate on held-out 4-number tasks, which indicates partial transfer to harder problems.
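
To make the task and the solve-rate figures concrete, here is a minimal sketch of how a Countdown answer could be checked. It is an illustration, not the evaluation code behind the numbers above. The function name check_countdown, the restriction to +, -, * and /, and the default rule that every given number is used exactly once are all assumptions; the model card does not spell out the exact rules.

```python
import ast
from collections import Counter
from fractions import Fraction

# Binary operators assumed to be allowed in a Countdown answer.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording each integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def check_countdown(expression, numbers, target, require_all=True):
    """Return True if `expression` reaches `target` using only `numbers`.

    Hypothetical checker: with require_all=True every given number must be
    used exactly once; otherwise each may be used at most once.
    """
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    used_counts, available = Counter(used), Counter(numbers)
    if require_all:
        numbers_ok = used_counts == available
    else:
        numbers_ok = not (used_counts - available)
    return numbers_ok and value == target


solved = check_countdown("(25 - 5) * 3", [25, 5, 3], 60)
```

Under these assumptions, a solve rate is the fraction of held-out puzzles for which the model's final expression passes such a check. Exact rational arithmetic is used so that intermediate divisions do not introduce floating-point error.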

Good For

Research into Reinforcement Learning for Reasoning: Ideal for researchers exploring how GRPO can be used to instill specific problem-solving and search strategies in large language models.
Numerical Puzzle Solving: Suitable for applications requiring a model to perform arithmetic reasoning and explore solution spaces for numerical puzzles.
Understanding Model Generalization: Useful for studying the limits and effectiveness of transfer learning from simpler to more complex problem variations within a specific domain.
