jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning is a 3.1-billion-parameter Qwen2.5-3B model fine-tuned by jaygala24 using Group Relative Policy Optimization (GRPO) with a KL penalty. The model is optimized for mathematical reasoning, trained on datasets such as GSM8K and MATH. With a context length of 32768 tokens, it is well suited to generating step-by-step mathematical solutions.
Model Overview
This model, jaygala24/Qwen2.5-3B-GRPO-KL-math-reasoning, is a specialized fine-tune of the 3.1-billion-parameter Qwen2.5-3B base model. Developed by jaygala24, its core differentiator is its training methodology: Group Relative Policy Optimization (GRPO) with a KL penalty. This reinforcement learning approach, implemented via the PipelineRL framework, is designed to enhance mathematical reasoning capabilities.
Key Capabilities & Training
- Mathematical Reasoning: Optimized for complex mathematical problems, as evidenced by its training on the `gsm8k_train` and `math_train` datasets.
- GRPO with KL Penalty: Uses a KL coefficient of `0.001` and a clip epsilon of `0.02` for the policy loss.
- Robust Training: Trained for 1500 steps with an effective batch size of 256 and a sequence length of 8192, using `bf16` precision and DeepSpeed ZeRO Stage 3 for efficiency.
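To make the training parameters above concrete, here is a minimal sketch of a GRPO-style loss with a KL penalty. It is a simplified, sequence-level illustration, not the PipelineRL implementation: real GRPO training operates token-wise over sampled completions, and the choice of the k3 KL estimator here is an assumption. The `kl_coef=0.001` and `clip_eps=0.02` defaults mirror the values listed above.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              kl_coef=0.001, clip_eps=0.02):
    """Sketch of a GRPO policy loss with a KL penalty.

    logp_new / logp_old / logp_ref: log-probs of each sampled completion
    under the current, sampling, and reference policies (shape [G]).
    rewards: scalar reward per completion in the group (shape [G]).
    """
    # Group-relative advantage: normalize rewards within the group,
    # so no separate value network is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Importance ratio between current and sampling policy.
    ratio = np.exp(logp_new - logp_old)
    # PPO-style clipped surrogate objective (epsilon = clip_eps).
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # k3 estimator of KL(new || ref), penalizing drift from the reference.
    kl = np.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    # Minimize negative surrogate plus the weighted KL penalty.
    return -(surrogate - kl_coef * kl).mean()
```

With identical current, old, and reference log-probs the ratio is 1 and the KL term vanishes, so the loss reduces to the negative mean advantage (zero by construction), which is a useful sanity check when wiring this into a trainer.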
Ideal Use Cases
- Solving Math Problems: Particularly effective for tasks requiring step-by-step mathematical reasoning and final answer extraction.
- Educational Tools: Can be integrated into applications that assist with mathematical problem-solving or provide detailed explanations.
- Research in RL for Reasoning: Serves as a practical example of GRPO application for improving LLM performance on specific cognitive tasks.
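The "final answer extraction" mentioned above can be sketched as a small post-processing step. This assumes the model emits GSM8K-style `#### <answer>` markers or LaTeX `\boxed{...}` answers; the model's actual output format may differ, so treat the patterns as illustrative.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of a step-by-step solution.

    Tries the GSM8K-style '#### <answer>' marker first, then a LaTeX
    '\\boxed{<answer>}' expression. Returns None if neither is found.
    """
    m = re.search(r"####\s*([^\n]+)", completion)
    if m:
        return m.group(1).strip()
    m = re.search(r"\\boxed\{([^{}]+)\}", completion)
    if m:
        return m.group(1).strip()
    return None
```

In an educational or evaluation pipeline, the extracted string can then be normalized (e.g. stripping commas and units) before comparing against a reference answer.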