nimishbongale/qwen-2.5-0.5b-grpo-rlcot-gsm8k

Hosted on Hugging Face · Text generation · Model size: 0.5B · Quantization: BF16 · Context length: 32K · License: apache-2.0 · Architecture: Transformer · Open weights

nimishbongale/qwen-2.5-0.5b-grpo-rlcot-gsm8k is a 0.5-billion-parameter model based on the Qwen 2.5 architecture and fine-tuned for mathematical reasoning. It was trained with GRPO (Group Relative Policy Optimization) and RLCoT (reinforcement learning from chain-of-thought), targeting performance on the GSM8K dataset, and its training curves suggest headroom for higher accuracy on arithmetic and word problems.


nimishbongale/qwen-2.5-0.5b-grpo-rlcot-gsm8k Overview

This model is a compact 0.5-billion-parameter variant of the Qwen 2.5 architecture, developed by nimishbongale. It has been fine-tuned with a combination of Group Relative Policy Optimization (GRPO) and Reinforcement Learning from Chain-of-Thought (RLCoT). The primary objective of this training regimen is to strengthen mathematical reasoning, particularly on the GSM8K dataset.
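As a quick orientation, here is a minimal inference sketch using the standard transformers API. It assumes the fine-tune retains the base Qwen 2.5 chat template; check the repository's tokenizer configuration before relying on that.

```python
# Minimal inference sketch. Assumes the checkpoint keeps the base Qwen 2.5
# chat template -- verify against the repo's tokenizer config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nimishbongale/qwen-2.5-0.5b-grpo-rlcot-gsm8k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# A GSM8K-style word problem as a single user turn.
messages = [{
    "role": "user",
    "content": "Natalia sold clips to 48 of her friends in April, and then "
               "she sold half as many clips in May. How many clips did "
               "Natalia sell altogether in April and May?",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```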

Key Capabilities

  • Mathematical Reasoning: Optimized for solving arithmetic and word problems, as evidenced by its GSM8K training (see the answer-extraction sketch after this list).
  • Efficient Size: At 0.5 billion parameters, it has a small memory and compute footprint while retaining specialized reasoning ability.
  • Context Length: Supports a 32K-token (32,768) context window, useful for long multi-step problems.
  • Training Potential: The training run to date suggests headroom, with further accuracy gains expected from additional epochs.
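GSM8K reference solutions end with a `#### <number>` line, and evaluation typically compares only that final number. Whether this fine-tune emits the same terminator is an assumption; the hypothetical extract_final_answer helper below sketches the usual extraction, with a fallback to the last number in the text.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final numeric answer from a GSM8K-style completion.

    Assumes the model follows GSM8K's '#### <number>' convention; if the
    marker is absent, falls back to the last number in the text.
    """
    marker = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", completion)
    if marker:
        return marker.group(1).replace(",", "")
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else None

# Example: both reasoning text and the terminator are handled.
assert extract_final_answer("48 + 24 = 72 clips.\n#### 72") == "72"
```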

Good For

  • Educational Applications: Well suited to mathematical problem-solving features in educational technology.
  • Research in RLCoT/GRPO: A useful base for researchers exploring reinforcement-learning techniques for reasoning tasks (see the training sketch after this list).
  • Resource-Constrained Environments: Its small size makes it suitable for deployment where compute is limited but mathematical reasoning is still required.
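For the research use case above, a GRPO run on GSM8K can be outlined with TRL's GRPOTrainer. This is a sketch under assumptions: the base model name, the exact-match reward, and the hyperparameters are illustrative, not the author's actual recipe.

```python
# Illustrative GRPO-on-GSM8K sketch using TRL's GRPOTrainer.
# Base model, reward, and hyperparameters are assumptions, not the author's recipe.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {
    "prompt": x["question"],
    "answer": x["answer"].split("####")[-1].strip(),  # keep only the gold number
})

def correctness_reward(completions, answer, **kwargs):
    # 1.0 when the completion's final '#### <number>' matches the gold answer.
    rewards = []
    for completion, gold in zip(completions, answer):
        predicted = completion.split("####")[-1].strip().replace(",", "")
        rewards.append(1.0 if predicted == gold else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # assumed base; the card does not name it
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="qwen2.5-0.5b-grpo-gsm8k", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```

GRPO samples several completions per prompt (num_generations) and uses reward differences within each group as the advantage signal, which is why a simple binary correctness reward is enough to provide a learning signal.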