abhi14/test-grpo-delete-me
The abhi14/test-grpo-delete-me model is a 1.5 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-1.5B-Instruct. It was trained with the TRL library using GRPO, a reinforcement-learning method designed to enhance mathematical reasoning. The model targets tasks that require mathematical problem-solving and logical deduction.
Overview
This model, abhi14/test-grpo-delete-me, is a 1.5 billion parameter language model fine-tuned from the Qwen/Qwen2.5-1.5B-Instruct base model. It was trained with the TRL framework.
Key Differentiator: GRPO Training
A significant aspect of this model is its training methodology, which incorporates GRPO (Group Relative Policy Optimization). GRPO was introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". Instead of training a separate value function (critic), it samples a group of completions per prompt and normalizes each completion's reward against the group's statistics, which makes it a good fit for verifiable reward signals such as mathematical problem-solving.
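The group-relative idea above can be sketched in a few lines. This is a minimal illustration of the advantage computation only (the policy-gradient update and KL term are omitted); the reward values are made up for the example:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: each completion's reward is normalized
    against the mean and std of its own sampled group, so no learned
    value function (critic) is needed as a baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All completions scored identically: no preference signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Four completions sampled for the same math prompt, scored by a
# (hypothetical) correctness-based reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions above the group mean get positive advantages and are reinforced; those below are suppressed, all without a critic network.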
Training Frameworks
- TRL: Version 1.2.0
- Transformers: Version 5.6.2
- PyTorch: Version 2.11.0
- Datasets: Version 4.8.4
- Tokenizers: Version 0.22.2
Potential Use Cases
Given its fine-tuning from an instruction-following model and the application of GRPO, this model is likely well-suited for:
- Tasks requiring mathematical reasoning.
- Instruction-following in contexts that benefit from logical deduction.
- Applications where a smaller, specialized model for numerical or logical problems is preferred.
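For such applications, the model can be loaded through the standard Transformers text-generation pipeline. A minimal sketch is below; the model id comes from this card, but the generation settings (`max_new_tokens`, greedy decoding) are illustrative choices, not values from the card, and `transformers` is imported lazily inside the function:

```python
def solve_problem(prompt: str, model_id: str = "abhi14/test-grpo-delete-me") -> str:
    """Generate a completion with the fine-tuned model.

    Sampling settings here are illustrative; tune them for your task.
    Requires the `transformers` package (and a backend such as PyTorch).
    """
    from transformers import pipeline  # lazy import: only needed at call time

    generator = pipeline("text-generation", model=model_id)
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    return out[0]["generated_text"]
```

Downloading a 1.5B-parameter checkpoint requires a few gigabytes of disk and, for comfortable latency, a GPU; CPU inference works but is slow.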