Overview
lhkhiem28/Qwen2.5-3B-grpo is a 3.1-billion-parameter language model fine-tuned from the base Qwen/Qwen2.5-3B model. It was trained on the lhkhiem28/HA-GRPO-datasets dataset using the TRL framework.
Key Capabilities
- Enhanced Reasoning: The model was trained with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." This suggests a focus on improving the model's ability to handle complex logical and mathematical problems.
- Instruction Following: As a fine-tuned model, it is designed to follow instructions effectively, making it suitable for interactive applications.
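For interactive use, the model can be loaded with the Hugging Face transformers library like any other Qwen2.5 variant. The sketch below is a minimal, hedged example: the model id comes from this card, and it assumes the tokenizer ships the standard Qwen2.5 chat template (the `build_chat` helper is illustrative, not part of the model's API).

```python
MODEL_ID = "lhkhiem28/Qwen2.5-3B-grpo"

def build_chat(question):
    """Wrap a user question in the chat-message format
    expected by tokenizer.apply_chat_template."""
    return [{"role": "user", "content": question}]

if __name__ == "__main__":
    # Imported here because loading the weights requires the
    # transformers package and downloads ~6 GB of parameters.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    prompt = tokenizer.apply_chat_template(
        build_chat("What is 17 * 23?"),
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens.
    print(tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    ))
```

Sampling parameters (temperature, `max_new_tokens`) should be tuned for the task; reasoning-tuned models often benefit from a generous generation budget.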
Training Details
The model was trained using TRL (Transformer Reinforcement Learning) with the GRPO method. The training run can be further explored via the Weights & Biases link provided on the model card. Framework versions used include TRL 0.18.0.dev0, Transformers 4.52.0.dev0, PyTorch 2.6.0, Datasets 4.8.4, and Tokenizers 0.21.4.
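The distinguishing step in GRPO is how it scores completions: instead of a learned value function, it samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation. The sketch below illustrates that normalization in plain Python; it is a simplified illustration of the idea from the DeepSeekMath paper, not the TRL implementation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantage as used in GRPO: each sampled
    completion's reward is standardized against the mean and
    (population) std of its own sampling group, so no separate
    value network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards for 4 completions sampled from one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
# Advantages are centered: above-average completions get positive
# advantage, below-average ones negative.
```

These advantages then weight the policy-gradient update on each completion's tokens, pushing the model toward completions that scored above their group's average.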
Good For
- Applications requiring improved mathematical reasoning.
- Tasks benefiting from advanced logical problem-solving capabilities.
- Developers looking for a Qwen2.5-3B variant with specialized reasoning enhancements.