zhaohq/PureRL-1.5B-v6d4-lam01-sigmoid-maskoff-acc05
The PureRL-1.5B-v6d4-lam01-sigmoid-maskoff-acc05 model by zhaohq is a 1.5 billion parameter language model, fine-tuned from Qwen/Qwen2.5-Math-1.5B with a 32768 token context length. It was trained using GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced to improve mathematical reasoning in language models. The model is optimized for mathematical problem-solving tasks.
Model Overview
The PureRL-1.5B-v6d4-lam01-sigmoid-maskoff-acc05 is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned variant of the Qwen/Qwen2.5-Math-1.5B base model, inheriting its 32768 token context length.
Key Capabilities & Training
This model's primary differentiator is its training methodology. It was fine-tuned using GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO samples a group of responses for each prompt and scores each response relative to the rest of its group, which removes the need for a separate value (critic) model. The training aims to improve the model's performance on mathematical reasoning tasks.
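The group-relative scoring at the core of GRPO can be illustrated with a minimal sketch. It assumes outcome-level rewards (one scalar per sampled response) and normalizes them by the group mean and standard deviation; the function name, the `eps` term, and the use of the population standard deviation are illustrative choices, not details taken from this model's training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation. `eps` (an assumption of this sketch) avoids
    division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math problem: two correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, so the policy update pushes probability toward responses that beat their own group's average.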
Technical Details
- Base Model: Qwen/Qwen2.5-Math-1.5B
- Parameter Count: 1.5 Billion
- Context Length: 32768 tokens
- Training Framework: TRL (Transformer Reinforcement Learning)
- Training Method: GRPO, focused on mathematical reasoning enhancement.
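For orientation, a GRPO run with TRL on this base model might be set up roughly as follows. This is a sketch under stated assumptions, not the author's recipe: the reward function `boxed_accuracy_reward`, the `answer` dataset column, and the `output_dir` value are hypothetical, and the actual reward design and hyperparameters used for this model are not given on this page. The sketch assumes a dataset with a plain-text `prompt` column.

```python
import re


def boxed_accuracy_reward(completions, answer, **kwargs):
    """Hypothetical reward: 1.0 if the last \\boxed{...} matches the reference.

    TRL passes the sampled completions plus any extra dataset columns (here an
    assumed `answer` column) to the reward function as keyword arguments.
    """
    rewards = []
    for text, gold in zip(completions, answer):
        found = re.findall(r"\\boxed\{([^{}]*)\}", text)
        correct = bool(found) and found[-1].strip() == str(gold).strip()
        rewards.append(1.0 if correct else 0.0)
    return rewards


def build_trainer(train_dataset, output_dir="grpo-sketch"):
    """Build a GRPOTrainer with default settings (illustrative only)."""
    # Imported lazily so the reward function can be used without TRL installed.
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(output_dir=output_dir)
    return GRPOTrainer(
        model="Qwen/Qwen2.5-Math-1.5B",
        reward_funcs=boxed_accuracy_reward,
        args=config,
        train_dataset=train_dataset,
    )
```

Calling `build_trainer(dataset).train()` would start training; the reward function can be checked on its own before any model is loaded.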
Recommended Use Cases
This model is intended for applications that require mathematical problem-solving and step-by-step reasoning. Its GRPO-based fine-tuning targets improved handling of complex mathematical queries and more accurate final answers. Candidate uses include educational tools, scientific research assistance, and other domains where precise mathematical reasoning matters.
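A minimal sketch of querying the model with the Hugging Face `transformers` library is shown below. The prompt wording (asking for a `\boxed{}` final answer), the helper names `build_prompt` and `solve`, and the `max_new_tokens` default are assumptions made for illustration; the page does not specify a recommended prompt format.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v6d4-lam01-sigmoid-maskoff-acc05"


def build_prompt(question):
    """Wrap a question in a plain-text math prompt (assumed format)."""
    return (
        f"{question}\n"
        "Please reason step by step, and put your final answer within \\boxed{}."
    )


def solve(question, max_new_tokens=512):
    """Generate a solution; downloads the model weights on first use."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, `print(solve("What is the sum of the first 10 positive integers?"))` would print the model's worked solution.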