zhaohq/PureRL-1.5B-v6d5-lam01-sigmoid-maskon-acc10

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 18, 2026 | Architecture: Transformer | Warm

PureRL-1.5B-v6d5-lam01-sigmoid-maskon-acc10 is a 1.5-billion-parameter language model developed by zhaohq, fine-tuned from Qwen/Qwen2.5-Math-1.5B with the GRPO training method. It targets mathematical reasoning and problem-solving tasks.


Overview

PureRL-1.5B-v6d5-lam01-sigmoid-maskon-acc10 is a 1.5-billion-parameter language model that zhaohq fine-tuned from the Qwen/Qwen2.5-Math-1.5B base model using the TRL framework.
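
A minimal sketch of how the model might be loaded for inference with Transformers, assuming the weights are available on the Hugging Face Hub under the repository ID in the title. The prompt format in `build_prompt` and the helper names `build_prompt` and `solve` are hypothetical; the card does not specify a prompt template.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v6d5-lam01-sigmoid-maskon-acc10"


def build_prompt(question: str) -> str:
    """Hypothetical plain-text prompt; the card does not define a format."""
    return f"Question: {question}\nPlease reason step by step.\nAnswer:"


def solve(question: str, max_new_tokens: int = 512) -> str:
    """Generate a completion for one math question (downloads the weights)."""
    # Imported lazily so build_prompt works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # BF16 matches the quantization listed on the card.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `solve("What is 12 * 13?")` would download the checkpoint on first use and return the generated text; output quality depends on how closely the prompt matches the format used during training, which is not documented here.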

Key Capabilities

  • Mathematical Reasoning: This model is specifically enhanced for mathematical tasks, building on its Qwen2.5-Math foundation.
  • GRPO Training: The model was trained with GRPO (Group Relative Policy Optimization), the method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). GRPO samples a group of responses per prompt and scores each one relative to the rest of its group, with the aim of improving performance on complex mathematical problems.
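
The core idea of GRPO in the cited DeepSeekMath paper is a group-relative advantage: several responses are sampled for the same prompt, and each response's reward is normalized by the mean and standard deviation of its group. The sketch below illustrates only that normalization step. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration, not details taken from this model's training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's sampled rewards against their own group.

    rewards: one scalar reward per sampled response to the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one problem, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average get a positive advantage and the others a negative one, so no separate value model is needed to provide a baseline. When all responses in a group receive the same reward, every advantage is zero and that group contributes no learning signal.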

Training Details

  • Base Model: Fine-tuned from Qwen/Qwen2.5-Math-1.5B.
  • Frameworks: Trained with TRL (0.16.0.dev0), Transformers (4.48.3), PyTorch (2.5.1), Datasets (4.0.0), and Tokenizers (0.21.1).
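
For anyone trying to reproduce the training environment, a small check like the following can compare installed packages against the versions listed above. It is a sketch under assumptions: the helper names are hypothetical, and the PyPI distribution names (`torch` for PyTorch, and so on) are the usual ones rather than something the card states.

```python
from importlib import metadata

# Versions as listed in the card; keys are assumed PyPI distribution names.
CARD_VERSIONS = {
    "trl": "0.16.0.dev0",
    "transformers": "4.48.3",
    "torch": "2.5.1",
    "datasets": "4.0.0",
    "tokenizers": "0.21.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected=CARD_VERSIONS, lookup=installed_version):
    """Map each package whose version differs to a (wanted, found) pair."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Drop local build suffixes such as "+cu121" before comparing.
        normalized = found.split("+")[0] if found else None
        if normalized != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

`version_mismatches()` returns an empty dictionary when the environment matches the card. An exact match is not necessarily required for inference; it matters mostly when re-running training.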

Good For

  • Applications requiring strong mathematical reasoning abilities.
  • Research and development in reinforcement learning for language models, particularly those exploring GRPO.