zhaohq/PureRL-1.5B-v7-s2-margin-maskoff

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 20, 2026Architecture:Transformer Warm

The zhaohq/PureRL-1.5B-v7-s2-margin-maskoff model is a 1.5 billion parameter language model fine-tuned using the TRL framework. It was trained with GRPO, a method specifically designed for mathematical reasoning, as introduced in the DeepSeekMath paper. This model is optimized for tasks requiring advanced mathematical problem-solving and logical deduction, making it suitable for applications in scientific computing and quantitative analysis.

Loading preview...

Overview

The zhaohq/PureRL-1.5B-v7-s2-margin-maskoff is a 1.5 billion parameter language model, fine-tuned using the Transformer Reinforcement Learning (TRL) framework. This model incorporates the GRPO (Generalized Reinforcement Learning with Policy Optimization) training method, which was originally introduced in the DeepSeekMath paper to enhance mathematical reasoning capabilities in large language models.

Key Capabilities

  • Enhanced Mathematical Reasoning: Leverages the GRPO training method to improve performance on complex mathematical problems and logical deduction tasks.
  • Reinforcement Learning Fine-tuning: Benefits from the TRL framework for robust and efficient fine-tuning.
  • Moderate Parameter Count: At 1.5 billion parameters, it offers a balance between performance and computational efficiency compared to larger models.

Training Details

The model's training procedure utilized GRPO, as detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The development environment included TRL 0.16.0.dev0, Transformers 4.48.3, Pytorch 2.5.1, Datasets 4.0.0, and Tokenizers 0.21.1.

Good For

  • Applications requiring strong mathematical problem-solving.
  • Tasks involving logical reasoning and quantitative analysis.
  • Researchers and developers interested in models fine-tuned with advanced reinforcement learning techniques like GRPO.