zhaohq/PureRL-1.5B-v7-s2-l2-maskon

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:1.5BQuant:BF16Ctx Length:32kPublished:May 20, 2026Architecture:Transformer Warm

PureRL-1.5B-v7-s2-l2-maskon is a 1.5 billion parameter language model developed by zhaohq, fine-tuned using the TRL framework. This model was trained with GRPO, a method specifically designed to enhance mathematical reasoning capabilities. With a context length of 32768 tokens, it is optimized for tasks requiring advanced mathematical problem-solving and logical deduction.

Loading preview...

Overview

zhaohq/PureRL-1.5B-v7-s2-l2-maskon is a 1.5 billion parameter language model, fine-tuned using the TRL (Transformer Reinforcement Learning) framework. This model leverages the GRPO (Generative Reinforcement Learning with Policy Optimization) training method, which is detailed in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". The training procedure utilized specific versions of TRL (0.16.0.dev0), Transformers (4.57.6), Pytorch (2.10.0), Datasets (4.8.5), and Tokenizers (0.22.2).

Key Capabilities

  • Enhanced Mathematical Reasoning: Trained with GRPO, a method focused on improving mathematical problem-solving.
  • Reinforcement Learning Fine-tuning: Utilizes the TRL library for advanced fine-tuning techniques.
  • Large Context Window: Supports a context length of 32768 tokens, allowing for processing longer inputs and complex problems.

Good For

  • Applications requiring strong mathematical reasoning.
  • Research into reinforcement learning fine-tuning methods for language models.
  • Tasks benefiting from a model with a substantial context window for detailed problem analysis.