zhaohq/PureRL-1.5B-v7-s2-async-l2-maskon-afew

Text Generation | Concurrency Cost: 1 | Model Size: 1.5B | Quant: BF16 | Ctx Length: 32k | Published: May 20, 2026 | Architecture: Transformer | Warm

The zhaohq/PureRL-1.5B-v7-s2-async-l2-maskon-afew model is a 1.5 billion parameter language model, fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-A-fewshot using the TRL framework. It was trained with GRPO, the method introduced in the DeepSeekMath paper, and targets mathematical reasoning and problem-solving. It has a context length of 32768 tokens.


Model Overview

zhaohq/PureRL-1.5B-v7-s2-async-l2-maskon-afew is a 1.5 billion parameter language model fine-tuned from zhaohq/PureRL-1.5B-v7-stage1-A-fewshot with the TRL (Transformer Reinforcement Learning) framework.
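As a rough illustration of how a model like this is typically loaded, the sketch below uses the Transformers `pipeline` API with BF16 weights, matching the precision listed for this model. The `build_prompt` helper and its "Question: ... Answer:" layout are hypothetical: the card does not document a prompt format, so treat that layout as a placeholder rather than the format used in training.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-async-l2-maskon-afew"


def build_prompt(question: str) -> str:
    """Hypothetical plain-text prompt layout; the card does not specify one."""
    return f"Question: {question}\nAnswer:"


def generate_answer(question: str, max_new_tokens: int = 512) -> str:
    """Load the model and return only the newly generated text."""
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    import torch
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model=MODEL_ID,
        torch_dtype=torch.bfloat16,  # BF16, as listed in the model metadata
    )
    outputs = generator(
        build_prompt(question),
        max_new_tokens=max_new_tokens,
        return_full_text=False,  # drop the prompt from the returned string
    )
    return outputs[0]["generated_text"]
```

Calling `generate_answer("What is 17 * 24?")` downloads the weights on first use. Without a `device` argument the pipeline normally runs on CPU, so passing a device to `pipeline` is advisable when a GPU is available.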

Key Capabilities

  • Mathematical Reasoning: This model was trained using GRPO (Group Relative Policy Optimization), the reinforcement learning method introduced in the DeepSeekMath paper. This training approach aims to improve its ability to solve complex mathematical reasoning problems.
  • Extended Context: It supports a substantial context length of 32768 tokens, allowing for processing longer inputs and maintaining coherence over extended dialogues or problem descriptions.
  • Instruction Following: As a fine-tuned model, it is intended to follow instructions given in the prompt, making it suitable for prompt-based applications.
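The 32768-token context figure above can be turned into a simple generation budget. The sketch below assumes, as is standard for decoder-only models, that prompt tokens and generated tokens share the same window. The function name and signature are illustrative and do not come from any library.

```python
CONTEXT_LENGTH = 32768  # context length stated for this model


def max_new_tokens_for(prompt_tokens: int, requested: int,
                       context_length: int = CONTEXT_LENGTH) -> int:
    """Cap the requested generation length so prompt + output fit in the window."""
    if prompt_tokens >= context_length:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested, context_length - prompt_tokens)
```

For example, a 30000-token prompt leaves room for at most 2768 new tokens, so a request for 4096 is capped to that value.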

Training Details

The model was trained with GRPO, the method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". Training used TRL 0.16.0.dev0, Transformers 4.48.3, PyTorch 2.5.1, Datasets 4.0.0, and Tokenizers 0.21.1. Further details on the training run can be visualized via Weights & Biases.
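To give a feel for what GRPO does, the sketch below computes the group-relative advantage at the core of the method: several completions are sampled for the same prompt, each is scored by a reward function, and each reward is normalized by the mean and standard deviation of its group. This is a simplified illustration, not this model's training code. The use of the population standard deviation and the `eps` value are assumptions, and the reward function used for this model is not described in the card.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the other completions for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    eps (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards such as `[1, 0, 0, 1]` for four sampled answers, the correct answers receive positive advantages and the incorrect ones negative advantages. If every answer in a group gets the same reward, all advantages are zero and that group contributes no learning signal.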

Good For

  • Applications requiring strong mathematical reasoning.
  • Tasks benefiting from a model trained with reinforcement learning (GRPO).
  • Scenarios where a balance between model size (1.5B parameters) and specialized reasoning capabilities is desired.