Name: zhaohq/PureRL-1.5B-v7-s2-margin-maskoff API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: zhaohq

Overview

zhaohq/PureRL-1.5B-v7-s2-margin-maskoff is a 1.5-billion-parameter language model fine-tuned with the TRL (Transformer Reinforcement Learning) library. It was trained with GRPO (Group Relative Policy Optimization), the method introduced in the DeepSeekMath paper to improve mathematical reasoning in large language models.
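
As a rough illustration of the central idea in GRPO as the DeepSeekMath paper describes it: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its own group, which stands in for a learned value model. The sketch below is not this model's training code. The function name, the reward values, the use of the population standard deviation, and the `eps` term are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a trainer may use another estimator
    # eps keeps the division finite when every sample in the group has the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one math prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above their group's average receive a positive advantage and those below receive a negative one, so the policy update favors the better answers within each group.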

Key Capabilities

Enhanced Mathematical Reasoning: Trained with GRPO, a method designed to improve performance on mathematical problems and multi-step logical deduction.
Reinforcement Learning Fine-tuning: Fine-tuned with reinforcement learning using the TRL library.
Moderate Parameter Count: At 1.5 billion parameters, it needs less compute and memory than larger models, balancing performance against efficiency.

Training Details

The model was trained with GRPO, as described in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The training environment used TRL 0.16.0.dev0, Transformers 4.48.3, PyTorch 2.5.1, Datasets 4.0.0, and Tokenizers 0.21.1.
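
When reproducing or debugging behavior, it can help to compare a local environment against the framework versions listed above. The following is a minimal sketch using only the Python standard library. The PyPI package names in `TRAINING_VERSIONS` and the helper names are assumptions, and matching versions is not known to be a requirement for running the model.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in the training details; the PyPI package names are assumptions.
TRAINING_VERSIONS = {
    "trl": "0.16.0.dev0",
    "transformers": "4.48.3",
    "torch": "2.5.1",
    "datasets": "4.0.0",
    "tokenizers": "0.21.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs to (expected, installed)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

Calling `version_mismatches(TRAINING_VERSIONS)` returns an empty dictionary when the local environment matches the listed versions exactly.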

Good For

Applications requiring strong mathematical problem-solving.
Tasks involving logical reasoning and quantitative analysis.
Researchers and developers interested in models fine-tuned with advanced reinforcement learning techniques like GRPO.
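
For readers who want to try the model on a math problem, a minimal local-inference sketch with the Transformers library might look like the following. It assumes the weights are published on the Hugging Face Hub under the model name shown on this page and load as a standard causal language model. The prompt wording, the helper names, and the `max_new_tokens` default are hypothetical choices, and hosted inference through Featherless is not shown.

```python
MODEL_ID = "zhaohq/PureRL-1.5B-v7-s2-margin-maskoff"


def build_prompt(problem):
    """Wrap a problem in a plain instruction; the wording is a hypothetical choice."""
    return (
        "Solve the following problem step by step.\n\n"
        f"Problem: {problem}\n\nSolution:"
    )


def solve(problem, max_new_tokens=512):
    """Generate a solution locally; downloads the weights on first use."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `solve("What is the sum of the first 10 positive integers?")` would download the model and return the generated text. If the model ships a chat template, formatting the prompt through the tokenizer's template may give better results than this plain prompt.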
