Name: zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b0 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: zhaohq

Model Overview

This model, zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b0, is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned iteration of the zhaohq/PureRL-1.5B-v7-stage1-reasoning model, indicating a specialized focus on refining reasoning abilities.

Key Capabilities & Training

Enhanced Reasoning: The model's training incorporates the GRPO (Gradient-based Reward Policy Optimization) method, as introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). This suggests a strong emphasis on improving mathematical and general reasoning performance.
Fine-tuned with TRL: It was trained using the TRL (Transformer Reinforcement Learning) framework, a common approach for aligning language models with human preferences or specific task objectives.

Good For

Mathematical Reasoning Tasks: Given its foundation and the application of the GRPO method, this model is particularly suited for applications requiring robust mathematical problem-solving and reasoning.
Research and Development: Developers and researchers interested in exploring models fine-tuned with advanced reinforcement learning techniques like GRPO for reasoning tasks may find this model valuable.

Overview

Model Overview

Key Capabilities & Training

Good For

Full Model Card (README)