Name: zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b1 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: zhaohq

Model Overview

This model, zhaohq/PureRL-1.5B-v7-s2-l2-kl-w3-b1, is a 1.5 billion parameter language model developed by zhaohq. It is a fine-tuned iteration of the zhaohq/PureRL-1.5B-v7-stage1-reasoning base model, leveraging a substantial 32768 token context window.

Key Training Details

The model's training procedure is notable for its use of GRPO (Generalized Reinforcement Learning from Policy Optimization). This method was originally introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The application of GRPO suggests a focus on enhancing the model's reasoning abilities, potentially making it more robust for complex problem-solving tasks.

Training was conducted using the TRL library, with specific framework versions including TRL 0.16.0.dev0, Transformers 4.48.3, Pytorch 2.5.1+cu124, Datasets 4.0.0, and Tokenizers 0.21.1.

Potential Use Cases

Given its fine-tuning with a method aimed at improving reasoning, this model could be particularly suitable for:

General text generation: Capable of generating coherent and contextually relevant responses.
Reasoning-intensive tasks: Potentially performs well on tasks requiring logical deduction or problem-solving, especially those with a mathematical or structured reasoning component, due to its GRPO training.
Long-context applications: Its 32768 token context length allows for processing and generating longer texts while maintaining coherence.

Overview

Model Overview

Key Training Details

Potential Use Cases

Full Model Card (README)