Name: zhaohq/PureRL-1.5B-v7-s2-l2-kl-w0-b0 API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: zhaohq

Model Overview

zhaohq/PureRL-1.5B-v7-s2-l2-kl-w0-b0 is a 1.5-billion-parameter language model built on the zhaohq/PureRL-1.5B-v7-stage1-reasoning base. It has a context length of 32768 tokens, which makes it suitable for processing longer inputs and for keeping extended multi-turn conversations coherent.
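
As a rough illustration of what a 32768-token context length means in practice, the sketch below computes how many tokens remain for generation once a prompt is counted. It assumes the context window is shared between the prompt and the generated output, which is the usual arrangement for decoder-only models but is not stated in this card. The helper name `max_new_tokens_allowed` is hypothetical.

```python
# Context length taken from the model description above.
CONTEXT_LENGTH = 32768


def max_new_tokens_allowed(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens can still be generated after the prompt.

    Assumption: prompt and generated tokens share one context window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # Clamp at zero: a prompt longer than the window leaves no room to generate.
    return max(context_length - prompt_tokens, 0)


remaining = max_new_tokens_allowed(30000)
```

A prompt that already exceeds the window would need to be truncated or split before it is sent.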

Key Training Details

This model was fine-tuned with GRPO (Group Relative Policy Optimization), a reinforcement learning method introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300). The choice of GRPO suggests a focus on improving the model's handling of complex reasoning tasks, particularly in mathematical domains. Training was conducted using the TRL library.
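
The core idea of GRPO can be sketched as follows: several completions are sampled for the same prompt, each receives a reward, and each completion's advantage is its reward normalized against the group's mean and standard deviation. This is a minimal illustration of that normalization step only, not the training code used for this model. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for the sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` (an assumption of this sketch) avoids division by zero
    when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled for one prompt
# (for example, 1.0 if the final answer is correct, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group average get a positive advantage and are reinforced, while those below get a negative one. Because the baseline comes from the group itself, no separate value model is needed.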

Potential Use Cases

Given its training methodology and base model, this model is likely well-suited for:

Mathematical Reasoning: Tasks requiring logical deduction, problem-solving, and numerical understanding.
Complex Question Answering: Handling intricate questions that demand multi-step reasoning.
Long-Context Applications: Scenarios where understanding and generating text over extended conversations or documents is crucial.
