Name: zhaohq/PureRL-7B-v7-stage1-reasoning-qa API
Brand: Featherless.ai
Price: 10.00 USD
Availability: InStock
Author: zhaohq

Model Overview

zhaohq/PureRL-7B-v7-stage1-reasoning-qa is a 7.6 billion parameter language model fine-tuned from Qwen/Qwen2.5-7B. It was trained with the TRL (Transformer Reinforcement Learning) library using GRPO (Group Relative Policy Optimization), a reinforcement learning method.

Key Capabilities

Enhanced Reasoning: The model is optimized for reasoning tasks, applying the reinforcement learning approach to mathematical reasoning described in the DeepSeekMath paper.
Question Answering: It targets complex question-answering scenarios, making it suitable for applications that require analytical responses.
GRPO Training: The model was trained with GRPO, the method introduced in the DeepSeekMath paper, to improve its reasoning abilities.
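
To make the GRPO reference above more concrete, here is a minimal sketch of the group-relative advantage at the core of the method: several completions are sampled for the same prompt, each is scored with a reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the epsilon value, and the use of the population standard deviation are assumptions made for illustration; this is not the training code used for this model, and TRL's own implementation may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Illustrative only: `eps` and the population standard deviation are
    assumptions, not values taken from the model card.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Completions that beat the group average get a positive advantage,
    # those below it get a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers (1.0 = correct).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline is the group's own mean reward, this formulation does not need a separately trained value model. If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.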

Training Details

The training run was tracked with Weights & Biases, where it can be visualized. The following framework versions were used:

TRL: 0.16.0.dev0
Transformers: 4.57.6
PyTorch: 2.10.0
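
When trying to reproduce the environment, it can help to compare the locally installed packages against the versions listed above. The sketch below does this with the standard library's importlib.metadata. The distribution names trl, transformers, and torch are assumed to be the PyPI names of the listed frameworks, and the helper name compare_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in the model card; the keys are assumed PyPI names.
PINNED = {
    "trl": "0.16.0.dev0",
    "transformers": "4.57.6",
    "torch": "2.10.0",
}


def compare_versions(pinned=PINNED):
    """Return {name: (expected, installed, matches)} for each package."""
    report = {}
    for name, expected in pinned.items():
        try:
            # Drop a local build suffix such as "+cu126" before comparing.
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed, installed == expected)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in compare_versions().items():
        status = "ok" if ok else "differs"
        print(f"{name}: expected {expected}, found {installed} ({status})")
```

A mismatch does not necessarily prevent the model from loading; the check only reports where the local environment differs from the one listed.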

The model is aimed at use cases that demand multi-step reasoning and accurate question answering, where GRPO fine-tuning is intended to build on the strengths of the Qwen2.5-7B base model.
