jordanpainter/qwen_grpo_100

TEXT GENERATION · Concurrency Cost: 1 · Model Size: 8B · Quant: FP8 · Ctx Length: 32k · Published: Mar 16, 2026 · Architecture: Transformer · Cold

jordanpainter/qwen_grpo_100 is an 8 billion parameter language model, fine-tuned from srirag/sft-qwen-all using GRPO (Group Relative Policy Optimization). The model specializes in mathematical reasoning, leveraging techniques introduced in the DeepSeekMath paper, and offers a 32768-token context length for tasks requiring advanced logical and mathematical problem-solving.


Model Overview

jordanpainter/qwen_grpo_100 is an 8 billion parameter language model, fine-tuned from the srirag/sft-qwen-all base model. The fine-tuning was performed with the TRL library using the GRPO (Group Relative Policy Optimization) method.

Key Capabilities

  • Mathematical Reasoning: The model's training with GRPO is inspired by the methodology presented in the DeepSeekMath paper, indicating a strong focus on enhancing mathematical problem-solving abilities.
  • Instruction Following: As a fine-tuned model, it is designed to follow user instructions effectively.
  • Extended Context: Supports a context length of 32768 tokens, allowing for processing longer inputs and maintaining coherence over extended conversations or documents.
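A minimal inference sketch using the Hugging Face transformers library is shown below. The model id comes from this card; the system prompt and the assumption that the model ships a chat template are illustrative, not confirmed by the card.

```python
MODEL_ID = "jordanpainter/qwen_grpo_100"


def build_messages(problem: str) -> list[dict]:
    """Wrap a math problem in a simple chat message list."""
    return [
        {
            "role": "system",
            "content": "You are a careful mathematical reasoner. Show your work.",
        },
        {"role": "user", "content": problem},
    ]


def main() -> None:
    # Heavy imports stay inside main() so the prompt helper is importable
    # without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    messages = build_messages("What is the sum of the first 100 positive integers?")
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the prompt.
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```

With the 32768-token context, the same pattern works for long multi-step problems; only `max_new_tokens` needs adjusting.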

Training Details

The model's training process can be visualized via Weights & Biases, providing insights into its development. The GRPO method, as detailed in the DeepSeekMath research, aims to push the limits of mathematical reasoning in open language models.
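The core idea of GRPO, per the DeepSeekMath paper, is to drop the learned value function of PPO and instead score each completion against a group of completions sampled for the same prompt, using the group-normalized reward as the advantage. A minimal sketch of that group-relative advantage (the KL penalty and clipped policy objective are omitted):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages as used in GRPO.

    For G completions sampled from one prompt, the advantage of completion i
    is (r_i - mean(r)) / std(r): rewards are normalized within the sampling
    group, so no learned critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one math problem, reward 1.0 if correct.
# Correct answers receive positive advantage, incorrect ones negative.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In practice this computation is handled by TRL's GRPO trainer; the sketch only makes the "group relative" part of the name concrete.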

Good For

  • Applications requiring strong mathematical and logical reasoning.
  • Tasks benefiting from advanced instruction following and extended context understanding.
  • Research and development in reinforcement learning for language models, particularly those focused on mathematical domains.