Thrillcrazyer/Qwen-7B_NOTAC_GRPO

Text generation · Concurrency cost: 1 · Model size: 7.6B · Quantization: FP8 · Context length: 32k · Published: Jan 7, 2026 · Architecture: Transformer

Thrillcrazyer/Qwen-7B_NOTAC_GRPO is a 7.6 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen2.5-7B-Instruct. This model specializes in mathematical reasoning, having been trained on the DeepMath-103k dataset using the GRPO method. It is optimized for tasks requiring advanced mathematical problem-solving capabilities.


Model Overview

Thrillcrazyer/Qwen-7B_NOTAC_GRPO is a 7.6 billion parameter language model derived from the Qwen/Qwen2.5-7B-Instruct architecture. Its primary distinction lies in its specialized fine-tuning for mathematical reasoning tasks.

Key Capabilities

  • Enhanced Mathematical Reasoning: The model has been specifically trained on the DeepMath-103k dataset, which focuses on complex mathematical problems.
  • GRPO Training Method: It leverages GRPO (Group Relative Policy Optimization), as introduced in the DeepSeekMath research, to improve its mathematical problem-solving abilities.
  • Instruction-tuned Base: Built upon an instruction-tuned base model, it is designed to follow user prompts effectively.

Training Details

The model was fine-tuned using the TRL (Transformer Reinforcement Learning) framework. Its training procedure centered on GRPO, the reinforcement-learning method detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models."
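As a rough sketch of the core idea (not the repository's actual training code), GRPO samples a group of completions per prompt and, instead of a learned value baseline, computes group-relative advantages: each completion's reward is normalized by the group's mean and standard deviation. A minimal illustration in Python:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard
    deviation, as in GRPO (Group Relative Policy Optimization)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: rewards for 4 sampled solutions to one math
# problem (1.0 = verifier accepted the final answer, 0.0 = rejected).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group's average receive positive advantages and are reinforced; below-average ones are penalized, which is what drives the policy toward more reliable mathematical reasoning.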

Ideal Use Cases

This model is particularly well-suited for applications requiring:

  • Solving mathematical problems and equations.
  • Generating explanations for mathematical concepts.
  • Assisting in educational tools focused on mathematics.
  • Research in advanced mathematical reasoning with LLMs.
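For inference, the Qwen2.5-Instruct family serializes conversations in the ChatML format. Assuming this fine-tune inherits its base model's chat template (an assumption worth verifying against the repository's tokenizer configuration), a math prompt would be built roughly like this:

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Serialize a single-turn conversation in the ChatML format used
    by Qwen2.5-Instruct models (assumed inherited by this fine-tune)."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a helpful assistant specialized in mathematics.",
    "Solve for x: 2x + 6 = 14. Show your reasoning.",
)
```

In practice, `tokenizer.apply_chat_template` from Hugging Face Transformers produces this string automatically from a list of role/content messages, so manual formatting is only needed when working outside that library.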