yangerine/grpo-baseline-lr1e5-l1

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 31, 2026Architecture:Transformer0.0K Cold

The yangerine/grpo-baseline-lr1e5-l1 model is a 4 billion parameter language model fine-tuned from Qwen/Qwen3-4B. It was trained using the GRPO method, which is designed to enhance mathematical reasoning capabilities in large language models. This model is optimized for tasks requiring advanced mathematical problem-solving and logical deduction. Its primary strength lies in its ability to process and generate responses for complex mathematical and reasoning-based queries.

Loading preview...

Model Overview

The yangerine/grpo-baseline-lr1e5-l1 is a 4 billion parameter language model, fine-tuned from the robust Qwen/Qwen3-4B architecture. This model distinguishes itself through its specialized training methodology: it was developed using GRPO (Gradient Regularized Policy Optimization), a technique introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300).

Key Capabilities

  • Enhanced Mathematical Reasoning: The core differentiator of this model is its optimization for mathematical problem-solving and logical deduction, stemming from its GRPO training.
  • Qwen3-4B Foundation: Benefits from the strong base capabilities of the Qwen3-4B model, providing a solid general language understanding and generation foundation.
  • TRL Framework: Trained using the TRL (Transformers Reinforcement Learning) library, indicating a focus on instruction following and alignment.

Good For

  • Mathematical Tasks: Ideal for applications requiring precise mathematical calculations, proofs, and problem-solving.
  • Reasoning-Intensive Queries: Suitable for scenarios where logical inference and structured thinking are paramount.
  • Research and Development: Provides a strong baseline for further experimentation and fine-tuning on specific mathematical or reasoning datasets.