clijo/qwen3-4b-instruct-2507-bf16-reco-grpo-b200-rapid-lime-orbit

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jun 23, 2026Architecture:Transformer Cold

The clijo/qwen3-4b-instruct-2507-bf16-reco-grpo-b200-rapid-lime-orbit model is a 4 billion parameter instruction-tuned language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It was trained using the TRL framework and incorporates the GRPO method, which is designed to enhance mathematical reasoning capabilities. This model is optimized for tasks requiring strong logical and mathematical problem-solving, offering a 32768 token context length.

Loading preview...

Model Overview

This model, clijo/qwen3-4b-instruct-2507-bf16-reco-grpo-b200-rapid-lime-orbit, is an instruction-tuned variant of the Qwen3-4B-Instruct-2507 base model, featuring 4 billion parameters and a substantial 32768 token context window. It was developed using the TRL (Transformers Reinforcement Learning) framework.

Key Differentiator: GRPO Training

A significant aspect of this model's training is the application of GRPO (Gradient-based Reward Optimization). This method, introduced in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), is specifically designed to improve a model's proficiency in mathematical reasoning tasks. This suggests an enhanced capability for handling complex numerical and logical problems compared to models not trained with this technique.

Intended Use Cases

Given its fine-tuning with GRPO, this model is particularly well-suited for applications that demand robust mathematical and logical reasoning. Developers looking for a compact yet capable model for tasks such as:

  • Solving mathematical word problems
  • Generating logical explanations
  • Assisting with data analysis interpretations

This model provides a specialized option within the 4B parameter class, leveraging advanced training techniques for specific performance gains.