clijo/qwen3-4b-instruct-2507-bf16-reco-grpo-b200-rapid-red-summit

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jun 18, 2026Architecture:Transformer Cold

The clijo/qwen3-4b-instruct-2507-bf16-reco-grpo-b200-rapid-red-summit is a 4 billion parameter instruction-tuned language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It leverages the GRPO training method, as introduced in the DeepSeekMath paper, to enhance its capabilities. This model is specifically optimized for tasks requiring advanced reasoning, particularly in mathematical contexts, and is suitable for applications demanding precise and logical responses within its 32K context window.

Loading preview...

Model Overview

This model, clijo/qwen3-4b-instruct-2507-bf16-reco-grpo-b200-rapid-red-summit, is a fine-tuned variant of the Qwen3-4B-Instruct-2507 base model. It has been specifically trained using the GRPO (Gradient-based Reward Policy Optimization) method, a technique highlighted in the research paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". This training approach aims to improve the model's ability to handle complex reasoning tasks.

Key Capabilities

  • Enhanced Reasoning: Benefits from GRPO training, suggesting improved performance in tasks requiring logical deduction and problem-solving.
  • Instruction Following: As an instruction-tuned model, it is designed to accurately interpret and execute user prompts.
  • Mathematical Aptitude: The underlying GRPO method's origin in mathematical reasoning research implies a potential strength in mathematical and logical tasks.

Training Details

The model was fine-tuned using the TRL (Transformers Reinforcement Learning) library. The training process utilized specific versions of frameworks including TRL 1.5.1, Transformers 5.9.0, Pytorch 2.11.0+cu130, Datasets 4.8.5, and Tokenizers 0.22.2. Further details on the training run are available via Weights & Biases.

Good For

  • Applications requiring robust instruction following.
  • Tasks that benefit from enhanced reasoning capabilities, especially those with a mathematical or logical component.
  • Developers looking for a 4B parameter model with specialized training for complex problem-solving.