clijo/qwen3-4b-instruct-2507-bf16-reco-grpo-b200-rapid-orange-quartz

TEXT GENERATION | Concurrency Cost: 1 | Model Size: 4B | Quant: BF16 | Ctx Length: 32k | Tool Calling: Supported | Published: Jun 25, 2026 | Architecture: Transformer | Cold

This is a 4-billion-parameter instruction-tuned language model, fine-tuned by clijo from Qwen/Qwen3-4B-Instruct-2507. It was trained with GRPO, a reinforcement learning method introduced in the DeepSeekMath research to improve mathematical reasoning. The model targets mathematical problem-solving and logical deduction, and it supports a 32768-token context length.


Model Overview

This model, clijo/qwen3-4b-instruct-2507-bf16-reco-grpo-b200-rapid-orange-quartz, is a fine-tuned variant of the instruction-tuned Qwen3-4B-Instruct-2507 model. It has 4 billion parameters and supports a context length of 32768 tokens, which leaves room for long inputs and multi-step queries.

Key Capabilities

  • Enhanced Mathematical Reasoning: The model was fine-tuned using GRPO (Group Relative Policy Optimization), the method introduced in the DeepSeekMath research. This training approach is intended to improve performance on mathematical reasoning tasks.
  • Instruction Following: As an instruction-tuned model, it is designed to accurately understand and execute user instructions, providing relevant and coherent responses.
  • Foundation Model: Built upon the Qwen3-4B-Instruct-2507 architecture, it inherits a strong foundation for general language understanding and generation.
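
The core idea behind GRPO can be sketched in a few lines: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its own group, so no separate value model is needed. The snippet below is a minimal illustration only, not this model's training code. The function name, the reward values, the `eps` constant, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `eps` is an assumed small constant that avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above their group's average receive a positive advantage and are reinforced; those below receive a negative one. When all completions in a group score the same, every advantage is zero and the group contributes no learning signal.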

Training Details

The fine-tuning process used the TRL (Transformer Reinforcement Learning) library. GRPO was introduced in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (arXiv:2402.03300), and its use here indicates a focus on specialized reasoning abilities. The model may therefore suit applications where precise logical and mathematical outputs are critical.
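
The model card does not say which reward signal was used during training. As a purely hypothetical illustration of what a GRPO reward for math problems can look like, the sketch below scores a completion 1.0 when the last number it contains matches a reference answer. It follows the general shape TRL documents for custom reward functions (a callable that receives `completions` plus extra dataset columns as keyword arguments and returns a list of floats), and assumes plain-string completions. The function name and the `answer` column are assumptions, not part of this model's actual setup.

```python
import re


def exact_answer_reward(completions, answer, **kwargs):
    """Return 1.0 per completion whose last number equals the reference, else 0.0.

    `completions` is a list of generated strings; `answer` is a hypothetical
    dataset column holding one reference answer per completion.
    """
    rewards = []
    for text, ref in zip(completions, answer):
        # Collect every integer or decimal in the completion, keep the last one.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
        matched = bool(numbers) and float(numbers[-1]) == float(ref)
        rewards.append(1.0 if matched else 0.0)
    return rewards


# Hypothetical completions and reference answers for a single prompt.
sample_rewards = exact_answer_reward(
    ["The total is 42.", "I think it is 41", "no idea"],
    ["42", "42", "42"],
)
print(sample_rewards)
```

A binary reward like this is easy to verify but sparse; whether the actual training used something similar is not stated in the source.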