ssurface/qwen3-4b-gdpo-length-sft-l5

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jul 1, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

The ssurface/qwen3-4b-gdpo-length-sft-l5 is a 4 billion parameter Qwen3-based language model, fine-tuned using SFT and GRPO with a new reward mechanism. It is specifically optimized for compressed chain-of-thought reasoning at an extreme Level 5. This model excels at complex problem-solving requiring highly efficient and structured reasoning processes.

Loading preview...

Overview

This model, ssurface/qwen3-4b-gdpo-length-sft-l5, is a 4 billion parameter variant of the Qwen3-Instruct architecture. It has undergone a specialized fine-tuning process involving Supervised Fine-Tuning (SFT) followed by Gradient-based Reward Policy Optimization (GRPO) with a novel reward function. The primary goal of this training pipeline is to enhance the model's ability to perform compressed chain-of-thought reasoning at an advanced, "Level 5 (Extreme)" proficiency.

Key Capabilities

  • Extreme Compressed Chain-of-Thought Reasoning: Designed to generate highly efficient and concise reasoning steps for complex problems.
  • Qwen3-4B-Instruct Base: Leverages the strong foundational capabilities of the Qwen3-4B-Instruct model.
  • Advanced Fine-tuning: Utilizes a multi-stage training approach (SFT then GRPO with a new reward) for specialized performance.

Good For

  • Applications requiring highly efficient and structured reasoning outputs.
  • Scenarios where verbose chain-of-thought is undesirable, favoring compressed logical steps.
  • Complex problem-solving tasks that benefit from advanced reasoning capabilities.