ssurface/qwen3-4b-gdpo-length-sft-l1

TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Jul 1, 2026License:apache-2.0Architecture:Transformer Open Weights Cold

The ssurface/qwen3-4b-gdpo-length-sft-l1 is a 4 billion parameter Qwen3-based causal language model with a 32768 token context length. It is fine-tuned using SFT and GRPO with a new reward mechanism to specialize in compressed chain-of-thought reasoning at a verbose (Level 1) detail. This model is designed for tasks requiring detailed, step-by-step problem-solving outputs.

Loading preview...

Model Overview

The ssurface/qwen3-4b-gdpo-length-sft-l1 is a 4 billion parameter language model built upon the Qwen3-4B-Instruct architecture. This model has undergone a specialized fine-tuning process to excel in generating verbose, compressed chain-of-thought reasoning, designated as "Level 1 (Verbose)".

Key Capabilities

  • Compressed Chain-of-Thought Reasoning: Optimized to produce detailed, step-by-step reasoning in a concise format.
  • Verbose Output (Level 1): Specifically tuned to provide a high level of detail in its reasoning explanations.
  • Qwen3-4B-Instruct Base: Leverages the foundational capabilities of the Qwen3-4B-Instruct model.

Training Methodology

The model's unique capabilities are a result of a multi-stage training pipeline:

  1. Initial SFT LoRA: Started with Qwen/Qwen3-4B-Instruct-2507 and applied Supervised Fine-Tuning (SFT) using LoRA, specifically ssurface/qwen3-4b-cot-compress-l1.
  2. GRPO with New Reward: The SFT-merged model was then further fine-tuned using Gradient Regularized Policy Optimization (GRPO) incorporating a novel reward mechanism.

Ideal Use Cases

This model is particularly suited for applications where detailed, yet structured, reasoning is required, such as:

  • Problem-solving explanations.
  • Educational content generation requiring step-by-step breakdowns.
  • Any task benefiting from explicit, verbose reasoning paths.