ssurface/qwen3-4b-gdpo-length-sft-l1
The ssurface/qwen3-4b-gdpo-length-sft-l1 is a 4 billion parameter Qwen3-based causal language model with a 32768 token context length. It is fine-tuned using SFT and GRPO with a new reward mechanism to specialize in compressed chain-of-thought reasoning at a verbose (Level 1) detail. This model is designed for tasks requiring detailed, step-by-step problem-solving outputs.
Loading preview...
Model Overview
The ssurface/qwen3-4b-gdpo-length-sft-l1 is a 4 billion parameter language model built upon the Qwen3-4B-Instruct architecture. This model has undergone a specialized fine-tuning process to excel in generating verbose, compressed chain-of-thought reasoning, designated as "Level 1 (Verbose)".
Key Capabilities
- Compressed Chain-of-Thought Reasoning: Optimized to produce detailed, step-by-step reasoning in a concise format.
- Verbose Output (Level 1): Specifically tuned to provide a high level of detail in its reasoning explanations.
- Qwen3-4B-Instruct Base: Leverages the foundational capabilities of the Qwen3-4B-Instruct model.
Training Methodology
The model's unique capabilities are a result of a multi-stage training pipeline:
- Initial SFT LoRA: Started with
Qwen/Qwen3-4B-Instruct-2507and applied Supervised Fine-Tuning (SFT) using LoRA, specificallyssurface/qwen3-4b-cot-compress-l1. - GRPO with New Reward: The SFT-merged model was then further fine-tuned using Gradient Regularized Policy Optimization (GRPO) incorporating a novel reward mechanism.
Ideal Use Cases
This model is particularly suited for applications where detailed, yet structured, reasoning is required, such as:
- Problem-solving explanations.
- Educational content generation requiring step-by-step breakdowns.
- Any task benefiting from explicit, verbose reasoning paths.