Sumiokashi/qwen3-4b-structured-3k-mix-sft_lora-dpo-qwen-cot-merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Sumiokashi/qwen3-4b-structured-3k-mix-sft_lora-dpo-qwen-cot-merged is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to improve reasoning capabilities through Chain-of-Thought and enhance structured response quality. It is designed for applications requiring aligned and coherent outputs, particularly in complex reasoning tasks.

Loading preview...

Model Overview

This model, developed by Sumiokashi, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, resulting in a fully merged 16-bit weight model that requires no adapter loading.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, making it suitable for tasks requiring logical progression.
  • Structured Responses: Fine-tuned to produce higher quality structured outputs, aligning with preferred response formats.
  • DPO Alignment: Utilizes DPO to align model responses with desired outputs, based on a specific preference dataset.

Training Details

The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.1. The maximum sequence length used during training was 1024 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model.

Good For

  • Applications requiring improved reasoning and logical output.
  • Scenarios where structured and aligned responses are critical.
  • Developers looking for a DPO-optimized Qwen3-4B variant for specific alignment tasks.