Hi-Satoh/adv_sft_dpo_final_8_merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Hi-Satoh/adv_sft_dpo_final_8_merged is a 4 billion parameter causal language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. This model is specifically optimized to improve reasoning capabilities (Chain-of-Thought) and structured response quality. It excels in generating aligned responses based on preferred outputs, making it suitable for tasks requiring high-quality, structured text generation.

Loading preview...

Model Overview

Hi-Satoh/adv_sft_dpo_final_8_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library, integrating the full-merged 16-bit weights directly, eliminating the need for adapter loading.

Key Optimizations

This model's primary objective was to enhance its ability to produce preferred outputs, specifically focusing on:

  • Improved Reasoning: Optimized for Chain-of-Thought (CoT) capabilities.
  • Structured Response Quality: Enhanced generation of well-structured and aligned text based on preference datasets.

Training Details

The DPO training involved:

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Method: Direct Preference Optimization (DPO)
  • Epochs: 1
  • Learning Rate: 5e-07
  • Beta: 0.1
  • Max Sequence Length: 4096
  • LoRA Configuration: r=8, alpha=16 (weights merged into the base model)

Usage and Licensing

The model can be loaded using the transformers library with torch.float16 for efficient inference. It was trained on the Hi-Satoh/test_dpo_dataset and is released under the MIT License, with users also required to comply with the original base model's license terms.