Hi-Satoh/adv_sft_dpo_final_7_merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Hi-Satoh/adv_sft_dpo_final_7_merged is a 4 billion parameter causal language model developed by Hi-Satoh, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to improve reasoning capabilities, particularly Chain-of-Thought, and enhance structured response quality. It is designed for tasks requiring aligned and coherent outputs based on preference datasets.

Loading preview...

Model Overview

Hi-Satoh/adv_sft_dpo_final_7_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model.

Key Optimizations

This model's primary objective during training was to enhance its ability to generate reasoned responses (Chain-of-Thought) and produce high-quality structured outputs. This was achieved by aligning the model's behavior with preferred examples through DPO, utilizing a specific preference dataset.

Training Details

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • Methodology: Direct Preference Optimization (DPO)
  • Epochs: 1
  • Learning Rate: 1e-06
  • Beta: 0.1
  • Maximum Sequence Length: 4096 tokens
  • LoRA Configuration: r=8, alpha=16 (merged)

Intended Use Cases

This model is particularly well-suited for applications where improved reasoning, coherent thought processes, and structured output generation are critical. Its DPO-based fine-tuning aims to provide more aligned and preferred responses compared to its base model, making it valuable for tasks requiring nuanced and well-organized text generation.