Hi-Satoh/adv_sft_dpo_final_1_merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Feb 28, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Hi-Satoh/adv_sft_dpo_final_1_merged is a 4 billion parameter instruction-tuned causal language model developed by Hi-Satoh. It is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507, optimized using Direct Preference Optimization (DPO) to enhance reasoning (Chain-of-Thought) and structured response quality. This model is designed for tasks requiring improved alignment with preferred outputs and better logical coherence.

Loading preview...

Overview

Hi-Satoh/adv_sft_dpo_final_1_merged is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, focusing on quality improvements.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning abilities.
  • Structured Response Quality: Designed to produce more coherent and structured outputs.
  • DPO Fine-tuning: Utilizes DPO with a specific preference dataset (Hi-Satoh/test_dpo_dataset) for better alignment.
  • Full-merged Weights: Contains full-merged 16-bit weights, eliminating the need for adapter loading.

Training Configuration Highlights

  • Method: Direct Preference Optimization (DPO)
  • Epochs: 1
  • Learning Rate: 5e-07
  • Max Sequence Length: 4096 tokens

Usage Considerations

This model is suitable for applications where improved reasoning and structured, aligned responses are critical. Users should be aware that the model's license follows the MIT License, as per the training data, and compliance with the original base model's license terms is required.