Hi-Satoh/adv_sft_dpo_final_5_merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 28, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Hi-Satoh/adv_sft_dpo_final_5_merged is a 4 billion parameter instruction-tuned causal language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via the Unsloth library. This model is specifically optimized to improve reasoning capabilities through Chain-of-Thought and enhance structured response quality. It features a 32768 token context length and is designed for tasks requiring aligned, high-quality outputs based on preference datasets.

Loading preview...

Model Overview

This model, Hi-Satoh/adv_sft_dpo_final_5_merged, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been further fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to enhance its response quality and alignment.

Key Capabilities

  • Improved Reasoning: Optimized to generate better Chain-of-Thought reasoning, leading to more coherent and logical outputs.
  • Enhanced Structured Responses: Focuses on producing high-quality, structured answers based on preference datasets.
  • DPO Fine-tuning: Leverages DPO to align model behavior with preferred human outputs, improving overall utility.
  • Full-merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.5. It supports a maximum sequence length of 4096 tokens during training. The LoRA configuration (r=8, alpha=16) was merged into the base model.

Good For

  • Applications requiring models with strong reasoning abilities.
  • Use cases where structured and aligned responses are critical.
  • Tasks benefiting from preference-tuned outputs.

Licensing

This model is released under the MIT License, consistent with its training dataset. Users must also adhere to the license terms of the original base model, Qwen/Qwen3-4B-Instruct-2507.