Hi-Satoh/adv_sft_dpo_final_13_merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kTool Calling:SupportedPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Hi-Satoh/adv_sft_dpo_final_13_merged is a 4 billion parameter causal language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). Developed by Hi-Satoh, this model is optimized to improve reasoning capabilities through Chain-of-Thought and enhance structured response quality. It is designed for applications requiring aligned and high-quality text generation based on preferred outputs.

Loading preview...

Model Overview

This model, developed by Hi-Satoh, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its full 16-bit weights merged into the base model.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning, enabling more logical and step-by-step responses.
  • Structured Output Quality: Focuses on generating higher quality, more structured responses based on preference datasets.
  • DPO Alignment: Utilizes DPO to align model outputs with preferred examples, leading to more desirable and controlled text generation.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. It was trained with a maximum sequence length of 4096 tokens, using a LoRA configuration (r=8, alpha=16) that was subsequently merged. The training data used is [Hi-Satoh/test_dpo_dataset].

Licensing

This model is released under the MIT License, consistent with the dataset terms. Users must also adhere to the original base model's license terms.