Hi-Satoh/adv_sft_dpo_final_6_merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Feb 28, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Hi-Satoh/adv_sft_dpo_final_6_merged is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model specializes in improving reasoning (Chain-of-Thought) and structured response quality. It is optimized for aligning responses with preferred outputs based on its training dataset.

Loading preview...

Model Overview

This model, Hi-Satoh/adv_sft_dpo_final_6_merged, is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged.

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning abilities.
  • Structured Responses: Focuses on generating higher quality, structured outputs.
  • Preference Alignment: Trained with DPO to align responses more closely with desired preferences.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-06 and a beta value of 0.5. It utilized a maximum sequence length of 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged directly into the base model, meaning no adapter loading is required for usage.

Usage Considerations

Users should be aware that the model's license is MIT, as per its training dataset, and must also comply with the original base model's license terms.