Hi-Satoh/adv_sft_dpo_final_10_merged

Hugging Face
TEXT GENERATIONConcurrency Cost:1Model Size:4BQuant:BF16Ctx Length:32kPublished:Mar 1, 2026License:apache-2.0Architecture:Transformer Open Weights Warm

Hi-Satoh/adv_sft_dpo_final_10_merged is a 4 billion parameter causal language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 by Hi-Satoh. Utilizing Direct Preference Optimization (DPO) via Unsloth, this model is specifically optimized to improve reasoning (Chain-of-Thought) and structured response quality. It offers enhanced alignment with preferred outputs, making it suitable for tasks requiring precise and well-reasoned answers.

Loading preview...

Model Overview

Hi-Satoh/adv_sft_dpo_final_10_merged is a 4 billion parameter language model developed by Hi-Satoh. It is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model, enhanced through Direct Preference Optimization (DPO) using the Unsloth library. This model provides full-merged 16-bit weights, eliminating the need for adapter loading.

Key Capabilities

  • Improved Reasoning: Optimized to enhance Chain-of-Thought reasoning abilities.
  • Structured Response Quality: Focuses on generating higher quality, more structured outputs.
  • Preference Alignment: Aligned with preferred outputs based on a specific preference dataset.

Training Details

The model was trained for 1 epoch with a learning rate of 7e-07 and a beta value of 0.1. The maximum sequence length used during training was 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model.

Usage Considerations

This model is licensed under the MIT License, as per its training data. Users must also adhere to the original base model's license terms. The training data used for DPO is sourced from Hi-Satoh/test_dpo_dataset.

Good for

  • Applications requiring enhanced reasoning capabilities.
  • Generating structured and high-quality text responses.
  • Use cases where alignment with specific output preferences is crucial.