Hi-Satoh/adv_sft_dpo_final_9_merged
Hi-Satoh/adv_sft_dpo_final_9_merged is a 4 billion parameter causal language model developed by Hi-Satoh, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to improve reasoning capabilities, particularly Chain-of-Thought, and enhance structured response quality. It is designed for applications requiring aligned and coherent outputs based on preferred data.
Loading preview...
Model Overview
Hi-Satoh/adv_sft_dpo_final_9_merged is a 4 billion parameter language model, fine-tuned by Hi-Satoh from the Qwen/Qwen3-4B-Instruct-2507 base model. This model leverages Direct Preference Optimization (DPO), implemented via the Unsloth library, to align its responses with preferred outputs.
Key Optimizations
The primary objective of this DPO fine-tuning was to enhance two critical areas:
- Reasoning (Chain-of-Thought): The model has been optimized to produce more coherent and logical step-by-step reasoning processes.
- Structured Response Quality: It aims to generate higher quality, well-organized responses, particularly when structured outputs are desired.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. The maximum sequence length used during training was 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model, providing full-merged 16-bit weights without requiring adapter loading.
Licensing
This model is released under the MIT License, consistent with the terms of its training dataset. Users must also adhere to the original base model's license terms.