Hi-Satoh/adv_sft_dpo_final_14_merged
Hi-Satoh/adv_sft_dpo_final_14_merged is a 4 billion parameter Qwen3-based causal language model developed by Hi-Satoh. This model has been fine-tuned using Direct Preference Optimization (DPO) to enhance reasoning capabilities and structured response quality. It is specifically optimized for generating aligned outputs based on preferred data, making it suitable for tasks requiring improved logical coherence and format adherence.
Loading preview...
Model Overview
Hi-Satoh/adv_sft_dpo_final_14_merged is a 4 billion parameter language model built upon the Qwen/Qwen3-4B-Instruct-2507 base. It has undergone Direct Preference Optimization (DPO) using the Unsloth library, resulting in a merged 16-bit weights model that requires no adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning.
- Structured Response Quality: Focuses on generating higher quality, structured outputs.
- Preference Alignment: Aligned with preferred outputs through DPO training.
Training Details
The model was trained for 1 epoch with a learning rate of 4e-07 and a beta value of 0.1. The maximum sequence length used during training was 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model.
Intended Use Cases
This model is particularly well-suited for applications where improved logical reasoning and adherence to specific output structures are critical. Its DPO-based fine-tuning makes it a strong candidate for tasks requiring more aligned and coherent responses.