Hi-Satoh/adv_sft_dpo_w_merged
Hi-Satoh/adv_sft_dpo_w_merged is a 4 billion parameter causal language model developed by Hi-Satoh, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to improve reasoning capabilities and structured response quality. It provides full-merged 16-bit weights, making it suitable for applications requiring enhanced logical coherence and structured output.
Loading preview...
Model Overview
Hi-Satoh/adv_sft_dpo_w_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its full-merged 16-bit weights provided for direct use without adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought reasoning.
- Structured Response Quality: Focuses on generating higher quality, structured outputs based on preference datasets.
- Efficient Fine-tuning: Utilizes DPO with a specific configuration (1 epoch, 5e-07 learning rate, beta 0.5, max sequence length 4096) to achieve its alignment goals.
Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Methodology: Direct Preference Optimization (DPO)
- Training Data: Utilized the Hi-Satoh/test_dpo_dataset for preference alignment.
Usage Considerations
This model is designed for tasks where improved reasoning and structured, preferred responses are critical. Users should be aware that the model's license follows the MIT License, and compliance with the original base model's license terms is required.