Hi-Satoh/adv_sft_dpo_final_12_merged
Hi-Satoh/adv_sft_dpo_final_12_merged is a 4 billion parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model, developed by Hi-Satoh, is optimized to improve reasoning (Chain-of-Thought) and structured response quality. It provides full-merged 16-bit weights and is suitable for applications requiring aligned, high-quality text generation.
Loading preview...
Overview
Hi-Satoh/adv_sft_dpo_final_12_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, specifically targeting improvements in response alignment and quality.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning abilities.
- Structured Response Quality: Focuses on generating more structured and preferred outputs.
- DPO Fine-tuning: Leverages DPO with a beta of 0.1 and a learning rate of 2e-07 over 1 epoch.
- Full-Merged Weights: This repository provides the full-merged 16-bit weights, eliminating the need for adapter loading.
Training Details
The model was trained with a maximum sequence length of 4096 tokens. The LoRA configuration used during training (r=8, alpha=16) has been merged into the base model. The training data utilized is [Hi-Satoh/test_dpo_dataset].
Licensing
This model is released under the MIT License, consistent with the terms of its training dataset. Users must also adhere to the original base model's license terms.