Hi-Satoh/adv_sft5_dpo3_merged
The Hi-Satoh/adv_sft5_dpo3_merged model is a 4 billion parameter language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). Developed by Hi-Satoh, this model is specifically optimized to improve reasoning capabilities, particularly Chain-of-Thought, and enhance structured response quality. It is designed for applications requiring aligned and high-quality text generation based on preferred outputs.
Loading preview...
Model Overview
Hi-Satoh/adv_sft5_dpo3_merged is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. This model leverages Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs. It is provided as full-merged 16-bit weights, eliminating the need for adapter loading.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
- Structured Response Quality: Focuses on generating higher quality and more structured outputs.
- DPO Alignment: Benefits from DPO training to align model behavior with specific preferences.
Training Details
The model was trained for 2 epochs with a learning rate of 1e-06 and a beta value of 0.05. The maximum sequence length used during training was 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model. The training data used was [Hi-Satoh/test_dpo_dataset].
Usage Considerations
This model is suitable for tasks where improved reasoning and structured, aligned responses are critical. Users should adhere to the MIT License, as per the dataset terms, and comply with the original base model's license terms.