Hi-Satoh/adv_MoE_ALF_sft3_merged
Hi-Satoh/adv_MoE_ALF_sft3_merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. Fine-tuned with Direct Preference Optimization (DPO) via Unsloth, it is optimized to strengthen reasoning, particularly Chain-of-Thought, and to improve the quality of structured responses. It is designed for applications requiring outputs aligned with preferred response patterns.
Overview
This model, Hi-Satoh/adv_MoE_ALF_sft3_merged, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was fine-tuned using Direct Preference Optimization (DPO), implemented with the Unsloth library, to align its outputs with preferred response patterns.
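As a rough illustration of what DPO optimizes (a minimal, framework-free sketch, not the actual Unsloth training code used for this model), the per-preference-pair loss pushes the policy to assign a larger log-probability margin to the chosen response than to the rejected one, relative to a frozen reference model:

```python
import math

BETA = 0.05  # the beta value reported for this model's training


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = BETA) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -math.log(sigmoid(chosen_reward - rejected_reward))


# When the policy matches the reference exactly, both margins are zero
# and the loss is -log(0.5) = ln 2:
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ≈ 0.6931
```

The small beta (0.05) keeps the implied reward scale gentle, so the fine-tuned policy stays close to the reference model while still learning the preference ordering.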
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, making it suitable for tasks requiring logical progression and structured thinking.
- Improved Response Quality: Focuses on generating higher-quality, more aligned structured responses based on preference datasets.
- Full-Merged Weights: Provided as full-merged 16-bit weights, eliminating the need for adapter loading.
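Because the weights are fully merged, the model can be loaded through the standard Hugging Face `transformers` path with no PEFT/adapter step. A minimal sketch (assumes `transformers` and `torch` are installed; the generation settings are illustrative, not prescribed by this card):

```python
# Loading the merged 16-bit weights directly -- no adapter attach step needed.
MODEL_ID = "Hi-Satoh/adv_MoE_ALF_sft3_merged"


def load_model(model_id: str = MODEL_ID):
    """Return (tokenizer, model) for the merged checkpoint."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # merged weights ship in 16-bit precision
        device_map="auto",
    )
    return tokenizer, model


# Usage (downloads the checkpoint on first call):
#   tokenizer, model = load_model()
#   inputs = tokenizer.apply_chat_template(
#       [{"role": "user", "content": "Explain chain-of-thought reasoning."}],
#       add_generation_prompt=True, return_tensors="pt").to(model.device)
#   print(tokenizer.decode(model.generate(inputs, max_new_tokens=256)[0]))
```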
Training Details
The model was trained for 2 epochs with a learning rate of 1e-6 and a DPO beta of 0.05, using a maximum sequence length of 4096 tokens. The LoRA adapter (r=8, alpha=16) was merged into the base model, so the released checkpoint is a standalone set of full weights.
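The merge step folds the low-rank update into the frozen base weights. A toy NumPy sketch of that arithmetic (the matrix shapes and values here are illustrative, not taken from the model) with the card's r=8, alpha=16, giving a scaling factor of alpha/r = 2:

```python
import numpy as np

r, alpha = 8, 16        # LoRA rank and alpha from the training details
scaling = alpha / r     # = 2.0, applied to the low-rank update

rng = np.random.default_rng(0)
d_out, d_in = 6, 4                          # toy dimensions for one weight matrix
W0 = rng.standard_normal((d_out, d_in))     # frozen base weight
A = rng.standard_normal((r, d_in))          # LoRA down-projection
B = np.zeros((d_out, r))                    # LoRA up-projection (zero-initialized)
B[:, 0] = 0.1                               # pretend training nudged B slightly

# Merging folds the adapter into the base weight, so inference needs
# only W_merged and no separate adapter files:
W_merged = W0 + scaling * (B @ A)
```

After this merge, the forward pass `W_merged @ x` is identical to the adapter-attached pass `W0 @ x + scaling * (B @ (A @ x))`, which is why the released checkpoint needs no adapter loading.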
Good For
- Applications requiring models with improved reasoning abilities.
- Scenarios where structured and aligned responses are critical.
- Developers looking for a Qwen3-4B variant with DPO-enhanced performance.