Hi-Satoh/adv_MoE_sft3_dpo_merged
Hi-Satoh/adv_MoE_sft3_dpo_merged is a 4 billion parameter language model developed by Hi-Satoh, fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It utilizes Direct Preference Optimization (DPO) via the Unsloth library to enhance reasoning capabilities, specifically Chain-of-Thought, and improve the quality of structured responses. This model is provided with full-merged 16-bit weights, eliminating the need for adapter loading, and is optimized for tasks requiring aligned and coherent output based on preference data.
Loading preview...
Overview
Hi-Satoh/adv_MoE_sft3_dpo_merged is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. This model distinguishes itself through its application of Direct Preference Optimization (DPO), implemented using the Unsloth library, to align its responses with preferred outputs. It is provided with full-merged 16-bit weights, meaning no adapter loading is required for deployment.
Key Capabilities
- Enhanced Reasoning: Optimized to improve Chain-of-Thought reasoning processes.
- Structured Response Quality: Focuses on delivering higher quality, more structured outputs.
- Preference Alignment: Fine-tuned using DPO to align model behavior with specific preference datasets.
Training Details
The model underwent 4 epochs of DPO training with a learning rate of 1e-05 and a beta value of 0.2. The maximum sequence length used during training was 4096 tokens. The LoRA configuration (r=8, alpha=16) was merged into the base model, resulting in the provided full-merged weights.
Good for
- Applications requiring models with improved reasoning capabilities.
- Tasks where structured and coherent responses are critical.
- Scenarios benefiting from a model fine-tuned for specific output preferences.