Hi-Satoh/adv_sft_dpo_final_7_merged
Hi-Satoh/adv_sft_dpo_final_7_merged is a 4 billion parameter causal language model developed by Hi-Satoh, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO). This model is specifically optimized to improve reasoning capabilities, particularly Chain-of-Thought, and enhance structured response quality. It is designed for tasks requiring aligned and coherent outputs based on preference datasets.
Loading preview...
Model Overview
Hi-Satoh/adv_sft_dpo_final_7_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged into the base model.
Key Optimizations
This model's primary objective during training was to enhance its ability to generate reasoned responses (Chain-of-Thought) and produce high-quality structured outputs. This was achieved by aligning the model's behavior with preferred examples through DPO, utilizing a specific preference dataset.
Training Details
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Methodology: Direct Preference Optimization (DPO)
- Epochs: 1
- Learning Rate: 1e-06
- Beta: 0.1
- Maximum Sequence Length: 4096 tokens
- LoRA Configuration: r=8, alpha=16 (merged)
Intended Use Cases
This model is particularly well-suited for applications where improved reasoning, coherent thought processes, and structured output generation are critical. Its DPO-based fine-tuning aims to provide more aligned and preferred responses compared to its base model, making it valuable for tasks requiring nuanced and well-organized text generation.