Hi-Satoh/adv_sft_dpo_final_8_merged
Hi-Satoh/adv_sft_dpo_final_8_merged is a 4 billion parameter causal language model, fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. This model is specifically optimized to improve reasoning capabilities (Chain-of-Thought) and structured response quality. It excels in generating aligned responses based on preferred outputs, making it suitable for tasks requiring high-quality, structured text generation.
Loading preview...
Model Overview
Hi-Satoh/adv_sft_dpo_final_8_merged is a 4 billion parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library, integrating the full-merged 16-bit weights directly, eliminating the need for adapter loading.
Key Optimizations
This model's primary objective was to enhance its ability to produce preferred outputs, specifically focusing on:
- Improved Reasoning: Optimized for Chain-of-Thought (CoT) capabilities.
- Structured Response Quality: Enhanced generation of well-structured and aligned text based on preference datasets.
Training Details
The DPO training involved:
- Base Model: Qwen/Qwen3-4B-Instruct-2507
- Method: Direct Preference Optimization (DPO)
- Epochs: 1
- Learning Rate: 5e-07
- Beta: 0.1
- Max Sequence Length: 4096
- LoRA Configuration: r=8, alpha=16 (weights merged into the base model)
Usage and Licensing
The model can be loaded using the transformers library with torch.float16 for efficient inference. It was trained on the Hi-Satoh/test_dpo_dataset and is released under the MIT License, with users also required to comply with the original base model's license terms.