Model Overview
Hi-Satoh/adv_sft_dpo_final_10_merged is a 4-billion-parameter language model developed by Hi-Satoh. It is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 base model, trained with Direct Preference Optimization (DPO) using the Unsloth library. The repository ships fully merged 16-bit weights, so no adapter loading is required.
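Because the weights are fully merged, the model loads directly with the standard transformers API, with no PEFT step. A minimal sketch (the model ID comes from this card; the prompt and generation settings are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Hi-Satoh/adv_sft_dpo_final_10_merged"

# Merged 16-bit weights: no adapter/PEFT loading is needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain step by step why the sky is blue."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# max_new_tokens is illustrative; training used a 4096-token context.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```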
Key Capabilities
- Improved Reasoning: Tuned to strengthen chain-of-thought reasoning.
- Structured Response Quality: Encouraged toward higher-quality, better-structured outputs.
- Preference Alignment: Aligned with preferred responses from a dedicated preference dataset.
Training Details
The model was trained for 1 epoch with a learning rate of 7e-07 and a DPO beta of 0.1. The maximum sequence length during training was 4096 tokens. A LoRA adapter (r=8, alpha=16) was trained and then merged into the base model.
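For reference, the DPO objective with the beta used here (0.1) can be sketched in plain Python. The log-probability arguments below are hypothetical placeholders; in real training (e.g. via Unsloth/TRL) they are computed from the policy and a frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is a summed log-probability of the chosen/rejected
    response under the policy (pi_*) or the reference (ref_*) model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written stably as log1p(exp(-x)).
    return math.log1p(math.exp(-beta * margin))

# At the start of training the policy equals the reference, the margin
# is 0, and the loss is log(2) ≈ 0.6931.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # → 0.6931
```

A small beta like 0.1 makes the loss less sensitive to the margin, keeping the policy close to the reference model during preference tuning.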
Usage Considerations
This model is released under the MIT License, following the license of its training data. Users must also comply with the license terms of the original base model, Qwen/Qwen3-4B-Instruct-2507. The DPO training data is sourced from Hi-Satoh/test_dpo_dataset.
Good for
- Applications requiring enhanced reasoning capabilities.
- Generating structured and high-quality text responses.
- Use cases where alignment with specific output preferences is crucial.