HillaryMori/qwen3-sft-dpo-combined_exp1 is a fine-tuned version of the Qwen/Qwen3-4B-Instruct-2507 model, optimized using Direct Preference Optimization (DPO) via Unsloth. The fine-tuning targets stronger Chain-of-Thought reasoning and higher-quality structured responses. The model ships as fully merged 16-bit weights, so no adapter loading is required, and it is an experimental result from an LLM fine-tuning competition.
Overview
HillaryMori/qwen3-sft-dpo-combined_exp1 is an experimental fine-tuned language model based on Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) using the Unsloth library. The model is distributed as fully merged 16-bit weights, which simplifies deployment by eliminating the need for adapter loading.
Key Capabilities
- Improved Reasoning: Optimized to enhance Chain-of-Thought reasoning abilities.
- Structured Responses: Focuses on generating higher-quality structured outputs.
- DPO Fine-tuning: Utilizes Direct Preference Optimization for alignment with preferred response patterns.
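Because the weights are fully merged, the model can be loaded like any standard causal LM. Below is a minimal inference sketch using the Hugging Face `transformers` library; the generation settings and example prompt are illustrative, not recommendations from this card.

```python
# Minimal inference sketch; assumes `transformers` and `torch` are installed.
# The repo ships fully merged 16-bit weights, so no adapter loading is needed.

MODEL_ID = "HillaryMori/qwen3-sft-dpo-combined_exp1"

def build_messages(question: str) -> list:
    # Chat-style input; the tokenizer's chat template handles the formatting.
    return [{"role": "user", "content": question}]

def generate(question: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so the helper above is usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example:
# print(generate("Solve step by step: what is 17 * 24?"))
```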
Training Details
The model was trained for 0.5 epochs with a learning rate of 1e-7, a DPO beta of 0.5, and a maximum sequence length of 1024. The LoRA configuration (r=8, alpha=16) was merged into the base model after training. The DPO preference data came from the u-10bei/dpo-dataset-qwen-cot dataset.
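For context, the DPO objective that the beta hyperparameter above feeds into can be sketched in plain Python. The log-probability values in the example are made up for illustration; only beta = 0.5 comes from the training setup described here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin))).

    Each margin is the policy's log-probability of a response minus the
    reference model's log-probability of the same response; beta scales
    how strongly the policy is pushed away from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Illustrative numbers: the policy favors the chosen response more than the
# reference does (margin +1.0) and favors the rejected one less (margin -0.5),
# so the loss is small; if those preferences weaken, the loss grows.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5)
```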
Usage Considerations
This repository contains experimental results from an LLM fine-tuning competition and should be treated accordingly. The model is released under the MIT License, and users must also comply with the license terms of the base model, Qwen/Qwen3-4B-Instruct-2507.