oretti/dpo-qwen-merged
The oretti/dpo-qwen-merged model is a 4-billion-parameter language model based on the Qwen3-4B-Instruct-2507 architecture. It has been fine-tuned with Direct Preference Optimization (DPO) to strengthen Chain-of-Thought (CoT) reasoning and improve the quality of structured responses. The model is optimized for generating aligned, coherent outputs, making it suitable for tasks that require sound reasoning and structured text generation.
Model Overview
The oretti/dpo-qwen-merged model is a 4-billion-parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and its full 16-bit weights have been merged, so the model can be used directly without loading adapters.
Key Capabilities
- Enhanced Reasoning: Optimized through DPO to improve Chain-of-Thought (CoT) reasoning, enabling more structured and logical responses.
- Improved Output Quality: Focuses on aligning responses with preferred outputs, leading to higher quality and more coherent text generation.
- Direct Integration: Provided as a fully merged model, simplifying deployment with transformers, since no LoRA adapter loading is required.
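Because the DPO weights are already merged, the model loads like any standard causal LM. Below is a minimal usage sketch with the standard transformers chat API; the prompt text and generation settings are illustrative, not prescribed by the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "oretti/dpo-qwen-merged"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # load in the merged 16-bit precision
    device_map="auto",
)

# Plain chat-template prompting; no PeftModel / adapter step is needed
# because the DPO weights are merged into the base model.
messages = [
    {"role": "user", "content": "A train leaves at 3 pm travelling 60 km/h. "
                                "How far has it gone by 5:30 pm? Think step by step."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
response = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
)
print(response)
```

The slice `output_ids[0][input_ids.shape[-1]:]` strips the prompt tokens so only the newly generated answer is decoded.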
Training Details
The model was trained for 1 epoch with a learning rate of 1e-07 and a beta value of 0.3, using a maximum sequence length of 1024. The training utilized the u-10bei/dpo-dataset-qwen-cot preference dataset, which is designed to improve reasoning and structured response generation.
Ideal Use Cases
This model is particularly well-suited for applications requiring:
- Complex Reasoning Tasks: Where structured and logical thought processes are beneficial.
- High-Quality Text Generation: For scenarios demanding aligned and coherent outputs.
- Instruction Following: Benefiting from the DPO fine-tuning for better adherence to preferred response styles.