shotalab/Qwen3-4B-Instruct-SFT-03-Merged-DPO-01
shotalab/Qwen3-4B-Instruct-SFT-03-Merged-DPO-01 is a 4 billion parameter instruction-tuned language model, fine-tuned from shotalab/Qwen3-4B-Instruct-SFT-03 using Direct Preference Optimization (DPO) via Unsloth. This model is specifically optimized for improving reasoning capabilities, particularly Chain-of-Thought (CoT), and generating structured responses. It is designed for applications requiring enhanced logical processing and coherent, well-formatted outputs.
Loading preview...
Overview
This model, shotalab/Qwen3-4B-Instruct-SFT-03-Merged-DPO-01, is a 4 billion parameter language model derived from shotalab/Qwen3-4B-Instruct-SFT-03. It has undergone further fine-tuning using Direct Preference Optimization (DPO), leveraging the Unsloth library to enhance its performance.
Key Capabilities
- Improved Reasoning: Optimized to enhance Chain-of-Thought (CoT) reasoning, allowing for more logical and step-by-step problem-solving.
- Structured Response Generation: Specifically aligned to produce higher quality, more structured outputs based on preferred response patterns.
- Full-Merged Weights: Distributed as full-merged 16-bit weights, eliminating the need for adapter loading and simplifying deployment.
Training Details
The model was trained for 0.3 epochs with a learning rate of 3e-07 and a beta value of 0.4, using a maximum sequence length of 1024. The DPO training utilized the u-10bei/dpo-dataset-qwen-cot dataset, which focuses on improving reasoning and structured responses. The model is released under the MIT License, with users required to comply with the original base model's license terms.
Good For
- Applications requiring enhanced logical reasoning and problem-solving.
- Generating well-structured and coherent text outputs.
- Scenarios where direct preference alignment for response quality is crucial.