nyannto/dpo-qwen-cot-merged13
The nyannto/dpo-qwen-cot-merged13 model is a 4 billion parameter Qwen3-4B-Instruct-2507 variant, fine-tuned using Direct Preference Optimization (DPO) to enhance reasoning capabilities and structured response quality. It leverages a 32768 token context length and is specifically optimized for Chain-of-Thought (CoT) reasoning. This model is designed to provide aligned and coherent outputs for complex prompts, making it suitable for tasks requiring improved logical flow and structured answers.
Loading preview...
Model Overview
nyannto/dpo-qwen-cot-merged13 is a 4 billion parameter language model, fine-tuned from the Qwen/Qwen3-4B-Instruct-2507 base model. It utilizes Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs, focusing on enhancing reasoning and structured response quality.
Key Capabilities
- Improved Reasoning (Chain-of-Thought): Optimized to generate more logical and step-by-step reasoning processes.
- Enhanced Structured Responses: Aligned to produce higher quality, well-organized outputs based on preference data.
- DPO Fine-tuning: Benefits from DPO for better alignment with human preferences.
- Full-merged 16-bit weights: Ready for direct use with
transformerswithout requiring adapter loading.
Training Details
The model was trained for 1 epoch with a learning rate of 2e-05 and a maximum sequence length of 768, using the u-10bei/dpo-dataset-qwen-cot dataset. The LoRA configuration (r=8, alpha=16) was merged into the base model.
Usage Considerations
This model is suitable for applications where coherent reasoning and structured, aligned outputs are critical. Users should adhere to the MIT License of the training data and the original base model's license terms.