ryoto0175/dpo-qwen-cot-merged-v01 is a 4-billion-parameter Qwen3-based causal language model, fine-tuned with Direct Preference Optimization (DPO) using the Unsloth library. The model is optimized for stronger reasoning, particularly Chain-of-Thought (CoT), and for generating structured responses. It is designed for tasks requiring enhanced logical progression and coherent, preference-aligned output.
Model Overview
This model, ryoto0175/dpo-qwen-cot-merged-v01, is a 4-billion-parameter language model built on the Qwen/Qwen3-4B-Instruct-2507 base architecture. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library, and the LoRA weights were merged into the base model at 16-bit precision, so no separate adapter loading is required.
Key Capabilities & Optimization
The primary objective of this model's training was to enhance its ability to generate preferred outputs, with a specific focus on:
- Improved Reasoning: Optimized for Chain-of-Thought (CoT) reasoning, enabling more logical and step-by-step problem-solving.
- Structured Response Quality: Aligned to produce higher quality, structured responses based on the provided preference dataset.
Training Details
- Methodology: Direct Preference Optimization (DPO).
- Dataset: u-10bei/dpo-dataset-qwen-cot for preference alignment.
- Configuration: Trained for 1 epoch with a learning rate of 1e-07, a beta value of 0.1, and a maximum sequence length of 1024 tokens.
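To make the beta = 0.1 setting above concrete, here is a minimal sketch of the per-pair DPO objective in plain Python. The log-probability values are hypothetical, purely for illustration; in practice the Unsloth/TRL trainer computes these from the policy and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy (pi_*) and the frozen
    reference model (ref_*). beta=0.1 matches this card's config.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen over the rejected response, relative to the reference.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to the reference: margin is 0, loss is log(2).
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: loss drops.
loss_later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

A smaller beta, such as the 0.1 used here, flattens the sigmoid and keeps the policy closer to the reference model, which is a common choice when the goal is a gentle preference-alignment pass rather than an aggressive one.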
Usage & Licensing
As a fully merged model, it can be used directly with the transformers library without loading any adapters. The model is released under the MIT License, consistent with its training data, and users must also adhere to the original base model's license terms.
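A minimal loading sketch with transformers follows. The model ID comes from this card; the dtype, device placement, and generation parameters are illustrative assumptions, not prescribed settings.

```python
# Sketch: load the merged model and generate from a chat-formatted prompt.
# Qwen3 instruct models expect the tokenizer's chat template to be applied.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ryoto0175/dpo-qwen-cot-merged-v01"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # illustrative; use float16/float32 as needed
        device_map="auto",
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example call (downloads the full 16-bit weights on first use):
# print(generate("Solve step by step: what is 17 * 24?"))
```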