rkumagai/dpo-qwen-cot-merged
The rkumagai/dpo-qwen-cot-merged model is a 4 billion parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507. It utilizes Direct Preference Optimization (DPO) to enhance reasoning capabilities, specifically Chain-of-Thought (CoT), and improve structured response quality. This model is optimized for generating aligned and coherent outputs based on preferred response patterns, making it suitable for tasks requiring improved logical flow and structured answers.
Loading preview...
Model Overview
The rkumagai/dpo-qwen-cot-merged model is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has been fine-tuned using Direct Preference Optimization (DPO) via the Unsloth library, with its 16-bit weights fully merged, eliminating the need for adapter loading.
Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Optimized through DPO to improve the model's ability to generate logical, step-by-step reasoning processes.
- Improved Structured Responses: Training focused on aligning outputs with preferred formats, leading to higher quality and more structured answers.
- Direct Use: As a fully merged model, it can be used directly with the
transformerslibrary for inference without additional configuration.
Training Details
The model underwent 1 epoch of DPO training with a learning rate of 1e-07 and a beta value of 0.1. The training utilized the u-10bei/dpo-dataset-qwen-cot dataset, specifically designed for DPO with Chain-of-Thought preferences. The maximum sequence length during training was 1024 tokens.
Good For
- Applications requiring models with strong reasoning and Chain-of-Thought capabilities.
- Scenarios where structured and aligned responses are critical.
- Developers looking for a readily deployable, merged model for DPO-enhanced tasks.