## Model Overview
This model, `dpo-qwen-cot-merged`, is a 4-billion-parameter variant of the Qwen3 architecture, fine-tuned from `Qwen/Qwen3-4B-Instruct-2507`. It was aligned with Direct Preference Optimization (DPO) via the Unsloth library to steer its responses toward preferred outputs.
## Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Fine-tuned to produce clearer step-by-step reasoning traces.
- Improved Structured Responses: Focuses on producing higher quality and more coherent structured outputs.
- DPO Fine-tuning: Utilizes DPO with a preference dataset to guide response generation towards desired characteristics.
- Merged Weights: Contains full 16-bit merged weights, eliminating the need for adapter loading and simplifying deployment with `transformers`.
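Because the LoRA weights are already merged, the model can be loaded directly with `transformers` without any PEFT adapter step. A minimal sketch follows; the repository id `u-10bei/dpo-qwen-cot-merged` is an assumption inferred from the model and dataset names, so adjust it if the actual repo differs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id is assumed from the model name; change if it differs.
model_id = "u-10bei/dpo-qwen-cot-merged"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",  # weights are stored as 16-bit merged weights
    device_map="auto",
)

# Chat-style prompting via the tokenizer's built-in chat template.
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

No adapter merge or `peft` dependency is needed at inference time, since the LoRA weights were folded into the base model before release.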
## Training Details
The model was trained for 1 epoch of DPO with a learning rate of 1e-7, a beta of 0.1, and a maximum sequence length of 1024 tokens. A LoRA adapter (r=8, alpha=16) was trained and then merged into the base model. The DPO preference data was sourced from [u-10bei/dpo-dataset-qwen-cot].
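For reference, the DPO objective behind this training can be sketched in plain Python. The beta of 0.1 matches the value above; the log-probability inputs in the usage lines are illustrative only, not values from the actual run.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward margins: how far the policy has shifted probability
    # mass toward each response relative to the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Loss = -log(sigmoid(beta * (chosen_margin - rejected_margin)))
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# At initialization the policy equals the reference, so both margins are
# zero and the loss is log(2). As the policy learns to prefer the chosen
# response, the loss decreases.
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
loss_aligned = dpo_loss(-8.0, -14.0, -10.0, -12.0)  # policy prefers chosen
```

The small beta (0.1) keeps the implicit KL penalty gentle, so the fine-tuned policy stays close to the reference model while still learning the preference ordering.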
## Licensing
This model is released under the MIT License, consistent with the terms of its training dataset. Users must also comply with the license terms of the original base model, `Qwen3-4B-Instruct-2507`.