Model Overview

This model, dpo-qwen-cot-merged, is a 4 billion parameter language model based on Qwen/Qwen3-4B-Instruct-2507. It has been fine-tuned using Direct Preference Optimization (DPO) with the Unsloth library to enhance its performance.

Key Capabilities & Optimization

The primary objective of this DPO fine-tuning was to align the model's responses with preferred outputs, specifically focusing on:

Improving reasoning abilities through Chain-of-Thought (CoT) processes.
Enhancing the quality of structured responses based on a preference dataset.

Training Details

Base Model: Qwen/Qwen3-4B-Instruct-2507
Method: Direct Preference Optimization (DPO)
Epochs: 1
Learning Rate: 1e-07
Max Sequence Length: 1024
The model provides full-merged 16-bit weights, eliminating the need for adapter loading.

Usage

This model can be directly integrated and used with the transformers library, similar to other merged models. It is licensed under the MIT License, with compliance required for the original base model's license terms.

Overview

Model Overview

Key Capabilities & Optimization

Training Details

Usage

Full Model Card (README)