MasatoNishimura/dpo-qwen-cot-merged
MasatoNishimura/dpo-qwen-cot-merged is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using Direct Preference Optimization (DPO) via Unsloth. The model is optimized to strengthen reasoning, particularly Chain-of-Thought (CoT), and to improve the quality of structured responses. It supports a 32,768-token context length and is intended for applications that require strong logical coherence and well-structured output.
Model Overview
This model, dpo-qwen-cot-merged, is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) using the Unsloth library, specifically targeting improvements in reasoning and structured response generation.
Key Capabilities
- Enhanced Reasoning (Chain-of-Thought): Optimized to produce more coherent and logical reasoning steps in its outputs.
- Improved Structured Responses: Fine-tuned to generate higher quality, well-organized structured answers based on preference datasets.
- Full-Merged Weights: The repository provides fully merged 16-bit weights, eliminating the need for adapter loading (see the quick check after this list).
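Because the weights are already merged, the checkpoint loads like any ordinary causal LM, and the advertised context window can be verified from the model config. A minimal sketch, using only standard transformers API (the expected value reflects the 32,768-token context length stated above):

```python
# Quick check that the merged weights load directly (no PEFT adapter step)
# and that the advertised context window is present in the config.
from transformers import AutoConfig, AutoModelForCausalLM

repo_id = "MasatoNishimura/dpo-qwen-cot-merged"

config = AutoConfig.from_pretrained(repo_id)
print(config.max_position_embeddings)  # expected: 32768, per this card

# Merged 16-bit weights load like any ordinary causal LM checkpoint:
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto")
```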
Training Details
- Methodology: DPO (Direct Preference Optimization), applied for 1 epoch.
- Configuration: Training used a learning rate of 1e-07, a DPO beta of 0.1, and a maximum sequence length of 1024.
- Base Model: Qwen/Qwen3-4B-Instruct-2507.
- Training Data: The model was trained on the u-10bei/dpo-dataset-qwen-cot dataset (a minimal reproduction sketch follows this list).
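For orientation, the sketch below wires the hyperparameters listed above into Unsloth and TRL's DPOTrainer. It is an illustrative approximation rather than the authors' exact script: the LoRA settings, batch size, and the assumption that u-10bei/dpo-dataset-qwen-cot uses standard prompt/chosen/rejected columns are all guesses, and a recent TRL version is assumed.

```python
# Minimal DPO training sketch with Unsloth + TRL. Hyperparameters
# (lr=1e-07, beta=0.1, max_length=1024, 1 epoch) come from this card;
# everything else (LoRA rank, target modules, batch size) is a placeholder.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",
    max_seq_length=1024,
)
# Assumed LoRA setup; the card does not document the adapter config.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

args = DPOConfig(
    output_dir="dpo-qwen-cot",
    num_train_epochs=1,
    learning_rate=1e-07,
    beta=0.1,
    max_length=1024,
    per_device_train_batch_size=2,  # assumed
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

# Merge the LoRA adapter into the base weights and save in 16-bit,
# which is how a merged checkpoint like this repository's is produced.
model.save_pretrained_merged("dpo-qwen-cot-merged", tokenizer,
                             save_method="merged_16bit")
```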
Usage
Because the weights are merged, this model can be used directly with the transformers library for inference; no adapter loading step is required. Users should adhere to the MIT License of the training data and to the original base model's license terms.
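A minimal inference sketch with transformers follows. The chat-template call is the standard pattern for Qwen instruct models; the prompt and generation settings are illustrative, not recommendations from this card.

```python
# Minimal inference sketch using the merged weights with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "MasatoNishimura/dpo-qwen-cot-merged"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)

# Illustrative reasoning prompt to exercise the CoT behavior.
messages = [
    {"role": "user", "content": "If a train travels 60 km in 45 minutes, "
                                "what is its average speed in km/h? "
                                "Think step by step."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```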