dlstuharu/dpo-qwen-cot-merged_v2
Text generation · Concurrency cost: 1 · Model size: 4B · Quant: BF16 · Context length: 32k · Published: Feb 9, 2026 · License: apache-2.0 · Architecture: Transformer · Open weights

dlstuharu/dpo-qwen-cot-merged_v2 is a 4-billion-parameter Qwen3-based causal language model fine-tuned by dlstuharu. It uses Direct Preference Optimization (DPO) to strengthen Chain-of-Thought reasoning and structured response quality, making it suited to tasks that require coherent logical flow and well-organized answers.


Model Overview

This model, dlstuharu/dpo-qwen-cot-merged_v2, is a 4-billion-parameter language model based on the Qwen3-4B-Instruct-2507 architecture. It was fine-tuned with Direct Preference Optimization (DPO) via the Unsloth library to align its responses with preferred outputs.

Key Capabilities

  • Enhanced Reasoning (Chain-of-Thought): Optimized to improve the logical flow and step-by-step reasoning in its responses.
  • Improved Structured Response Quality: DPO training specifically targets better-structured and more coherent outputs.
  • Full-Merged 16-bit Weights: The repository provides the full-merged 16-bit weights, eliminating the need for adapter loading.

Training Details

The model underwent 1 epoch of DPO training with a learning rate of 1e-05 and a beta value of 0.1. It was trained with a maximum sequence length of 1024, utilizing a LoRA configuration (r=8, alpha=16) that has been merged into the base model. The training data used was u-10bei/dpo-dataset-qwen-cot.
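For intuition about what the beta value of 0.1 controls, here is a minimal sketch of the per-pair DPO loss in plain Python. The function name and the example log-probabilities are illustrative, not taken from the training code; the formula is the standard DPO objective, where beta scales the implicit reward margin between the policy and the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen or
    rejected response under the policy or the reference model.
    beta=0.1 matches the value reported for this model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) written stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# At parity with the reference model the loss is log(2); once the
# policy prefers the chosen response more strongly, the loss drops.
loss_at_parity = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

Training then minimizes this loss over the preference dataset, pushing the policy to widen its chosen-vs-rejected margin relative to the reference model without drifting far from it.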

Usage

Because the LoRA weights are merged into the base model, it can be loaded directly with the transformers library for inference, using the tokenizer's standard chat template.
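A minimal inference sketch with transformers is shown below. The prompt text and generation parameters are illustrative; `device_map="auto"` assumes the accelerate package is installed, and downloading the weights requires network access.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dlstuharu/dpo-qwen-cot-merged_v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",  # matches the BF16 merged weights
    device_map="auto",
)

# Standard chat-template usage; the prompt is just an example.
messages = [
    {"role": "user", "content": "Explain step by step why the sky is blue."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

No PEFT or adapter loading step is needed, since the repository ships the full-merged 16-bit weights.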