bam2app/dpo-qwen-cot-merged_v1

Hugging Face
Text Generation · Concurrency Cost: 1 · Model Size: 4B · Quant: BF16 · Ctx Length: 32k · Published: Mar 1, 2026 · License: apache-2.0 · Architecture: Transformer · Open Weights

The bam2app/dpo-qwen-cot-merged_v1 model is a 4 billion parameter language model based on the Qwen3-4B-Instruct-2507 architecture. It has been fine-tuned using Direct Preference Optimization (DPO) to enhance reasoning capabilities, specifically Chain-of-Thought (CoT), and improve structured response quality. This model is optimized for generating aligned and coherent outputs, making it suitable for tasks requiring improved logical progression and structured answers.


Model Overview

This model, bam2app/dpo-qwen-cot-merged_v1, is a 4 billion parameter language model derived from the Qwen/Qwen3-4B-Instruct-2507 base model. It has undergone Direct Preference Optimization (DPO) using the Unsloth library; the resulting weights are fully merged at 16-bit (BF16) precision, so no separate adapter loading is required.
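Because the DPO weights are fully merged, the checkpoint can be loaded directly with the `transformers` library, with no PEFT/LoRA adapter step. A minimal sketch (assumes `transformers` and `torch` are installed and enough memory is available for the 4B BF16 weights; the prompt and generation settings are illustrative, not from this repository):

```python
# Sketch: load the merged checkpoint directly; no adapter loading needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bam2app/dpo-qwen-cot-merged_v1"

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # weights are published in BF16
        device_map="auto",
    )
    # Qwen3 instruct models ship a chat template; apply it so the
    # model sees the same format it was aligned on.
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("Explain step by step why 17 is prime."))
```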

Key Capabilities

  • Enhanced Reasoning (Chain-of-Thought): Optimized to produce more logical and step-by-step reasoning in its responses.
  • Improved Structured Output: Fine-tuned to generate higher quality and more structured answers based on preference datasets.
  • DPO Alignment: Benefits from DPO training to align its outputs with preferred response styles.
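The DPO alignment mentioned above optimizes the standard DPO pairwise objective. A plain-Python sketch of that loss for a single preference pair (this is the generic DPO formula, not code from this repository; the log-probability values in the example are illustrative):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) preference pair.

    beta (0.1, matching this model's training configuration) scales how
    strongly the policy is penalized for drifting from the reference model.
    """
    # Implicit rewards: log-prob gain of each response over the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(margin)): smaller as the chosen response becomes more
    # strongly preferred over the rejected one, relative to the reference.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, the margin is 0 and the
# loss equals log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # → 0.6931
```

Training lowers this loss by widening the margin between chosen and rejected responses, which is what nudges the model toward the preferred (structured, step-by-step) response style.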

Training Details

The model was trained for 1 epoch with a learning rate of 5e-06, a DPO beta value of 0.1, and a maximum sequence length of 1024 tokens. The preference data used for DPO was the u-10bei/dpo-dataset-qwen-cot dataset. The model is released under the MIT License; users must also comply with the original base model's license terms.
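The hyperparameters above map naturally onto a trl-style DPO configuration (Unsloth's DPO training wraps trl's `DPOTrainer`). A hedged configuration sketch only; exact argument names can vary across trl versions, the output path is illustrative, and model/dataset loading is abbreviated:

```python
# Configuration sketch mirroring the reported hyperparameters.
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-qwen-cot",  # illustrative path
    num_train_epochs=1,         # trained for 1 epoch
    learning_rate=5e-6,
    beta=0.1,                   # DPO KL-penalty strength
    max_length=1024,            # maximum sequence length in tokens
)
# DPOTrainer(model=..., args=config,
#            train_dataset=load_dataset("u-10bei/dpo-dataset-qwen-cot"), ...)
```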

Good For

  • Applications requiring improved logical reasoning and Chain-of-Thought capabilities.
  • Scenarios where structured and aligned responses are critical.
  • Developers seeking a DPO-optimized Qwen3-4B variant for enhanced output quality.