bam2app/dpo-qwen-cot-merged_v3

  • Task: Text generation
  • Model size: 4B
  • Quantization: BF16
  • Context length: 32k
  • Concurrency cost: 1
  • Published: Mar 1, 2026
  • License: apache-2.0
  • Architecture: Transformer (open weights)

bam2app/dpo-qwen-cot-merged_v3 is a 4-billion-parameter language model fine-tuned from Qwen/Qwen3-4B-Instruct-2507 with Direct Preference Optimization (DPO). It is optimized to strengthen Chain-of-Thought (CoT) reasoning and structured response quality, and is intended for tasks that demand logical coherence and adherence to preferred output formats.


Model Overview

bam2app/dpo-qwen-cot-merged_v3 is a 4-billion-parameter language model derived from Qwen/Qwen3-4B-Instruct-2507. It was trained with Direct Preference Optimization (DPO) via the Unsloth library, and the resulting weights were merged into a single 16-bit checkpoint, so no adapter loading is required.
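Because the DPO weights are merged into the base model, it can be loaded like any standalone checkpoint. A minimal sketch, assuming the standard Hugging Face transformers API (the model id comes from this card; the dtype and `device_map` settings are illustrative, not prescribed by the card):

```python
def load_merged_model(model_id: str = "bam2app/dpo-qwen-cot-merged_v3"):
    """Load the merged 16-bit checkpoint directly; no PEFT adapter step needed."""
    # Imports kept local so the sketch can be read without transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # matches the card's BF16 quantization
        device_map="auto",
    )
    return tokenizer, model
```

Since the weights are already merged, there is no `PeftModel.from_pretrained` step, which also keeps inference compatible with serving stacks that expect a plain checkpoint.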

Key Capabilities

  • Enhanced Reasoning: Optimized to improve Chain-of-Thought (CoT) reasoning, leading to more logical and coherent outputs.
  • Structured Response Quality: Fine-tuned to align responses with preferred outputs, enhancing the quality and structure of generated text.
  • DPO Alignment: Utilizes DPO to align model behavior with human preferences, focusing on specific response characteristics.
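In practice, prompt formatting is best left to the tokenizer's `apply_chat_template`, but the layout it produces can be sketched by hand. A hypothetical rendering of the ChatML-style template that Qwen-family instruct models typically use (the exact template shipped with this checkpoint may differ; the messages are made up for illustration):

```python
def render_chatml(messages):
    """Render chat messages in a ChatML-style layout (Qwen-family convention)."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Open an assistant turn so the model continues from here (generation prompt).
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml([
    {"role": "system", "content": "Reason step by step before answering."},
    {"role": "user", "content": "If a train travels 60 km in 45 minutes, what is its speed in km/h?"},
])
```

A system instruction like the one above is a common way to elicit the step-by-step CoT behavior this model was preference-tuned toward.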

Training Details

The model was trained for one epoch with a learning rate of 3e-06, a DPO beta of 0.2, and a maximum sequence length of 1024 tokens, using the u-10bei/dpo-dataset-qwen-cot dataset.
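The DPO objective behind these hyperparameters can be sketched in a few lines. Given policy and reference log-probabilities for a chosen and a rejected response, the per-pair loss is -log σ(β · [(log π_chosen − log π_ref_chosen) − (log π_rejected − log π_ref_rejected)]). Only β = 0.2 comes from this card; the log-probability values below are illustrative:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.2):
    """Direct Preference Optimization loss for a single preference pair."""
    # Implicit rewards: how much more the policy likes each response
    # than the frozen reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)): shrinks as the policy separates chosen from rejected.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative numbers, not taken from the actual training run.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0, beta=0.2)
```

A small beta such as 0.2 softens the implicit reward margin, keeping the policy closer to the reference model while still pushing it toward the preferred CoT-style responses.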

Good For

  • Applications requiring improved logical reasoning and step-by-step thought processes.
  • Generating structured and high-quality responses that adhere to specific formats or preferences.
  • Tasks where alignment with preferred outputs is critical for performance.